


Published: Jan 31, 2019
Last Updated: Apr 6, 2021

Part of the Job

We are not just janitors / code monkeys

We’re (often) the point of contact for any problems

Speaker Notes

We need to do more than talk: we need to communicate with each other and listen to each other’s complaints and problems. You can only control half of the equation, but you can have a positive impact on the balance.

You’re also often the Tier 1 response to issues. Even when you’re stressed, remember that users often just want to feel heard: that they have told someone and that their complaint has been acknowledged. Listening can help defuse bad situations where people are upset and feelings are hurt.

SLAs vs SLOs

EU SLOs (not SLAs)*

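| Service | SLO   | Permitted Outage (30d) | Critical | User-Facing? |
|---------|-------|------------------------|----------|--------------|
| Haproxy | 99.9% | 43 minutes             | Yes      | Yes          |
| Cluster | 95.0% | 36 hours               | Yes      | No           |
| Sentry  | 50.0% | 15 days                | No       | No           |
| Jenkins | 50.0% | 15 days                | No       | Yes-ish      |
| Grafana | 50.0% | 15 days                | No       | Yes-ish      |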


* SLA = Service Level Agreement, SLO = Service Level Objective


Speaker Notes

Service Level Objectives are just your goals, not a legal agreement with your users. An SLO is a number that you plan to hit and that you can share with your users to give them an idea of what to expect.
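The "Permitted Outage (30d)" column in the table above follows directly from the SLO: it is the share of the 30-day window during which the service may be down. A minimal sketch of that arithmetic, assuming a fixed 30-day window; the function name `permitted_outage` is illustrative and is not part of any tooling mentioned in these slides.

```python
from datetime import timedelta


def permitted_outage(slo_percent: float, window_days: int = 30) -> timedelta:
    """Return the downtime allowed in the window while still meeting the SLO.

    slo_percent is the availability target, e.g. 99.9 for "99.9%".
    window_days is an assumption here: the table uses a 30-day window.
    """
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    window = timedelta(days=window_days)
    # The allowed downtime is the complement of the availability target.
    return window * (1 - slo_percent / 100)


if __name__ == "__main__":
    for service, slo in [("Haproxy", 99.9), ("Cluster", 95.0), ("Sentry", 50.0)]:
        print(service, permitted_outage(slo))
```

Under these assumptions, 99.9% works out to roughly 43 minutes, 95.0% to 36 hours, and 50.0% to 15 days, matching the table.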

Bad things happen

Speaker Notes

Reliability engineering is difficult


Communication Communication Communication

Speaker Notes

Common advice: respond calmly. Do not reply to an upsetting email immediately; come back to it in a few hours.

Shared Responsibility

Documentation Documentation Documentation

Speaker Notes

Have you ever had to debug something that was broken, late at night? And it was someone else’s fault, and it didn’t behave how it was documented? And you were up late swearing up and down at this other person for their bad code that was causing you to lose sleep?

We all do it to each other; we just have to try to be better. It’s an uphill battle, and it’s difficult every day to write that documentation and force yourself into these good habits.

Small Changes, Big Impacts

[Figures: EU user registrations per week; EU jobs per week]

Be careful with what changes you make in production; all of your users depend on you to be able to do their work

Speaker Notes

Any change you make to a production system is multiplied by the number of active users. As of this writing, EU has ~800 monthly active users. Even small changes affect everyone, so be careful and have a dev environment to deploy to first.

As an example, we had a partially working system: it was working for 95% of our users on Friday night. One of our admins made a change in production to try to fix the remaining 5%, but instead it took the system down for everyone. This is an illuminating example of why to be really careful. Not only was the service now offline for everyone (unlike the 95% functional case, which we could have left until Monday morning), but the admin also had to spend another hour reverting the changes.
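To put rough numbers on this example, here is a small sketch. It assumes the ~800 monthly active users quoted above also applied at the time of the incident (the notes do not say so) and that the percentages refer to shares of users. The function name `affected_users` is illustrative.

```python
def affected_users(active_users: int, broken_fraction: float) -> int:
    """Estimate how many users a fault touches, rounded to whole users."""
    if not 0.0 <= broken_fraction <= 1.0:
        raise ValueError("broken_fraction must be between 0 and 1")
    return round(active_users * broken_fraction)


# Assumption: the ~800 monthly active users mentioned in the notes.
ACTIVE_USERS = 800

before_fix = affected_users(ACTIVE_USERS, 0.05)  # partially working system
after_fix = affected_users(ACTIVE_USERS, 1.00)   # system down for everyone

if __name__ == "__main__":
    print(f"Affected before the change: {before_fix}")
    print(f"Affected after the change:  {after_fix}")
```

Under these assumptions, the attempted fix turned a problem for roughly 40 people into an outage for roughly 800.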

Self Care

Take care of yourself! Being a sysadmin is a difficult and thankless job. Do not do it at the expense of your mental or physical health:

| Area | Suggestions |
|------|-------------|
| Mental Health | Take breaks, walk around, clear your head. Do not skip lunch with coworkers; join them. |
| External Environment | See sunlight regularly. Living in caves is bad for you. |
| Office Environment | Request a standing desk, or use an exercise ball in lieu of a seat. |
| Stress | You (probably) work in an academic environment. If the service is down, it’s down. You have SLOs and not SLAs, and if you miss them one month, it is OK. Respect your work-life balance. |

Speaker Notes

Your health is more important than the health of the servers.

Take Home

Explicit Goals:

  1. You’re happy
  2. Your users are happy

It is not a zero-sum game: both of these can be true (or true enough), and both are worth working towards.


Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!

Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.