Tom Limoncelli (yes, that Tom) recently wrote a blog post that came to my attention by way of Twitter in which he lamented his bank's scheduled downtime and the implications of routine "weekend work" in terms of an organization's respect for the time and work-life balance of its sysadmin staff.
This was posted in the "Rants" section of his blog and is obviously meant to be taken as slightly tongue-in-cheek alongside the idea that every sysadmin in geekdom's creation would really rather be watching the Star Wars movie, but it's broadly representative of an attitude I've seen emerging more and more in our profession: That sysadmin work should be viewed as a 9-to-5 gig. I in turn ranted a little bit about that on Twitter, but I think it merits following up with a longer form discussion, so let's have a blog post before the end of the year!
First off I want to say that at the most basic level I assume Tom and I are in full agreement on two key points (and if we're not I hope he'll correct my assumption):
- Downtime should not be a regular occurrence
Most companies can do simple maintenance without taking outages, or even having the risk of an outage, but banks (terrible examples of everything that they are) are constantly having outage windows. My bank's online banking services are routinely down or degraded on almost every weekend, and I think it's fucking ridiculous in the 21st century.
- Springing nights-and-weekends work on your sysadmin staff is disrespectful and shortsighted
This is a corollary to the first point: If you can do most of your work without needing (or risking) an outage that means you can do most of it during regular business hours. Any nights-and-weekends work that's required should be scheduled out well in advance so as not to screw up your staff's out-of-hours plans too much: You're already taking what should be their time, so try not to be too much of a dick about it!
Where I diverge from Tom's rant, and the view of many of my peers in the sysadmin world, is that many system administrators seem to expect their job to be a "regular 9-to-5", without the need to put in work outside of normal business hours. This manifests itself in many ways: People are horrified to learn that I patch our production systems at 5:30 PM on a Friday by choice and wonder aghast if I've not heard of the "Read-Only Fridays" rule. Colleagues loudly proclaim that they would never do a deployment on a weekend. Blog posts like Tom's imply that to schedule such work on a weekend is nothing short of a slap in the face to the entire IT staff.
Frankly folks, I'm calling bullshit. As a sysadmin I fully expect to have to work nights and weekends at least some of the time.
The simple fact is that system administration is providing an infrastructure service: Much like the water and power company we're the ones who make sure all the plumbing and wiring is in good shape so the day-to-day work can get done. Also much like those municipal utilities we have two sorts of customers: The end users (roughly equivalent to you and me turning on the lights or water at home) and the developers (the municipality itself - providing water to the fire hydrants and electricity to the street lights).
We need to keep our customers happy, which means making sure the systems we're responsible for keep chugging along smoothly 24x7x365: When someone turns on the tap at home they damn well expect cool clear fresh water to come out of it, and when someone visits their bank website they similarly expect their account information to flow to their computer.
Now obviously no system can have 100% availability: Even your municipal water and power systems have maintenance outages! Accordingly a key part of keeping our users happy is scheduling work at a time when it will be least inconvenient for them. The water company flushes the mains in residential areas in the middle of the day (when most people are at work and won't notice the water pressure drop at home), and flushes commercial areas in the evening (after most people have knocked off and gone home).
What does that have to do with system administration? Plenty!
The principle is that work is scheduled when it will impact the fewest users. For a bank that means "Weekends". (We live in a 24x7x365 world, but banking still takes place on the old banker's clock: Transaction volume drops off sharply on the weekends, so if information on the website is a few hours old or an ACH payment doesn't get issued until Monday it's OK: None of that activity was going to settle until Monday anyway!)
For Netflix it probably means "In the middle of the day on a weekday" because most folks are at work and not on their couch streaming movies.
Every company should know its users, and its usage patterns. For my company I can tell you with absolute certainty that weekends are our lowest volume periods, as this delightful unitless graph will attest:
Note the valley at the end of each week: So when we're doing something that may result in an outage or significantly degraded performance we do it on the weekends. If that means we're not observing Read-Only Friday for a given week or that I might have to be working for the majority of a weekend, well that's just part of the job!
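If you'd rather not eyeball a graph, the same "find the valley" idea can be sketched in a few lines of Python. This is only a rough illustration, not how we actually do it: the function names (`quietest_window`, `describe`) are made up for this post, the input is assumed to be one request count per hour of the week starting Monday at 00:00, and the sample numbers are invented.

```python
from typing import Sequence, Tuple

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def quietest_window(hourly_counts: Sequence[int], window_hours: int) -> Tuple[int, int]:
    """Return (start_hour, total_requests) for the quietest contiguous window.

    hourly_counts is assumed to hold one value per hour of the week,
    starting Monday 00:00. The week is treated as circular, so a window
    may wrap from Sunday night into Monday morning.
    """
    n = len(hourly_counts)
    if not 0 < window_hours <= n:
        raise ValueError("window_hours must be between 1 and len(hourly_counts)")

    # Total traffic for the window that starts at hour 0.
    total = sum(hourly_counts[:window_hours])
    best_start, best_total = 0, total

    for start in range(1, n):
        # Slide the window one hour: drop the hour that left, add the one that entered.
        total -= hourly_counts[start - 1]
        total += hourly_counts[(start + window_hours - 1) % n]
        if total < best_total:
            best_start, best_total = start, total

    return best_start, best_total


def describe(start_hour: int) -> str:
    """Turn an hour-of-week index into something like 'Sat 02:00'."""
    return f"{DAYS[start_hour // 24]} {start_hour % 24:02d}:00"


# Invented sample data: busy weekdays, quiet weekend, very quiet early Saturday.
sample = [100] * 120 + [20] * 48
sample[122:126] = [5, 5, 5, 5]

start, total = quietest_window(sample, window_hours=4)
print(f"Quietest 4-hour window starts {describe(start)} ({total} requests)")
```

The lowest-traffic window is only a starting point, of course: it tells you when the fewest users will notice, not whether that time works for the people who have to do the work.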
As Mark pointed out on Twitter, "we do it right if we act like we're in the hospitality industry." Good sysadmins will go out of their way to accommodate their users and ensure that those users' needs are met, even if it means occasionally working on a weekend.
Now there's a follow-up to everything I just said about how we need to bend over backwards to accommodate our users: The company needs to respect its staff's time, as Tom rightly pointed out.
We don't expect production systems to be sacrosanct during business hours: If a change cannot cause an outage then by all means do it during business hours when everyone is here to help if something goes wrong. As long as the users won't be affected everything is fair game, and well over 90% of our routine maintenance, patching, and deployments can be done during business hours.
When we must schedule work on a weekend it is just that: Scheduled. It's known 2-4 weeks in advance at a minimum, and we make sure that it's as convenient as possible for every employee who will need to be available in order to complete the planned work. If something comes up and the scheduled date is no longer convenient we will postpone the work, even if that means contacting end users and telling them the date has been changed.
Similarly when folks have to spend the entire weekend working on our environment we recognize that they're making a personal sacrifice: The company has taken more of their time than originally contracted for, and we owe them something in return - usually comp time to be taken at the employee's leisure: Make next weekend a three or four-day weekend, or skip Monday so you have a day to decompress.
Sysadmins must respect the fact that we are ultimately providing services to our users and strive to minimize the impact to the user in everything we do, but companies must recognize the fact that their sysadmin staff is a critical resource in being able to provide those services. The staff's personal time, especially weekends and holidays, must be respected and prioritized the same way the end user's experience is.