Technical debt is a commonly used metaphor in IT. It refers to shortcuts that make work more difficult in the long term. While we hear a lot about it in software development, it’s rare to find someone talking about sysadmin technical debt. Few realize that its sources can be drastically different from those of the more “popular” software technical debt.
The difference pops up because system administration work is different from programming. In programming, you’re making something, adding a feature, or fixing a bug. In administration, you mix already available tools and hardware in order to create systems that satisfy requirements. Also, sysadmins are expected to keep those systems running indefinitely. Not really forever, but usually at least for a long undefined period of time.
Preferably, your systems should never, ever, and not even then go offline. That makes production problems quite stressful and hard to fix. And unlike in programming, the quick and dirty fix that keeps things online is harder to clean up later. Playing around with production tends to break production. On top of that, many changes require an app or server restart, which for production usually means scheduling downtime and notifying people. This sort of social activity is generally frowned upon by sysadmins, so either the cleanup never happens, or sysadmins wait for the next big outage and use it as an opportunity to play around with production.
Major pain points of sysadmin technical debt
All technical debt is bad, and sysadmin technical debt is no different. Below are a few cultural and organizational practices that massively increase technical debt. You probably have a lot of sysadmin technical debt if you constantly have to poke sysadmins to get anything done.
A snowflake system is a system that exists as a “one-off” without easy reproducibility. It’s often hard to change and easy to break. The name comes from “special snowflake”, generally meaning an easily upset and delusional person. When no one really knows how to change a system, and when that system breaks on every change, we call it a “snowflake”.
Because of their fragility, snowflake systems break easily, and for the same reason, every hotfix is hard to turn into a proper fix later on. They simply end up as a house of cards, with years of hotfixes that somehow keep everything together, as long as you don’t touch it, look at it, or walk too loudly next to its server. There are two major ways of addressing such systems: making them reproducible and remaking them in a more stable environment.
- Reproducibility will allow you to redeploy the previous “working” version. This means that you will be free to change and experiment, because you will have a rollback in case things go bad. The current hip way to deal with this is to use Infrastructure as Code (IaC) to automate deployments. The actual implementation can be through GitOps, Ansible, Puppet, or any other IaC tool.
- Stabilizing an environment may end up being hard or even impossible. Let’s face it: some software is just too ancient or messy to be maintainable without years of expertise. Mail servers and LDAP directories are notorious for their complexity. At the very least, ensure that the system gets documented, including design decisions, tradeoff justifications, and hotfixes. Then slowly move it towards a cleaner state whenever possible.
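To make the rollback idea concrete, here is a minimal sketch in Python. It is not how Ansible, Puppet, or any GitOps tool works internally. The `ReleaseHistory` class and the version labels are hypothetical, and a real setup would store commit hashes or tagged releases of your IaC repository instead of plain strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReleaseHistory:
    """Hypothetical record of deployed versions, oldest first."""
    releases: List[str] = field(default_factory=list)

    def deploy(self, version: str) -> str:
        # Record the new version as current; older ones stay available.
        self.releases.append(version)
        return version

    @property
    def current(self) -> Optional[str]:
        # The most recently deployed version, or None if nothing is deployed.
        return self.releases[-1] if self.releases else None

    def rollback(self) -> str:
        # Drop the current release and fall back to the one before it.
        if len(self.releases) < 2:
            raise RuntimeError("no previous release to roll back to")
        self.releases.pop()
        return self.releases[-1]


history = ReleaseHistory()
history.deploy("v1")         # known-good state
history.deploy("v2")         # experimental change
restored = history.rollback()  # things went bad, back to "v1"
```

The point is only that every deployed state has a name you can return to. Once that is true, experimenting on a snowflake stops being a one-way trip.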
Extreme territorialism is commonly expressed by sysadmins jealously restricting access to the servers and services that they maintain. This is quite common in sysadmin cultures that have a long tradition of one-man sysadmin teams. The same goes for organizations that never really considered upgrading from an “IT Crowd”-like setup of a few admins in a basement.
In the short term, extreme territorialism impairs the ability of sysadmins to learn from each other. In the long term, it creates technological bubbles that evolve completely separately from each other. In extreme cases, whole projects and services are dependent on the time available to a single person. This puts the organization at risk when that individual becomes unavailable due to other responsibilities, illness, or a job change.
Never shutting down legacy systems
Never shutting down legacy systems will slowly erode your admins’ free time. Any new request will take longer to start, so sysadmins end up with an ever-growing to-do list. Having more and more systems that are never questioned is a sure recipe for accruing sysadmin technical debt. While most systems are stable enough to work unattended for 5-10 years, eventually those forgotten servers and VMs will require maintenance. And all along, they will use electricity and other resources, and will need occasional hardware checkups and replacements.
From the sysadmin perspective, there is no such thing as a useless service. Sysadmins are almost never the intended users or stakeholders of a service, so they assume it needs to keep running. But sometimes the cost of maintaining such an application is much greater than the cost of getting the few people who use it to move to something more modern. Don’t get me wrong, there are legitimate reasons for running 5, 10, or 15-year-old legacy systems, but it’s dangerous and costly to just assume that such systems must run without giving it a second thought.
The best approach would be to have annual “should this live” reviews, followed by upgrading, migrating, archiving, or shutting down unwanted services. Alternatively, when defining new projects and services, also specify an estimated time to live, after which their usefulness should be reconsidered.
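A “should this live” review can start as something very small. The sketch below assumes you keep a simple inventory of services with an owner and the date of the last review. The `Service` fields, the 365-day default, and the example entries are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Service:
    name: str
    owner: str
    last_reviewed: date
    review_after_days: int = 365  # assumed annual review cycle


def due_for_review(services, today):
    """Return services whose last review is older than their review window."""
    return [
        s for s in services
        if today - s.last_reviewed > timedelta(days=s.review_after_days)
    ]


# Hypothetical inventory; in practice this might live in a CMDB or a YAML file.
inventory = [
    Service("legacy-wiki", "alice", date(2019, 3, 1)),
    Service("build-server", "bob", date(2024, 1, 15)),
]

for service in due_for_review(inventory, today=date(2024, 6, 1)):
    print(f"{service.name}: ask {service.owner} whether this should still live")
```

Setting `review_after_days` per service is one way to express the “estimated time to live” idea: a short-lived project gets a short window, a core system gets a long one.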
One man – whole IT department myth
We want to feel like heroes. We want to be like Dennis Ritchie, who, while working on Unix with Ken Thompson, created the C language to make Unix easier to write. But sadly, heroes are hard to find and impossible to replace. If you are the one hero admin in your organization, you are also the biggest threat to the stability and progress of your systems. Being the only one also means that you can never truly relax on vacation or afford to get ill, because the systems might go down. And if you do, uptime or project progress will suffer.
In an ideal world, you should strive to have at least three people with login access to each system, along with the ability to do basic diagnostics, a safe restart, or minor common tweaks. For extra credit, make sure that those three people never travel together on the same plane.
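If you already have a list of who can log in where, checking the “three people per system” rule takes a few lines of Python. This is only a sketch: the `access` mapping and the names in it are hypothetical, and in real life you would build it from your LDAP groups, sudoers files, or SSH key inventory.

```python
MIN_ADMINS = 3  # threshold taken from the rule of thumb above


def understaffed_systems(access, minimum=MIN_ADMINS):
    """Return systems with fewer than `minimum` distinct people with login access."""
    return {
        system: sorted(set(admins))
        for system, admins in access.items()
        if len(set(admins)) < minimum
    }


# Hypothetical access inventory: system name -> people who can log in.
access = {
    "mail": ["alice"],
    "ldap": ["alice", "bob"],
    "web": ["alice", "bob", "carol"],
}

for system, admins in understaffed_systems(access).items():
    print(f"{system}: only {len(admins)} admin(s): {', '.join(admins)}")
```

Login access alone does not prove that someone can actually diagnose or restart the system, so treat the result as a list of places to start sharing knowledge, not as proof that the rest is fine.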
Want to learn more about technical debt?
This article on sysadmin technical debt is just one in a series of articles on the topic.