What can increase DevOps technical debt

20/06/2021, 08/11/2021

Technical debt is a topic that pops up often in the IT world. Simply stated, it’s a metaphor for all the shortcuts that you take during development that later on hurt your ability to quickly produce or modify code. Often we hear about it in the context of DevOps and CI/CD. So what is DevOps technical debt?

DevOps is significantly different from programming or system administration. In general, DevOps works on optimizing and automating processes inside the company. You might disagree with this and say things like “DevOps sets up pipelines and automation and stuff”, but in reality all a DevOps engineer does is ask: “Say, how about we stop mucking around with these manual builds and tests, and just save the code and let the automation take over?” The result might be the same, but it is the intent that makes the difference. At least, it is the difference between a successful DevOps and one who merely follows online guides hoping for the best.

For this very reason, technical debt from a DevOps perspective is different from the more classical programming technical debt. While this is not a complete list of all the ways you can accumulate technical debt by doing DevOps work, it should be a good starting point for optimizing the work of DevOps inside your organization.

What increases DevOps technical debt?

Set & Forget fallacy

One of the biggest misconceptions with automation and CI/CD pipelines is that you can set them up and completely forget about them. This can sound very counterintuitive. You have a system that does something, and if nothing changes one year from now it should still behave the same. Right?

Well, not really. The first issue with this statement is that we assume “nothing changes” while also mentioning the passing of time. Now I don’t know about you, but for me, time definitely passes and irrevocably changes things. The codebase changes. Third-party dependencies (packages, libraries, modules, supporting applications) change or disappear. APIs, developers, and development platforms are slowly tweaked and replaced.

Anyhow, the point is that no matter how careful you are, the world will still move on with time. Eventually, your systems will need to be tweaked to match reality. Besides reality, your users’ needs and habits will change, and automation has to follow the stakeholders’ needs.

And by your users, I mean the development teams directly and indirectly supported by DevOps. The only real measure of DevOps is the satisfaction of the teams that depend on it. Fancy terms, buzzwords, and sales points are pointless if you’re not satisfying the needs of your teams. Your job is to save their time, brainpower, and nerves through automation.

No matter how hard you try to keep the status quo, things will slowly and imperceptibly change. Eventually those changes will accumulate to the point where the pipelines fail or behave unpredictably.
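One way to keep a pipeline from being quietly forgotten is to record when each of its parts was last looked at and flag anything that has gone too long without a review. The following is only a minimal sketch of that idea: the component names, the dates, and the 180-day threshold are all made up for illustration and are not taken from any particular tool.

```python
from datetime import date, timedelta

# Hypothetical inventory: pipeline component -> date it was last reviewed.
LAST_REVIEWED = {
    "base-image": date(2021, 1, 10),
    "build-script": date(2021, 5, 2),
    "deploy-credentials": date(2020, 6, 30),
}


def stale_components(last_reviewed, today, max_age_days=180):
    """Return names of components not reviewed within max_age_days, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    stale = [(when, name) for name, when in last_reviewed.items() if when < cutoff]
    return [name for when, name in sorted(stale)]
```

Calling `stale_components(LAST_REVIEWED, today=date.today())` from a scheduled job and posting the result to the team would be one possible way to turn “set and forget” into “set and revisit”.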

Deploying only complete automation

Another case of misguided DevOps effort comes from thinking about CI/CD pipelines only in their final form: not considering partial automation, and waiting until every part is automated before rolling anything out. This is fine for new projects, as the structure often evolves with the needs of the project. But it can be very harmful when you’re first starting to introduce DevOps practices in a company.

The waiting can seriously harm developer adoption. Automation becomes liberating at the point when developers stop consciously thinking about automated processes and simply push their code and move on to the next task. For this they don’t need a complete pipeline; automating any part of the system can give them a “you do your part, and automation will take over” feeling. If it takes a very long time before they start “tasting” the benefits of DevOps, they will become resistant to the concept in general. The same goes if they have to do everything by hand because of a broken pipeline: they will simply start avoiding and resisting any further automation.

Overcomplicated pipelines

In general, pipelines, helper scripts, and templates should be as simple as possible so that developers can easily edit them themselves. This also means that we shouldn’t optimize them for the highest possible coding standard, but for ease of access. After all, they will be opened by cranky, angry coders who just want to fix their suddenly broken pipeline and go back to their own work and worries.

In the late stage of CI/CD adoption, developers fully depend on automation to do everything correctly. This means that you might have teams that are not aware of the details of the build and deploy processes driving their projects. For this reason, having one simple file with a few helper scripts, which can easily be copied or run by hand, is the simplest way to involve your team in the DevOps work.
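As an illustration of what “one simple file” could look like, here is a small sketch in which the whole pipeline is an ordered list of commands that can be run on any machine. The step names and the placeholder commands are hypothetical; a real project would list its own build, test, and deploy commands instead.

```python
import subprocess
import sys

# Hypothetical steps; each command here is only a placeholder that prints a message.
STEPS = [
    ("build", [sys.executable, "-c", "print('building')"]),
    ("test", [sys.executable, "-c", "print('testing')"]),
    ("deploy", [sys.executable, "-c", "print('deploying')"]),
]


def run_pipeline(steps):
    """Run steps in order; return the name of the first failing step, or None."""
    for name, command in steps:
        print(f"--- {name}: {' '.join(command)}")
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"step '{name}' failed with exit code {result.returncode}")
            return name
    return None
```

Because the steps are plain commands printed before they run, a developer who calls `run_pipeline(STEPS)` locally can see exactly what the automation does and copy any single command into a terminal when something breaks.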

Too many moving parts increases DevOps technical debt

Automation with too many small steps can make your life very difficult. Often the symptom of this is an overzealous microservice sprawl. The logic goes like this: “each microservice does only one thing, therefore for each thing we need a separate microservice”. You end up with 10-15 steps just to build and test your app, each starting, ending, and handing off to the next one. This is all fine and dandy, until something fails and you have to dig down to find where the problem is.

This is a common sort of conflict between programmers. Is generating fake SSL certificates one step/microservice? Or is generating CA, signing the cert, CSR, key and bundle all separate steps/microservices? This kind of behavior can very quickly turn your pipelines into an unmaintainable mess.

It’s a good policy to periodically review all the CI/CD pipeline steps in order to consolidate and remove unnecessary steps. This can make your pipelines faster as well as reduce the number of steps that have to be maintained.
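A periodic review like this can be partly mechanical. The sketch below assumes you have per-step timings available (here they are invented numbers, with step names borrowed from the certificate example) and groups runs of consecutive very short steps, which are often the first candidates for merging into a single step. The 10-second threshold is an arbitrary assumption.

```python
# Hypothetical timings in seconds per pipeline step, e.g. copied from CI job logs.
STEP_SECONDS = [
    ("generate-ca", 4),
    ("generate-key", 3),
    ("generate-csr", 3),
    ("sign-cert", 5),
    ("build", 240),
    ("unit-tests", 180),
]


def merge_candidates(step_seconds, threshold=10):
    """Group runs of two or more consecutive steps that each take under `threshold` seconds."""
    groups, current = [], []
    for name, seconds in step_seconds:
        if seconds < threshold:
            current.append(name)
        else:
            if len(current) > 1:
                groups.append(current)
            current = []
    if len(current) > 1:
        groups.append(current)
    return groups
```

With the sample data, `merge_candidates(STEP_SECONDS)` returns one group containing the four certificate steps. Whether they should actually be merged is still a judgment call for the team; the script only points at where to look.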

What happens when automation breaks

Every project and deployment system should have a plan B. Like every IT service, CI/CD pipelines will occasionally fail, and this can seriously disrupt the workflow. To limit the damage, the pipeline should be easy to replicate by hand. This makes it easy to temporarily work around any disruption caused by server and pipeline issues.

Besides the regular “who will fix it?” question, we also have to think about who will do the work by hand while we wait. It’s essential to always have team members who can take over when automation fails. To achieve this, you have to get them to agree on who’s doing what, and get them to occasionally do it by hand. Also, don’t automatically expect that this will be done by DevOps. This sort of outage usually happens when DevOps is on holiday or sick leave. Anyway, DevOps is supposed to be fixing the pipeline, not mucking about pretending to be the pipeline.

When teams start trusting automation, it’s important to know the limits of that trust. Like anything in IT, pipelines will occasionally fail, and when they do you should have a backup to take over. This can take the form of a simplified system, a bunch of manual scripts, or an everything-done-by-hand approach. It is important to agree in advance on who will do what, and to occasionally practice doing everything by hand.

Automating an old system vs. designing the system from zero

In many projects, the build & deploy process is a patchwork of habit, circumstance, and rituals that work for now. It is overseen by high priests to whom code and sacrifice are presented by lowly servants with trembling hands. Calling the high priest a CTO, senior dev, or administrator doesn’t change the reality: this person does a few things in precisely the right order, and all is fine. The things are not that hard in themselves, but mucking them up can take a long time to fix.

Most developers are reluctant to touch or change the build and deploy system. In longer projects, they just add stuff when needed, and rarely consider refactoring or removing superfluous parts. This means that when you start automating an older project, there is an opportunity to cut away and simplify the processes. Don’t get me wrong, replicating the current process is the first step. But with some extra effort, you can also significantly increase the speed and robustness of the process.

Every project a snowflake

Each project under automation is another project on the maintenance list of the DevOps team. While it’s perfectly fine to tweak the CI/CD to fit the project needs, you should try to consolidate and standardize as much as possible. The best way to do that is to provide an easy way to just turn on the “standard procedure”.

Besides reducing deployment and maintenance effort, standardization also helps increase developer awareness. First, developers will know what to expect from the system, which means that each new project comes with practically no learning curve. Second, if they understand how one pipeline works under the hood, they will understand how the pipelines of all standardized projects work. This means that they will be more likely to tweak things themselves, or at least be able to communicate their needs more directly.
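One possible shape for such a “standard procedure” is a shared set of defaults that each project can override only where it genuinely differs. The sketch below is an assumption about how this could be modeled; the setting names and values are hypothetical and not tied to any specific CI system.

```python
# Hypothetical organization-wide defaults applied to every pipeline.
STANDARD = {
    "steps": ["lint", "build", "test", "deploy"],
    "timeout_minutes": 30,
    "notify": "team-channel",
}


def pipeline_config(overrides=None):
    """Return the standard config with project-specific overrides applied on top."""
    overrides = overrides or {}
    unknown = set(overrides) - set(STANDARD)
    if unknown:
        # Rejecting unknown keys keeps projects from silently drifting into snowflakes.
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    return {**STANDARD, **overrides}
```

A project that needs nothing special calls `pipeline_config()` and gets the standard setup, while `pipeline_config({"timeout_minutes": 60})` changes one value and leaves everything else alone. The small override dictionary then documents exactly how that project deviates from the norm.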

Fear of continuous deployment

It is very common for teams to fear using a completely automated deployment system. Usually this shows up either as retaining a partially manual deployment, or as asking for an additional sign-off or checkmark before deploying to production. While this might sound sensible (after all, no one wants unchecked buggy code to end up in production), it’s a misguided effort, as it guards against an unlikely scenario (for most teams at least).

Automation should make it easy to deploy stuff. Ideally it should also test and check in order to catch any issues before they get to production. But since it’s unavoidable to sometimes push buggy code, the automation should also provide an easy way to roll back to an older known-good version of the app.

Instead of gatekeeping, time should be invested in a robust rollback feature. This way you can easily undo the changes if bad code reaches production. If you think about it, you are wasting countless hours of your team’s time in order to prevent a once-in-a-few-years service disruption. And the worst part is that a simple rollback would reduce that disruption to mere minutes, giving you ample time to fix the code before pushing it again. Charity.wtf wrote a great blog post on this topic.
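To make the rollback idea concrete, here is a deliberately tiny sketch of the bookkeeping behind it: remember what was deployed, and when something goes wrong, make the previous version live again. The `Releases` class and its methods are hypothetical, and a real system would also have to redeploy the artifact and deal with things like database migrations.

```python
class Releases:
    """Tracks deployed versions so the previous known-good one can be restored."""

    def __init__(self):
        self.history = []  # oldest first; the last item is the live version

    def deploy(self, version):
        """Record `version` as the new live version."""
        self.history.append(version)
        return version

    def rollback(self):
        """Drop the live version and return the one that becomes live again."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()
        return self.history[-1]

    @property
    def live(self):
        """The currently live version, or None if nothing has been deployed."""
        return self.history[-1] if self.history else None
```

After deploying "1.0" and then "1.1", a call to `rollback()` makes "1.0" live again. The point of the sketch is that undoing a release should be a single, well-rehearsed operation rather than an improvised rescue.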

Want to learn more about technical debt?

This article on sysadmin technical debt is just one in the series of articles on the topic. Check out the rest of the series: