I don’t know about you, but over the past year I’ve really started to feel like software quality is gradually degrading. Whether it’s Cloudflare’s semi-regular outages, AWS going down for almost a day last October, or Microsoft’s botched Windows 11 update, it feels like high-profile mistakes are becoming more and more commonplace.
Whilst they are impactful, these world-changing blunders really are just the tip of the iceberg.
Major failures in software development don’t come from nowhere. They’re often the result of many minute problems building up over a long period of time - placing ever more delivery pressure onto those at the coalface in the form of tiny inconveniences and compounding workarounds. Eventually, enough corners are cut and processes circumvented that something gives.
This is a blog about how systems die[1], what circumstances lead up to it happening, and the wider climate which software engineers currently find themselves in.
The days more gray each one than what had gone before
When Elon Musk burst through the doors of Twitter’s head office in October 2022 brandishing a bathroom sink, the narrative for developers changed. Within weeks, he’d laid off roughly half of Twitter’s entire workforce, a sizeable proportion of which were engineers. The media went into a frenzy, predicting a chaotic meltdown of the platform.
Yet here we are, almost four years on, and Twitter (now X) is still going strong[2].
Many observers saw that a mass culling of staff didn’t result in the company’s imminent implosion and took note. This marked a major change in the way that engineers were viewed by the wider industry.
Prior to 2022, it was always possible to see which projects larger firms like Microsoft were focused on. Whilst many companies allocated their capital to the most important projects, the Magnificent Seven companies allocated their best developers[3]. The likes of Google and Amazon would hire top talent and make a role for them just so that a competitor couldn’t snap them up. High quality engineers were assets.
These days engineers are considered expensive liabilities.
Whilst this downbeat sentiment was seeded via Musk’s acquisition of Twitter, it really began to take hold in early 2023 as LLMs emerged at the forefront of public consciousness. Many technology firms had over-hired during Covid, and the narrative that AI was replacing developers formed a convenient means to not only push their next big platform, but also to reduce their bloated workforces without knocking the share price. The sudden reversal of hiring policy by these behemoths left a wealth of previously-unavailable talent open to work in the wider market, and the prevailing line that AI would replace developers only served to further weaken demand for them[4].
Hiring freezes have become common across the industry in recent years. Whether the reasoning is driven by a genuine belief that AI can replace certain developers, or simply by an opportunity to reduce headcount expenses without impacting investor sentiment, this gradual ratcheting up of pressure and responsibility on developers can be extremely damaging for the systems that they maintain. Unfortunately, software doesn’t fall over on the day that those people walk out the door. It fails in a far more insidious way.
If trouble comes when you least expect it, then maybe the thing to do is to always expect it
When you really think about it, it is rather absurd that anyone would expect a piece of software to self-destruct just because those developing it walk away. By its very nature, half-decent software is designed to keep running over and over, without intervention and without failure[5].
This premise does raise the question… if it runs by itself, what are engineers doing all day long?
If you have a system that rarely fails and is never altered, then running on a bare-bones development team is fine. But most systems are constantly subjected to change and updates. This doesn’t always mean feature requests either. It can come in the form of security fixes, library/framework deprecation, hardware changes, vendor updates, and corporate restructuring. The list is endless. Modern applications are deeply entangled with dozens of external dependencies, each evolving at its own pace.
One clear measurement of a team’s quality is how quickly, safely and seamlessly they can cope with these changes.
The art of software engineering is not just about delivering features, it’s also about doing so in a way that makes future changes faster and easier. For any given update, any developer worth their salt will weigh up their speed of delivery against the impact that change will have on the speed of future deliveries. The never-ending firehose of feature requests rarely runs dry, so the more pressure a developer is under, the more they will deliver code that slows the rate of future change.
Pace now costs you pace later.
Always.
Teams constantly battle to find balance between supporting what already exists and building what comes next, with every single new feature incurring an ongoing support cost. Every feature added. Every change made. Every new application built. How much of a support cost a change carries often depends on how fastidiously it was implemented. Crunching to deliver faster on the current project leads to more maintenance debt when the next one rolls around, which in turn means less time to spend on writing the shiny new stuff.
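If you prefer to see that trade-off as numbers, here is a rough toy model. It is only a sketch: the function name `simulate`, the parameters `upkeep_per_feature` and `speedup`, and every figure in it are made up for illustration rather than measured from any real team. It assumes a fixed amount of effort per cycle, that support for existing work is paid first, and that crunching ships more per unit of effort whilst leaving behind more ongoing support cost.

```python
def simulate(cycles, capacity, upkeep_per_feature, speedup=1.0):
    """Toy model: return the amount of feature work shipped in each cycle.

    All parameters are illustrative assumptions, not measurements.
    """
    upkeep = 0.0  # effort per cycle already committed to supporting old work
    shipped_per_cycle = []
    for _ in range(cycles):
        # Support is paid first; features get whatever effort is left over.
        free = max(capacity - upkeep, 0.0)
        # Crunching (speedup > 1) ships more per unit of effort...
        shipped = free * speedup
        shipped_per_cycle.append(shipped)
        # ...but every unit shipped adds to the ongoing support cost.
        upkeep += shipped * upkeep_per_feature
    return shipped_per_cycle


# Hypothetical teams: same capacity, different habits.
careful = simulate(cycles=10, capacity=10.0, upkeep_per_feature=0.05)
rushed = simulate(cycles=10, capacity=10.0, upkeep_per_feature=0.25, speedup=1.3)

for cycle, (c, r) in enumerate(zip(careful, rushed), start=1):
    print(f"cycle {cycle:2d}: careful {c:5.2f}  rushed {r:5.2f}")
print(f"total: careful {sum(careful):.1f}  rushed {sum(rushed):.1f}")
```

With these made-up numbers the rushed team ships more in the first cycle and less in every cycle after it, and its running total falls behind within a few cycles. Different numbers move the crossover point, but the shape is the same whenever speed is bought with extra support cost.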
If you break little promises, you’ll break big ones
So then, let’s set this ongoing struggle of now-versus-future against the backdrop of hiring freezes and job cuts.
Your team are busy delivering code, on top of your support burden, and have committed to delivering the next big project a few months from now. The company decides to cut some members from the team[6].
In a perfect world, these cuts are taken into account by the project managers and those with the power agree to extend the delivery dates accordingly. In reality this is rarely the case. This can be for a plethora of reasons. Yes, it can be due to an unwillingness to miss deadlines by those that set them. But I’ve just as often seen teams try to hit the original timelines out of fear of failure or simply personal pride.
The deadline stays the same but the team shrinks. Each developer increases their pace, and that brings with it cut corners and future support burden. Maybe the team still hits the deadline, but it’s not in that delivery that the cracks begin to show - the team has slightly less time to deliver the next project because they took on a little too much debt.
These compromises rarely look dramatic in isolation. A refactor gets postponed so that a deadline can be met. An integration test suite is trimmed because it’s slowing down deployments. A dependency upgrade is deferred because the migration will take too long. None of these decisions seem catastrophic on their own, but each one leaves the system slightly harder to change than it was before.
If a team is good, and their conditions favourable, then they’ll be able to right the ship without anyone noticing. But when that team exists in a world which sees engineers as an expense to be minimised, those little hairline cracks begin to compound, each one making another more likely.
These compounding costs can sometimes become so substantial that a team finds themselves overcome with support burden - these teams are commonly referred to as “in crisis”. Their maintenance cost has become so high that they can do nothing else but tend to it. Every day is a struggle to stand still. For those that find themselves in this position, there are only two ways to come back from the brink: outside help or letting things burn whilst they try to erode some of the debt. If a team in crisis is unlucky enough to exist at a time like this, with an emphasis on reducing developer headcount, then it is extremely unlikely that they’ll receive outside help.
This is how software dies.
Slowly.
Incrementally.
From the inside out.
Grinding to a halt with those who work on the system unable to maintain the weight they must carry.
That’s not the whole story though. These cut corners bleed out into the lives of those that depend on the withering system. The UI gets slightly slower than it used to be. The API contract is a little bit harder to work with than it should be. The error codes are less intuitive to diagnose than normal. Every small degradation the system experiences makes it that tiny bit more difficult for teams that depend on the software to deliver what they need to. In turn, those teams have less time to deliver, so they take on more support burden than they should. Now it’s not just one team sliding ever-closer to crisis mode.
In an industry that doesn’t value the engineers that built it - intent on cutting costs and reducing headcount - the infection is not contained to a single company.
The number of these microscopic degradations is too large to count, each one making another more likely.
A thousand tiny cuts[7].
You have to carry the fire
I’m aware that the picture painted in this blog may feel bleak. Yes, it is about the downfall of software systems. But more than that, it’s about the struggle that many engineers face trying to keep those systems afloat against the backdrop of a world that tells them they’re not as valuable as they used to be.
And yet despite all of this, I am prouder today to call myself a software engineer than I have been in a long time.
When I first began writing this blog, I realised that the spark of pride that I felt for my profession had dimmed over recent years. Dimmed by the enshittification, data harvesting and manipulative business practices that have marred our industry. But there’s something deep within my personality that needs a fight. Something to kick back at, an adversary to overcome, or a cause to stand up for.
What better time than now.
Good engineers will always be needed, even if the world doesn’t understand that at the moment.
In the vast majority of cases there are only two ways to scale a business: add people or automate processes. You can cut costs and remove bloat, but that’s not a business strategy, that’s a defence mechanism. Eventually you’re going to run out of road. Perhaps you believe that AGI is just around the corner[8], but until that day comes, there is only one type of person that can automate processes in any scalable way: the software engineer.
Engineers are the heart of what it means to scale a business.
The software they build is the only true way for modern organisations to do more with less.
We’ve already been told plenty of times that engineers will become obsolete: the advent of COBOL, the outsourcing wave, low-code platforms, and many more besides. Each time the same thing happens - those who don’t understand good engineering practices write poor quality systems, then proper developers are called in afterwards to clean up the mess when everyone realises that the magic bullet isn’t so magic after all.
All of these failures have one thing in common: they attempt to make writing code easier. But it’s not about the code; it has never been about the code. What really makes a good engineer valuable is the ability to take the mushy, imprecise nature of the real world, reason about it, and turn it into something wholly logical and methodical that a computer can understand. Not only this, but to do so in a way that can stand up against the ever-changing nature of life in a robust, secure and maintainable way. Code is the main means of doing this right now, but even if that changes, the real skill-set remains the same.
In times like these, the challenge for engineers is not to move faster or carry more, but to be deliberate about what mustn’t be lost. To protect the practices that keep systems understandable, changeable and safe, even when pressure makes them feel optional. Every small compromise compounds, whether we notice it or not. This is where standards matter most. Not because they are efficient, and not because they guarantee success, but because they highlight the things that are too valuable to trade away.
Software rarely collapses in a single dramatic failure.
It decays, one compromise at a time.
The only people standing between those compromises and the systems that run our world are the engineers who refuse to trade tomorrow’s stability for today’s speed.
[1] It’s DNS. It’s always DNS.
[2] I’ll withhold my judgement on the current state of the platform’s content moderation and Grok charade. Regardless of your opinion on those things, it is undeniably true that the platform is still functional.
[3] The CapEx-heavy world of AI data centres has probably changed this somewhat. Even still, it’s notable that many major names like Ilya Sutskever and Yann LeCun have been allowed to leave their respective companies rather than being persuaded to stay.
[4] This footnote was added after I initially published this post, but I’d like to use it to have a little rant about the most recent “we’re laying off staff because AI’s going to make us more efficient” headline.
Jack Dorsey’s Block has just removed 40% of its staff, apparently in anticipation of future efficiency gains from AI, somehow causing the stock to jump by ~23%. Mass job cuts, huge uptick in stock price. I have mixed feelings about his more informal but less impersonal announcement (would it kill you to use some proper punctuation when disrupting the livelihoods of several thousand people!?).
However, it’s the overall message that the cuts are due to AI efficiency gains that I take issue with. The memo itself makes clear that they aren’t experiencing any of those gains in a meaningful way right now; it’s about future potential. That being the case, a healthy company doesn’t cut close to half its workforce in anticipation of being better at some unspecified later date; they track those efficiency gains against how well the company is scaling and reduce staff accordingly. Using, you know, actual efficiency gains… that really exist. To me that whole thing reeks of a struggling company (look at the stock price on the five-year time horizon) covering up a need to downsize using AI buzzwords. The lunacy of the whole situation is that the market bought it and the stock had its best day in history.
[5] There is a very big asterisk here: version updates and deprecations. It’s definitely a topic for another time, but software companies and package owners seem intent on pushing version updates at an ever-increasing pace. This kind of change is forced upon software, and only seems to be getting more frequent. Whilst you could argue that this is the software failing without it being altered, I would see this as it failing due to being forcibly changed by factors outside of its control. There is a reason why COBOL programs running on mainframes are still going strong 30 years on, whereas modern C# applications barely seem to last five years: fast-moving external dependencies.
[6] Cutting a team may sound like the most extreme scenario, but in reality there are usually many other factors that disrupt and cut a team’s expected effective delivery capacity. In a world of hiring freezes, it can simply be the case that employees are moved out of one team to fill a role in another, where normally the company would have hired someone new. It can mean sudden fixed-deadline projects being added to the team’s list of deliveries (such as regulatory requirements). It can just mean each individual within the team having one too many concurrent projects to juggle. In our current “AI-powered” world, it more often than not means requests to do more with less.
[7] After a particularly frustrating run of tiny failures, I decided to keep a list of little annoyances that I came across for a week. I’ve trimmed anything specific to the internals of the company I work for, but even still this was a shockingly long list.
The specifics of the list aren’t all that important, but I put it here just to illustrate how aggressively these little cuts can add up. It’s also worth noting that most of them happened tens of times that week too…
- Teams start-up times seem to get longer and longer. Sometimes as long as 15 minutes for everything to start enough to join a call.
- Filter window disappears the first time you open it in the Azure portal after a short delay. Long enough for you to think it didn’t happen that time, short enough that you can’t click on what you need to in time.
- Delete button for resources in the Azure portal sometimes doesn’t work without closing and re-opening the blade.
- Dell camera drivers lock up the camera and prevent applications from using it. If you try repeatedly to enable the camera when this happens, it blue screens the machine.
- No availability for zone redundant Azure resources in EU North, now little availability in UK South.
- Visual Studio stops letting you open the Solution Explorer without restarting it.
- Replit loses prompts sometimes and never sends them to the agent. They can’t be recalled.
- Teams loses messages sometimes and never sends them. They can’t be recalled.
- When plugged into a monitor for too long, laptop display drivers screw up and random windows intermittently go blurry until unplugged.
- Microsoft’s ecosystem for Azure development is now fragmented across VS 2022, VS 2026 and VSCode. No single IDE properly supports our full working set of languages and versions.
- Security agents occasionally go insane when plugging a USB in, causing the laptop to freeze up and consume 100% CPU until the problem is recognised and the USB device is removed.
- When booting/unsleeping and connected via a VPN, Teams requires three attempts and about 15 minutes of waiting in order to actually log in.
- Physical mute/volume buttons randomly stop working until the machine is restarted.
- Azure portal resource view crashes when deleting resources sometimes, forcing a page refresh which removes any filters.
- Sometimes when deleting an app service with a VNET, that app service doesn’t actually get disconnected from the subnet, forcing it to be recreated and disconnected manually, then deleted.
- Bruno explanation pop-ups won’t close.
- Visual Studio suddenly starts double-entering every character typed.
- Azure VNETs sometimes just don’t clean up properly - you then have to recreate the app service plan tied to the VNET manually, run a separate command to disconnect the two, then delete them again.
- Teams refuses to render in front of other windows.
- Mimecast fails to authenticate constantly.
- Mimecast search doesn’t work and fails with a generic error and no code.
- Chrome’s horizontal scroll bar stops working.
- The search bar in Teams refuses to open.
[8] I very much do not. Feel free to take a look at one of my other blog posts about AI if you’d like to know why. It’s almost a year old now, and I’m pretty confident nothing meaningful has changed.