The Hardest Thing About Software Engineering

(It's not something your AI can do)

Over the past week, I’ve taken down not one, but two production servers. On two entirely different applications!

In both instances, the code changes were dutifully tested. I swear it was not a skill issue. Luckily, the downtime was minimal in both cases, and I was able to quickly roll back the changes. But the sentiment remains—software engineering is hard.

The through-line between these two production server crashes?

Old, unsupported software.


Alright, now. I’m going to make you feel old.

Did you know Rails 6.x is EOL?

Did you know Ubuntu 18.04 is 7 years old?

PostgreSQL 13 was released almost 5 years ago, and it will be EOL 8 months from now.

In the grand scheme of things, I’m still relatively new to this stuff. But still. I remember when Rails 6 came out and was the hot new thing! Now, we’re about to deprecate Rails 7.0!

As my coworker put it:

"Back in my day, it took us a month to upgrade from Rails 4 to Rails 5, and we liked it!"

My Bones Are Made of Glass

What I’ve learned over the years is that the hard thing about software engineering is not

  • Building an MVP

    • Usually writing a green-field app is easy (and pretty fun)

  • Learning programming languages

  • Hell, it’s not even the JavaScript framework + dependency nightmare hell-scape, Webpacker BS.

What’s actually hard in software engineering:

  • Naming things

  • Navigating people + politics

  • Making computers communicate with each other, in a way that both parties understand

    • “I send you these bytes, which mean this — you receive them, and read them, and understand them.”

  • And the more recent one (to me) — Walking carefully around unsupported versions of software

Once a software library (whether it’s a version of a language, a framework, an Ubuntu release, or some other dependency bundled with your software) is >= 5 years old, all hell breaks loose.

It’s basically this quote from SpongeBob:

"I was born with glass bones and paper skin. Every morning I break my legs, and every afternoon I break my arms." - Fish Guy from SpongeBob SquarePants

For example.

Let’s say, hypothetically, that you have a legacy app running on an AWS EC2 t2.medium instance with an Ubuntu 18.04 image (which, again, mind you, is 7 years old).

Over the last couple of months, this instance has been getting closer and closer to running out of CPU credits. It has seen increased usage, and it simply needs more firepower. Maybe it should even be upgraded to a larger instance with non-burstable CPU! Like an m5.large.
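If you wanted to sanity-check that hunch before touching anything, you could pull the CPUCreditBalance metric from CloudWatch. A minimal boto3 sketch, assuming a made-up instance ID:

```python
# Rough sketch: see how close a burstable (t2/t3) instance is to running out
# of CPU credits by pulling the CPUCreditBalance metric from CloudWatch.
from datetime import datetime, timedelta, timezone

import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance ID

cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(days=7)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    StartTime=start,
    EndTime=end,
    Period=3600,              # one data point per hour
    Statistics=["Average"],
)

# Print the hourly averages oldest-first, to eyeball the downward trend.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
```

A balance that keeps sliding toward zero over the week is the tell: the box is living above its baseline and borrowing credits it no longer earns back.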

So you start looking into changing the instance type. Without a whole bunch of extra work, we’re looking at at least a little bit of downtime, as we shut the instance down to change its instance type.

But wait! We need to make sure that this whole setup (the EC2 instance and its attached EBS volumes) is compatible with this new instance type. And, well, it’s not great that this instance is running Ubuntu 18.04.

The most logical thing to do would probably be to upgrade this to, say, Ubuntu 24.04. But doing so would require a non-trivial amount of work, and it would need to be done with the utmost care, as this is a production environment after all.

You’d probably want to spin up a copy of this instance, on which you could test the OS upgrade and then the instance-type change. And we’d have to make sure it’s essentially an exact carbon copy, in order to make the test as realistic as possible.
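Mechanically, that rehearsal is only a handful of EC2 calls. Here’s a rough boto3 sketch; the instance ID and AMI name are made up, and a real version would also match the original’s subnet, security groups, IAM profile, and user data:

```python
# Rough sketch: bake the production box into an AMI, then launch a throwaway
# copy of it to rehearse the Ubuntu upgrade and the instance-type change on.
import boto3

ec2 = boto3.client("ec2")

PROD_INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical production instance

# Create an AMI from the running instance. NoReboot=True avoids downtime,
# at the cost of a slightly less consistent filesystem snapshot.
image = ec2.create_image(
    InstanceId=PROD_INSTANCE_ID,
    Name="legacy-app-rehearsal",
    NoReboot=True,
)
ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

# Launch the carbon copy from that AMI, starting on the same instance type.
test = ec2.run_instances(
    ImageId=image["ImageId"],
    InstanceType="t2.medium",
    MinCount=1,
    MaxCount=1,
)
print("Rehearsal instance:", test["Instances"][0]["InstanceId"])
```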

But wait, how many hours do we have allotted for this, again? What, none??

Okay, so the client is sensitive to cost. Aren’t we all?

So we can’t do the Ubuntu upgrade. But let’s do the best we can given these constraints. We can make an AMI, with EBS volume snapshots, in case anything catastrophic happens and we have to roll back. Then we can stop the instance and change its type to m5.large.
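In boto3 terms, that plan is roughly the sketch below. The instance ID and AMI name are placeholders, and this is the shape of the operation, not a battle-tested script:

```python
# Rough sketch of the constrained plan: AMI for rollback, then stop -> resize -> start.
import boto3

ec2 = boto3.client("ec2")

INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance ID

# 1. Rollback point: an AMI, which also snapshots the attached EBS volumes.
image = ec2.create_image(InstanceId=INSTANCE_ID, Name="pre-resize-backup", NoReboot=True)
ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

# 2. Downtime starts here: the instance has to be stopped to change its type.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

# 3. Swap t2.medium for m5.large.
ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    InstanceType={"Value": "m5.large"},
)

# 4. Bring it back up and wait for the status checks to go green.
ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_status_ok").wait(InstanceIds=[INSTANCE_ID])
```

One compatibility wrinkle worth checking before step 2: m5 is a Nitro-based family, so the guest OS needs working ENA networking and NVMe storage drivers, which is exactly the sort of thing a 7-year-old image may or may not have wired up.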

Now, when we restart the instance, initializing… initializing…

And, 3/3 checks pass! Perfect!

Now, if we can just test the SSH connection… And…

Nothing.

No response. Okay, let’s check the logs. Nope, absolutely nothing there.

But, b-but, AWS says 3/3 checks pass! I see three, count ‘em, three green checks!

This hypothetical scenario teaches us an important lesson. I’ll repeat what I said before—when software is old, all hell breaks loose.

AWS can only cover so many scenarios. I don’t blame their systems for saying all systems were a go when the box was absolutely, 100%, never-even-got-past-boot, busted.
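For what it’s worth, one of the few windows you get into a box that passes its status checks but won’t answer SSH is the EC2 console output, which shows how far boot actually got. A quick boto3 sketch, again with a made-up instance ID:

```python
# Rough sketch: pull the boot-time console output from an instance that
# passes its status checks but won't answer SSH.
import base64

import boto3

ec2 = boto3.client("ec2")

INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance ID

resp = ec2.get_console_output(InstanceId=INSTANCE_ID)

# The API returns the output base64-encoded; it can also be absent entirely
# if the box never produced a boot log.
output = resp.get("Output")
print(base64.b64decode(output).decode("utf-8", "replace") if output else "<no console output>")
```

In the truly cursed cases there’s nothing in there at all, which is its own kind of answer.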


What’re’ya Gonna Do ‘Bout It, Huh?

Come on, Devin, come on, Cursor, come on, GitHub Copilot, enter the arena. I’d like to see you have a go at things. Go get ‘em, tigers!

At the end of the day, I’m bullish on real-life, human software engineers. Because the more lines of code these tools write, the more scenarios like the one above are going to pop up.

And no matter how good these LLM systems get at extracting context from a codebase, they’re gonna be $&%#’ed when it comes to incorporating a third-party contractor’s solution into the codebase, teasing out that contractor’s API structure, and figuring out how to name the damned thing. These are human problems, and they demand human solutions.