Technical Debt Financing

The following is an excerpt from Executive Engineering.

 

Your task is not to foresee the future, but to enable it.

― Antoine de Saint-Exupéry, in The Wisdom of the Sands

 

Technical debt is one of the most powerful tools a CTO has. Your engineers may believe technical debt is bad and holds them back — it’s certainly a cause of stress and a time sink — but I believe it only hurts us as an organization when we miss the larger picture.

The best definition of technical debt is “An obligation for future technical work.” It’s not “bad code” or messy data, though those might comprise it. With that definition, let’s improve on conversations about technical debt which often make two crucial mistakes: They frame the debt as an ergonomic issue that primarily affects engineers (instead of a financial issue that affects the company) and they don’t discuss it in the context of technical investments, interest rates, or ROI.

[Figure: a technical debt portfolio]

It’s not just hygiene

Engineers are the ones who directly experience technical debt but this doesn’t mean it’s an issue affecting engineers; it affects the whole company. Your CFO might express worry about a bad contract the company got itself into but no one would say that that’s a problem primarily affecting the CFO. The CFO has merely identified the problem.

Technical debt is a contract that the company has signed. It’s important to carefully pay these debts in order to free up Engineering’s time for other work. The fact that it also reduces engineer annoyance is a bonus, but should not be the primary motivator.

This might seem like a meaningless distinction but it lets us be clear about where the debt comes from: Strategic decisions made by leaders at the company. Technical debt doesn’t come from poor-quality engineering or the work of junior people, it’s a contract the company signs to move fast now and go slow later. Debt appears when the company incentivizes (or allows) technical shortcuts for immediate value payouts. Leadership sets the standards for the work and implicitly communicates what kinds of timelines are most important (usually very near-term timelines).

So, since technical debts come from leadership decisions, we as leaders can change the debt strategy of the company as we wish.

Interest Rates

If we go ‘fast now’ and ‘slow later’ we should probably make sure the ‘fast now’ is worth it. Technical debt has distinct terms and payment structures in the same way as financial debt. You could, right now, delete all the automated tests at your company. You would get a small but real increase in shipping speed for a few minutes. And then all hell breaks loose.

That’s like taking a 12-hour loan from the mob.

You could also, right now, add a new backend language to the list of languages your company supports. Depending on the language and the way it’s used this might really help your teams. You’ll have to support it, but perhaps it’s worth it.

This is more like taking out a traditional bank loan.

As we think more clearly about principal and interest rates we’re able to compare debts against each other. This is the first step to deciding which ones to pay down and which to ignore.

We do not want to eliminate all of our technical debt. Maybe someday, when the company issues stock dividends instead of reinvesting in R&D we can take that as a sign that there’s no more useful product expansion and we should pay down our existing debts. Until then, we accrue debts so we can move fast.

We want those debts to be at an extremely low price, both in interest payments and payoff events, so we can use the capital (our newly-unlocked free time) to make big investments. Just like in the world of finance, we can’t make a profit just by taking on debt; we have to purchase something with the liquidity that debt gives us. We make a profit when the thing we purchase is worth more than the debt.

Technical Investments

A technical investment is something that accelerates future work or earns future revenue.

For example, when we make deployments faster we earn a return every time we deploy, forever. When we migrate to a better-supported software framework then we have fewer edge cases that slow us down. When we reduce domain complexity or standardize the libraries we use then we add an accelerant to all future development.

These are all technical investments that reduce the cost of doing our work, and they’re the kinds of classic tech debt initiatives you might see championed by senior engineers on your teams who are frustrated and know that it’s possible to move faster.

Not all investments involve paying down the cost of doing work: Feature development is also an investment – the most obvious one. We create new value out of nothing and then we sell it and earn revenue (directly or indirectly, depending on your business model) forever.

If both cost reduction initiatives (like standardizing libraries) and feature development are investments, then we can prioritize them against each other. We can unify our mental framework of technical debt and technical investment into a more traditional financing portfolio. We’ll look at that framework in detail in Calculating Technical Debt but first let’s review how teams tend to address technical debt and why those attempts usually fail.

Common Pitfalls of Addressing Technical Debt

It’s rare to see a sophisticated technical debt portfolio. Even companies that have a rigorous product research and feature prioritization culture might get a little hand-wavy when talking about debts.

More likely, the discussion of technical debt is limited to listing very annoying, bothersome things that engineers wish they had time to fix, with some of those hardest, most frustrating problems elevated to a kind of bogeyman that even individuals far outside engineering might hear about regularly.

It’s easy to find agreement that technical debt is bad and less of it would be really good. Agreement gets harder to come by when we try to prioritize a specific debt, particularly if valuable product features need to be delayed because of it. Earlier in my career I presented to an executive team about pressing technical debt problems and had them all, from Marketing to Legal to Operations, tell me that they hear about the debt all the time and they sympathize, but that it wasn’t clear how to eliminate the problem.

As I stood in the conference room with my slides up I realized I didn’t have an answer for them. I, like most engineering leaders, knew the debts were a problem but I didn’t have a plan that went beyond “the engineers agree we should fix this.”

Actually trying to address debt tends, in my experience, to start off well-intentioned and then get stuck almost immediately in one of 3 specific non-solutions.

Failmode: 20% time allocated to debt

A popular one is to carve out some time to address debt, either by declaring that one week a month is ‘debt week’ or by deciding that roughly 20% of time should be allocated to debt. Some teams mark individual tickets as ‘debt’ and try to pull a set number of debt points into each sprint.

This is a recipe for both low morale and ever-higher debt. Allocating 20% time to tech debt produces a kind of digital stagflation. Partly because there’s no concept of the relative cost of different debts so there’s no way to determine the best debt to prioritize. But also because there’s no relationship between feature development and debt payments so the proportion of time allotted is entirely arbitrary. It’s usually about 20% because that’s as much as we can stomach losing from feature development, which feels like the more important work – even if the feature development is producing debt at a higher rate than we’re paying it down.
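To see why a fixed allocation can lose ground, here is a minimal sketch with made-up numbers. The 20% share comes from the pattern described above; the hours per sprint and the rate at which feature work creates new debt are assumptions chosen only for illustration.

```python
def debt_balance_after(sprints, hours_per_sprint=100, paydown_pct=20,
                       new_debt_per_100_feature_hours=30):
    """Track outstanding debt (in hours of future work) under a fixed split.

    Every default except the 20% paydown share is hypothetical.
    """
    balance = 0
    for _ in range(sprints):
        paydown_hours = hours_per_sprint * paydown_pct // 100
        feature_hours = hours_per_sprint - paydown_hours
        # Feature work creates new obligations for future work.
        balance += feature_hours * new_debt_per_100_feature_hours // 100
        # The fixed allocation pays some of it down, never below zero.
        balance = max(balance - paydown_hours, 0)
    return balance


# A year of two-week sprints: 24 hours of debt created and 20 paid per sprint.
growing = debt_balance_after(26)
# If features created debt more slowly, the same 20% would keep up.
stable = debt_balance_after(26, new_debt_per_100_feature_hours=20)
print(growing, stable)
```

Under these assumed rates the balance climbs by four hours every sprint. The allocation only works if it happens to exceed the rate of new debt, which the 20% figure never measures.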

Feature development is more important than debt payments along a very short timeline. But on a longer timeline, prioritizing mostly feature development will suffocate future feature development. As long as the team that’s trying this 80/20 mix can only look one or two sprints — or even just one or two quarters — into the future then the debts will never actually be a priority. And since the debt is competing for the same work time as value-creating features, the only way the debts get paid is by grumpy teammates performing heroics; a surefire way to ruin a team.

Even if they ever do get the debt under control they go right back to making more debt, ignoring that they may have cleared the way for some high-return investments that are going unstaffed, particularly any kind of tooling that might prevent the high-interest debt from returning.

Failmode: A Technical Debt Team

Worse, a team might be formed with a mandate to pay down debt. This was one of the teams I led at Square back when we had a Rails monolith that was on fire. My teammates and I dove into the inferno and tried to tame the most central and somehow most neglected part of the company’s product.

Unsurprisingly, it was really hard. But the worst part wasn’t the work, it was watching other teams ship features quickly on top of the mess we were cleaning and feeling like we were somehow losing ground despite our efforts.

As a colleague of mine once put it, this “tech debt team” pattern is like sending one team to the office basement and another on a free cruise. The incentive and context mismatch is a recipe for relational conflict between otherwise friendly colleagues.

Failmode: Locally-visible debt reductions

This last pattern is the most intuitive one: Just letting each engineer and team fix the debt nearest them as they see fit.

The benefit of this is that the person paying down the debt probably has the most local context and is likely to move quickly. But there’s no telling whether this is actually important debt. Is this engineer bailing out a sinking ship or are they rearranging chairs on deck while the whole vessel goes down? We return to the same prioritization problem: How do we figure out which debts to pay down with our finite time?

Leadership needs to provide direction on which debts to service interest on, which to pay off the principal, and which to ignore. And executive leadership needs to provide that direction over an articulable time horizon.

The bulk of this work is accomplished merely by facilitating a healthy, public, and ongoing discussion among your most senior engineers about the various debts across the company. The truth is distributed unevenly through the minds of your engineers and no individual has the full picture.

This might look like a periodic Technical Business Review meeting where senior engineers update a living document to reflect what they see in the system and spin off any important conversations. It might also be just a leadership culture of giving senior ICs the encouragement and time to explore problems and report on their findings. Either way, most of the debt-tracking work is accomplished by letting the people who can perceive the problems surface them.

That will give you knowledge about your debt portfolio and investment opportunities. But it’s up to you to use that info to make company-level technical debt financing decisions.

Technical Debt Interest Rates

The debt financing metaphor applies much more broadly than we might expect. Not only are there debts and investments, but each debt has a specific interest rate and each investment has an expected return. This is what allows us to prioritize which debts to pay down.

Consider the situation where you owe money on student loans and also to a payday lender. The student loans are at 5% interest and the payday loans are at 100%. You get a surprise $1K as a gift, where do you pay the balance? Maybe split it evenly, $500 on each?

Under this scenario every dollar you get should go to the payday loan until you have it fully paid off. In fact, if you could somehow take out more debt at the 5% rate and use that cash to pay down the higher interest loan that would be wise.
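The arithmetic is worth seeing once. This is a rough sketch: the 5% and 100% rates and the $1K gift come from the scenario above, while the loan balances are hypothetical and the interest is treated as simple annual interest.

```python
def allocate_payment(balances, rates, payment):
    """Apply a payment to debts in order of highest interest rate first."""
    remaining = payment
    updated = dict(balances)
    for name in sorted(rates, key=rates.get, reverse=True):
        paid = min(updated[name], remaining)
        updated[name] -= paid
        remaining -= paid
    return updated


# Hypothetical balances; only the rates and the $1K gift come from the text.
balances = {"student_loan": 10_000, "payday_loan": 2_000}
rates = {"student_loan": 0.05, "payday_loan": 1.00}

after_gift = allocate_payment(balances, rates, 1_000)

# Interest avoided over the next year by each way of spending the gift.
saved_all_payday = 1_000 * rates["payday_loan"]
saved_split = 500 * rates["payday_loan"] + 500 * rates["student_loan"]
print(after_gift, saved_all_payday, saved_split)
```

Putting the whole gift on the payday loan avoids $1,000 of interest over the year, while the even split avoids only $525.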

The first step in considering which debt to pay down is just to figure out the rough interest rate. To be clear, this isn’t an exact science. We’re not going to get decimal-level accuracy here as we try to calculate our debt portfolio. But, just like using Big-O notation for algorithms, we can be directionally correct.

Losing Work to Interest Rates

In a high-interest loan very little of the amount we pay actually lowers the principal. The first payment services the interest and the next payment does the same.

This is one of the easiest ways to identify high interest technical debt: What is something that forces your team to do work and then forces them to do the same work again in the future? How much energy gets sucked up by this debt payment? And how long would it take to pay off the principal such that no payments were ever made again?
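One rough way to put those last two questions side by side is a break-even estimate. The function and the numbers below are hypothetical, and hours are only a stand-in for the energy the team is losing.

```python
def break_even_weeks(principal_hours, interest_hours_per_week):
    """Weeks until paying off the principal beats continuing to pay interest."""
    if interest_hours_per_week <= 0:
        return float("inf")  # no recurring cost, so the payoff never pays back
    return principal_hours / interest_hours_per_week


# Hypothetical toil: a manual release checklist that costs 6 hours every week
# and that the team estimates would take 120 hours to automate away.
weeks = break_even_weeks(principal_hours=120, interest_hours_per_week=6)
print(weeks)
```

With these made-up figures the payoff earns itself back in 20 weeks. A debt that breaks even within a quarter or two is a strong candidate for the top of the list.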

If you were to make a list of toil like that you’d have a strong start on a debt portfolio.

The sneakier debts are ones that require no ongoing interest payment but the principal is increasing continuously. This might be from forking a software framework that’s now out of date, forcing your team to undo all of their work later in order to upgrade. Or building a software suite inside a monolith without building in boundaries that isolate major components of the products. In each of these cases you might have a surprise in the future where all work for a few months (or years!) is devoted fully to paying down a massive old debt.

Many of the debts in your system likely carry extremely low interest rates, despite how much engineers might complain about them. Maybe you’ve got a single page on your website on an old JS framework but it works and nobody plans to update it. It’s not in the New® Hotness™ and it’s an eyesore but if it’s not an obstacle to your team as they work then it carries a 0% interest rate. It would take the same labor to fix it in 5 years as right now and it isn’t a security attack vector. Which means you should absolutely ignore this debt.

Most debts aren’t obviously 0% and they’re not obviously giant scary ones – they’re somewhere in the middle. My favorite way to calculate these is by thinking about your company’s dimensions of scale and how each one might make the debt worse. For example, which debts grow along the axis of user growth? Which ones grow with engineering headcount growth? With feature count? With data size?

There are many things you can do to organize your debt portfolio but first you need to make it. You need an accurate accounting of the technical debts that matter to your company.

Calculating Technical Debt

The Five Properties of a Debt

There are roughly five things you need to know about each debt obligation in your system:

  • Principal – What would it take to fully pay it off?
  • Interest – How much energy is lost just putting up with it?
  • Increase In Principal – How much bigger will the payoff be in the future?
  • Increase In Interest – How much more energy will we lose in the future?
  • Payoff Events – Are there any inflection points in the future that will necessitate a sudden payment in full?
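If you want to keep these five properties somewhere more structured than a document, a small record type is enough. This is only a sketch: the field names, the units (team-months, share of creative energy), and the example debt are my own assumptions, not a standard schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Debt:
    """One entry in a technical debt portfolio (hypothetical field names)."""
    name: str
    principal_team_months: float                # what it takes to fully pay it off
    interest_energy_share: float                # 0.0 to 1.0 of the team's energy lost
    principal_grows_with: Tuple[str, ...] = ()  # what makes the payoff bigger
    interest_grows_with: Tuple[str, ...] = ()   # what makes the drag worse
    payoff_event_months: Optional[float] = None # months until payment in full is forced

    def has_deadline(self) -> bool:
        return self.payoff_event_months is not None


# A made-up entry, purely for illustration.
example = Debt(
    name="hand-rolled deploy script",
    principal_team_months=2,
    interest_energy_share=0.10,
    interest_grows_with=("engineer count",),
)
print(example.has_deadline())
```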

You may notice I described the debts here in terms of creative energy, not just time. This is deliberate because technical debt costs us cognitive drag, not just time. Creative engineering work does not happen at a constant speed. An engineer might say only 10% of their time is spent on a repetitive task but if you dig deeper it appears to be sapping a majority of their emotional and intellectual initiative. Toil like that can push the team to get distracted or snack on low-value work to avoid the more direct, frustrating work. Asking the team how much of their creative energy they feel like they’re losing is, in my experience, a better measure of how much enthusiasm and insight is lost. Which is what we actually pay our engineers for.

To make these concepts clearer let’s explore some hypothetical examples.

The Ballooning Postgres Database

One of your teams operates a service where all of the data is in a single growing Postgres database. The team notices that queries are getting slower as the storage size grows but this isn’t a system where performance is critical. Your current cloud deployment supports database volumes up to 20TB of storage and you’ll reach that maximum in a couple years if the current growth curve continues.

The team plans to shard the data and migrate to multiple smaller Postgres instances when they need to. They estimate it’ll take the whole team about 3 months to do that – and it’ll take longer if the data is larger.

The principal of this debt is what it takes to eliminate it: Three months of migrating to a sharded design.

The interest of this debt is how much drag it introduces to your engineers: None at all! Having this simple database setup has allowed them to move quickly on developing features.

The Increase in Principal is how much longer that sharding migration will take as the data grows. Let’s say that your team believes that the work here is mostly code changes but the actual migration of data might take weeks of an engineer shepherding it if the data is close to the 20TB limit. Since most of the work is the code changes we’ll say the increase in principal is low.

The Increase in Interest is zero because the interest is and will stay at zero.

There’s a Payoff Event where the full principal comes due: 2 years from now. Let’s be conservative and say we need to have it in 18 months.
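Here is a back-of-the-napkin version of that estimate. Only the 20TB limit comes from the example; the current size, the linear growth rate, and the 25% safety margin are hypothetical values picked so the sketch lands on the same two-year and 18-month figures.

```python
def months_until_limit(current_tb, limit_tb, growth_tb_per_month):
    """Months until storage hits the limit, assuming linear growth."""
    if growth_tb_per_month <= 0:
        return float("inf")
    return max(limit_tb - current_tb, 0) / growth_tb_per_month


# Hypothetical: 8TB stored today, growing by half a terabyte each month.
months_left = months_until_limit(current_tb=8, limit_tb=20, growth_tb_per_month=0.5)

# Assumed safety margin: plan to be done with a quarter of the runway to spare.
planning_deadline = months_left * 0.75
print(months_left, planning_deadline)
```

If your growth curve bends upward rather than running straight, the real runway is shorter than this linear estimate, which is one more reason to plan against the conservative date.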

| Principal | Interest | Increase in Principal | Increase in Interest | Payoff Event(s) |
| --- | --- | --- | --- | --- |
| 3 months * 1 team | zero | low | zero | 18 months |

The Asynchronous App on MongoDB

Another one of your teams operates a service that uses asynchronous code with callbacks on top of a sharded MongoDB cluster. The team complains about the difficulty of testing the asynchronous code and has had to invent and maintain a testing library to make this pattern accessible to new hires. The data they store is quite relational so they’re frustrated with the document-oriented storage model. Much of their work is creating and fixing secondary indexes. They say, anecdotally, that 75% of their energy is spent just fighting the system.

The database sharding method has plenty of room to grow and the database instances themselves are reliable. But the team dreams of rewriting everything, even though they say it would take a full year.

To eliminate the Principal of this debt requires a full rewrite of the app and a migration to a relational database.

The Interest is the high percentage of the team’s creative energy wasted by this debt.

The Increase in Principal is how much harder it’ll be to fix this if we wait. If the solution is a rewrite that means this is growing in lock-step with the app’s complexity.

The Increase in Interest is high because as this app grows there’s yet more painful complexity to wade through.

However, there’s no Payoff Event on the horizon. The team’s output will decelerate forever but the system will technically keep working correctly.

| Principal | Interest | Increase in Principal | Increase in Interest | Payoff Event(s) |
| --- | --- | --- | --- | --- |
| 1 year * 1 team | 75% | high | high | none |

Comparing two debts

Say one of your Directors of Engineering supports the managers of both these teams. The director comes to you and says both teams are asking for time to pay down their debts. What insight do you offer?

If we do nothing then in two years the Postgres database falls over and that’s a total outage. So we have to do something about that. Whereas the MongoDB team just gets sadder and slower but their system keeps working. So should we ever prioritize letting them do their rewrite?

That depends on what the purpose of that second system is. What benefit does the company get from this painful asynchronous app? Is the low morale of the team offset by a good feeling that at least their work is extremely important? Or is it a trivial set of features that could be deleted?

Assuming both systems are roughly equal in their importance to the company, I would advise this director to stem the bleeding on the async MongoDB system by curtailing new development in it.

We can cap the growing principal and interest if we don’t add new features into that system. To do this, we’d need to staff a one-time project to design a replacement system that new features can go into and write a full proposal on (eventually) migrating all existing features to the new system. (It’s important to write that detailed proposal in order to test whether it’s possible to one day fully decommission the existing system.)

Considering the timelines involved, I’d advise this director to give the next year of time to the MongoDB team to build and start to use a new approach, then spend the second year focusing on the Postgres sharding migration. If the company hasn’t totally transformed after that, in the third year the MongoDB team can finish decommissioning the original app.

This is an imperfect science, but as long as high interest rates are tracked and addressed the team should be able to recoup their creative energy.

A Notation for Scale

Note how in the ‘Increase in Principal’ and ‘Increase in Interest’ columns for the second example I put the word ‘high’. That’s not very helpful. How do we compare one ‘high’ against another? What if most of our debts have ‘high’ interest?

Let’s look for a better way to describe how debts can get worse over time.

If you’ve ever interviewed at a company that hasn’t updated their hiring philosophy since the 1990s you might have encountered Big O notation. It’s a way of describing how the cost of some logic grows as the data it’s applied to grows, usually quoted for the worst case. Let’s steal just a piece of this concept to describe the way that debt can get worse over time. Instead of using a single dimension ‘n’ as in ‘O(n)’ versus ‘O(n²)’ we’ll look at all of the different dimensions along which a digital system can grow and therefore debts can get more expensive.

Dimensions of scale

This book is focused on engineering teams that support online software systems because that’s been my whole career. These systems grow along many axes over time: You’ll get more traffic, you’ll store more data, and the graph of your data relationships will become more complex. Alongside all that, you’ll see more engineers working on it, more customers using it, and an ever-larger range of dates represented in the production dataset.

You may have other dimensions of scale, depending on your business model.

User Traffic

This is a classic measure of scale. Creating a version of something that ten people can use at once is phenomenally easier than one that a million can use at once, independent of the dataset behind it.

Data Storage

Another classic measure of scale. There are some cliffs to watch out for here (like running out of storage for any non-distributed database) but even incremental growth will cause your queries to respond more slowly and your hardware expenses to go up.

Feature complexity

I love keeping track of the number of supported features. Partly because it gives us the Shipped Potential chart but also it’s helpful to know roughly how many different user experiences the system supports. This kind of inventory makes it possible to answer the question “If we build X how many existing features might need to be adjusted?” In practice, a question like that can surface a rough coefficient for multiplying the back-of-the-napkin time estimate that an engineer might give.

It’s not rare for each feature to take longer to develop than the previous one. So, for the purposes of modeling technical debt, it’s helpful to use feature complexity as one dimension of scale along which a debt might get more costly.

Data Modeling Complexity

Most of the companies I’ve worked with should pay more attention to this dimension. This is a measure not just of the number of databases, tables, and columns that exist in the schemas at your company, but of overall graph complexity.

If you were to generate a diagram of your production schemas and data relationships it might be super ugly. Instead of a neatly organized tree structure you’ll probably find a few datasets that virtually everything references. These datasets would also be the most painful ones to work with, they’d represent the central concepts of your flagship product, and they’d probably have way too many fields.

If you’ve never calculated the graph complexity of your production schema before, there are plenty of different measurements you can make. To start, I recommend keeping it very simple and just counting two metrics: 1) The p90 and p99 number of columns in all tables, and 2) The number of relationships between tables divided by the total number of tables. As these numbers go up, some debts will become more costly.
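As a starting point, those two metrics can be computed from nothing more than a column count per table and a list of table-to-table references. The sketch below uses a made-up ten-table schema and a simple nearest-rank percentile; in practice you would pull the counts from your own database catalog.

```python
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile using integer arithmetic (pct is 1 to 100)."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct% of the count
    return ordered[max(rank, 1) - 1]


# Hypothetical schema: table name -> number of columns.
column_counts = {
    "users": 40, "orders": 22, "payments": 15, "addresses": 12, "sessions": 9,
    "products": 8, "carts": 6, "coupons": 6, "reviews": 5, "audit_log": 4,
}

# Hypothetical relationships: (referencing table, referenced table).
relationships = [
    ("orders", "users"), ("payments", "orders"), ("payments", "users"),
    ("addresses", "users"), ("sessions", "users"), ("carts", "users"),
    ("carts", "products"), ("reviews", "users"), ("reviews", "products"),
    ("coupons", "orders"), ("audit_log", "users"), ("orders", "addresses"),
]

p90_columns = percentile(column_counts.values(), 90)
p99_columns = percentile(column_counts.values(), 99)
relationships_per_table = len(relationships) / len(column_counts)

# The table that everything else points at is usually the painful one.
most_referenced = Counter(target for _, target in relationships).most_common(1)[0]
print(p90_columns, p99_columns, relationships_per_table, most_referenced)
```

The single numbers matter less than their trend. Recording them every quarter shows whether the schema is getting denser faster than the team is growing.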

Number of Employees at the company

Is there manual work for your technical team every time one more person joins the company? Or are the administrative interfaces that employees use to manage and operate the product starting to creak at the seams?

This tends to only matter for administrative tools, but it doesn’t look great when an engineering leader is totally surprised by a scaling cliff for tools their colleagues use. And if you find that more than half of your employee headcount is customer support you wouldn’t be the first leader to be startled by that; many startups that fail to prioritize internal-facing debts have to hire budget-destroying Customer Support teams.

Number of Engineers

You’ll likely experience slowdowns in development as your engineering headcount scales but it’s not necessarily related to technical problems. Here you’ll find process debt, cultural debt, and communications debt in addition to some technical debt.

As engineering headcount scales you may find brittleness in your build and deployment tools, especially if any special knowledge might be required to perform a deploy. At some point you may also notice the scarcity of good code reviewers because as headcount scales up the percentage of the system that any individual engineer understands goes down.

In my experience the debts here are almost entirely non-technical. I recommend separating those debts from the strictly technical ones for the purposes of your technical debt calculations. The process and cultural and communications changes don’t compete for your time from the same work queue as the technical ones. Nobody’s going to ask you to choose between shipping a feature or overhauling your internal communications. The expectation is that those two pieces of work can happen in parallel.

The one place I recommend looking very carefully with regard to engineering headcount scaling is Not Invented Here syndrome. Where in the product is there an invention that doesn’t absolutely have to be there? Notice which debts don’t work as well with lots of new hires. Too often a technical system relies on an unnecessary invention understood by a select few – often the same few people who’re needed to do other critical work. This is especially dangerous if the invention was written by a technical cofounder who’s now in an executive role of some kind because their invention has likely been (unconsciously) shielded from scrutiny.

Number of Users

This is a big one. Regardless of data size or throughput in bytes, features that worked for a hundred users rarely work for ten thousand. Administrative interfaces will get slow, synchronous workflows will need to be made asynchronous, perhaps the primitive search system for exploring user data will need to be fully replaced, etc. And any piece of code or UI that operates across multiple users (typically found in analytics jobs and admin interfaces) will strain as user count goes up.

Perhaps more sneakily, the number of edge cases that your team will see in the data relationships scales roughly in line with user growth. Given enough users, you’ll see every possible permutation of user data, each of which will need to be encoded in the test suite using ever more complex data in the tests.

Each passing day

There are some systems that record a snapshot in time or perform date calculations over a range of the full dataset. Even when nobody’s using these systems this dimension of scale can get worse merely from the ticking of the clock.

Using Dimensions of Scale

So instead of saying that an interest rate is ‘high’, we can say that it gets worse along specific dimensions.

In the case of The Asynchronous App on MongoDB the principal of the debt got worse with feature growth and with relational data complexity. And the interest got worse with each new engineering hire at the company. Even if the engineers on the team felt like a steady 75% of their time was wasted slogging through, for each new engineer at the company there’s a greater bifurcation of architectural approaches and increased desire for people to leave this team and work on something better.

We can describe this debt obligation a little better now.

| Principal | Interest | Increase in Principal | Increase in Interest | Payoff Event(s) |
| --- | --- | --- | --- | --- |
| 1 year * 1 team | 75% | features * data complexity | engineer count | None |

And in the case of The Ballooning Postgres Database we saw that the principal increases as data size increases, so we can be more specific about that.

| Principal | Interest | Increase in Principal | Increase in Interest | Payoff Event(s) |
| --- | --- | --- | --- | --- |
| 3 months * 1 team | zero | data size | zero | 18 months |
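One possible reading of this notation is that a debt's cost scales with the product of the dimensions it names. The sketch below takes that reading literally. The growth multipliers are hypothetical, and treating the relationship as a straight multiplication is a simplifying assumption in the same directional spirit as Big O.

```python
import math


def growth_factor(grows_with, scenario):
    """Multiply the scenario's multiplier for each named dimension of scale.

    An empty tuple means the cost does not grow, so the factor is 1.
    """
    return math.prod(scenario[dimension] for dimension in grows_with)


# Hypothetical outlook: how much each dimension multiplies over the period.
scenario = {
    "features": 1.5,
    "data complexity": 2.0,
    "engineer count": 1.2,
    "data size": 4.0,
}

mongo_principal = growth_factor(("features", "data complexity"), scenario)
mongo_interest = growth_factor(("engineer count",), scenario)
postgres_interest = growth_factor((), scenario)
print(mongo_principal, mongo_interest, postgres_interest)
```

The notation tells you which dimension to watch, not how steeply the cost climbs along it. The Postgres principal grows with data size, but the team judged that growth to be low because most of the work is code changes.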

With that, let’s make a technical debt portfolio for your company. We’ll assume you have to deal with both The Ballooning Postgres Database and The Asynchronous App on MongoDB. On top of that, let’s contrive a few other realistic debts for your teams.

Yes, these are all situations I’ve lived through and, yes, forcing you to hear about them is absolutely therapeutic for me.

The Slow Admin Dashboard

You’ve got an internal administrative dashboard app that lets employees navigate user data. It’s existed for years and the pages are getting slow. It has its own permissions to read and write to the databases that sit underneath the product applications and it queries them directly. It only performs ‘SELECT’ queries but they are very inefficient – on average a page loads in about 30 seconds.

You’re worried that one day too many of your colleagues will try loading the dashboard’s index page at the same moment. The page is so data-rich that even a few dozen simultaneous page loads can lock up one of the production databases and cause an outage. Your engineers want to replace it with a new client side app that uses HTTP endpoints to the product applications instead of direct access to their databases.

There are actually two debts here. There are two things that need to be done, so there are two obligations for future work. One is that you’ll need to start using read replicas for this dashboard app. Simply loading a page should never cause a production outage and the best cache for a database is a database’s read replica. This debt has a Payoff Event that’s imminent. The other debt is that you’ll need to move away from direct shared database access as an architectural pattern.

Let’s make an entry in our debt portfolio for each of those.

It’ll take about a day to move the queries from the primary database to a read replica and pay off the Principal. There isn’t any repeated maintenance labor caused by this app (other than perhaps a looming dread) so the Interest is zero. That won’t change so the Increase in Interest is also zero. And the cost to switch to a read replica is the same both now and later so the Increase in Principal is zero.

Fixing the architecture of this whole app is a far, far harder job. The engineers say they can do it in 6 months but maybe you’ve seen this before and know that the Principal here will take a full team 2 years at a minimum. Since fixing the architecture is effectively a rewrite you see an Increase in Principal with each new feature that’ll need to be ported from the old way to the new way.

The slowness of the app causes all employees to pay an Interest payment of about 30 seconds every time they use a page. And it’s getting slower so you see an Increase in Interest as database size grows, as the user count increases, and as more employees use the pages.

To make it all worse, the load balancers for this app have a maximum 60 second timeout so as soon as these slow pages can’t respond in that window the app suddenly stops working completely. Looking at some charts you estimate that Payoff Event will happen in 9 months.
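One way to turn “looking at some charts” into a number is a simple extrapolation. The sketch below assumes page load time compounds at a steady monthly rate and treats the 30-second average as the starting point. The 8% growth figure is purely an illustrative assumption, chosen because it lands near the 9-month estimate; your own charts would supply the real rate.

```python
import math


def months_until_timeout(current_seconds: float,
                         timeout_seconds: float,
                         monthly_growth: float) -> float:
    """Months until page load time reaches the timeout, assuming load
    time compounds by `monthly_growth` (0.08 means 8%) every month."""
    if current_seconds >= timeout_seconds:
        return 0.0
    if monthly_growth <= 0:
        return math.inf  # load time never reaches the timeout
    return (math.log(timeout_seconds / current_seconds)
            / math.log(1 + monthly_growth))


# 30 s today, 60 s load balancer timeout, assumed 8% growth per month.
print(round(months_until_timeout(30, 60, 0.08), 1))  # 9.0
```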

| Debt | Principal | Interest | Increase in Principal | Increase in Interest | Payoff Event(s) |
| --- | --- | --- | --- | --- | --- |
| Admin uses read replica | 1 day | zero | zero | zero | any day now |
| Admin uses HTTP APIs | 2 years * 1 team | Waiting 30 seconds to see any page | features | employees * data size * user count | 9 months |

Both of these are critical. The second one, at least, won’t fail immediately. So the right course of action is to fully pay off the first one as soon as there’s a good moment in the team’s cadence and then work on a thoughtful strategy for addressing the second – especially considering the payoff event is sooner than the principal payoff period.

Engineers Cloning Production

An early employee created a script that dumps a production database directly to standard out and pipes it into their laptop’s local database. It’s extremely popular because any engineer — particularly junior ones — can download production data to their laptop and run the app in development mode to see how their code changes perform in production.

You hope to raise a round of funding soon and you know that that’ll require an audit of your security and compliance. This script will be flagged as a major compliance violation because user data should never leave the production environment, and it should definitely not be sitting around on an employee’s laptop, least of all as the entire user dataset.

You also know that this poses a handful of other major problems for your organization in both engineering culture and technical sophistication. Here, I’m going to talk about the nuance of this debt but in case your company is leaning toward this pattern let me urge you in the strongest possible terms not to do it. Your teams should have production-like data in their unit test fixtures. Until that’s true the product development quality will be low and there’ll be a temptation to download data from production.

Calculating this debt is interesting. The debt isn’t that bad right now, though there are some Interest payments in the form of lower security, script maintenance, and the lack of comprehensive test fixture or test factory data. But there are several Payoff Events and when we suddenly need to move away from this approach it will take an unknown amount of work to pay the Principal and get our system in shape to be developed through better patterns.

There’s an Increase in Principal as more code is written with insufficient unit tests or poor service boundaries, and each new engineer who joins and uses this pattern puts us further behind.

There’s an Increase in Interest with each new engineering hire as people start asking for slight improvements in the script or for better production data scrubbing and development time gets allocated there.

| Debt | Principal | Interest | Increase in Principal | Increase in Interest | Payoff Event(s) |
| --- | --- | --- | --- | --- | --- |
| Cloning production data to laptops | unknown | 5% | features * engineers | features * engineers | Any audit, Hiring a staff engineer |

Messy core product modeling

For the purposes of this example, let’s say your flagship product involves file storage. This is such a key concept at your company that the word ‘file’ appears in most conversations and is in the center of most technical whiteboard drawings. There’s a class called ‘File’ in the main app and it’s out of control: thousands of lines, relationships to most other classes in the codebase, and with what appears to be two poorly-implemented finite state machines within the class itself.

Paying down the Principal here is just a ton of work. Hundreds or even thousands of careful refactorings, moving logic out of this class into something better encapsulated and modeling File-related concepts in their own classes. Your team pays Interest every time they touch this file or an adjacent one and, anecdotally, it takes a week to do in ‘File’ what could be done in an hour elsewhere in the system. Every time new features are added there’s an Increase in Principal and an Increase in Interest.

How would you calculate the actual interest rate for File? Taking a week to do an hour’s worth of work sounds extreme – that’s a 40x slowdown whenever work gets close to this code, or 97.5% of energy going to servicing interest payments. But not all the work happens here, most of your team is working elsewhere in the system so it’s not like all of engineering is slowed down 40x. If it were, this would be the most critical piece of debt to pay off before anything else.
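The arithmetic behind those two figures can be written down directly. This is a small sketch assuming “a week” means a 40-hour work week; the function name `interest_rate` is just for illustration.

```python
def interest_rate(actual_hours: float, ideal_hours: float) -> float:
    """Fraction of effort that goes to servicing debt instead of the work."""
    if ideal_hours <= 0 or actual_hours < ideal_hours:
        raise ValueError("expected 0 < ideal_hours <= actual_hours")
    return 1 - ideal_hours / actual_hours


WORK_WEEK_HOURS = 40  # assumption: a week of work is 40 hours
slowdown = WORK_WEEK_HOURS / 1  # a week to do an hour's worth of work
rate = interest_rate(actual_hours=WORK_WEEK_HOURS, ideal_hours=1)

print(f"{slowdown:.0f}x slowdown, {rate:.1%} interest")
# 40x slowdown, 97.5% interest
```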

This situation is common and also a good reason to have a finished debt portfolio, because it’s impossible to know whether to pay this down without knowing how many File-related features are on the roadmap and what other debts are also in the way of upcoming work.

To calculate this debt let’s just get a rough sense for how much time engineers will spend in this area of the code and make a guess at how much this debt might frustrate them. Maybe your upcoming features will be half of the speed they could be (a 50% interest rate) because a few of them require working in this debt. Rather than agonize over the exact baseline, make an educated guess. If you’re wrong it’ll come up in discussion with your senior ICs.
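If you want something slightly firmer than a guess, one possible model is to blend the local slowdown with the share of planned work that lands in the debt-laden code. The sketch below is an assumption of mine rather than a formula from the portfolio itself: it supposes a fraction of the ideal work is slowed by a fixed factor and the rest proceeds at full speed.

```python
def blended_interest_rate(fraction_in_debt: float, slowdown: float) -> float:
    """Overall interest rate when only part of the work touches the debt.

    fraction_in_debt: share of ideal work hours spent in the debt-laden code.
    slowdown: how many times longer that work takes there (40 means 40x).
    """
    if not 0 <= fraction_in_debt <= 1:
        raise ValueError("fraction_in_debt must be between 0 and 1")
    if slowdown < 1:
        raise ValueError("slowdown must be at least 1")
    # Time spent per hour of ideal work, averaged across the roadmap.
    time_multiplier = (1 - fraction_in_debt) + fraction_in_debt * slowdown
    return 1 - 1 / time_multiplier


# Assumed inputs: 2.5% of planned work touches File, at a 40x slowdown.
print(f"{blended_interest_rate(0.025, 40):.0%}")  # 49%
```

Under this model, a 40x local slowdown needs only a small slice of the roadmap to touch File before overall speed is roughly halved, which is why the roadmap matters as much as the code.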

| Debt | Principal | Interest | Increase in Principal | Increase in Interest | Payoff Event(s) |
| --- | --- | --- | --- | --- | --- |
| Messy core product modeling | years | ~10-50% | features | features | None |

A Full Technical Debt Portfolio

Let’s put all these contrived but believable examples together and see if we can compare them against each other.

| Debt | Principal | Interest | Increase in Principal | Increase in Interest | Payoff Event(s) |
| --- | --- | --- | --- | --- | --- |
| Ballooning Postgres Database | 3 months * 1 team | zero | zero | zero | 18 months |
| Asynchronous app on MongoDB | 1 year * 1 team | 75% | features * data complexity | engineer count | None |
| Admin uses primary database | 1 day | zero | zero | zero | any day now |
| Admin reads from database directly | 2 years * 1 team | Waiting 30 seconds to see any page | features | employees * data size * user count | 9 months |
| Engineers Cloning Production | unknown | 5% | features * engineers | features * engineers | Any audit, Hiring a staff engineer |
| Messy core product modeling | years | ~10-50% | features | features | None |
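If you keep the portfolio somewhere more structured than a spreadsheet, one check is easy to automate: which debts have a Payoff Event that arrives before the Principal could realistically be paid. The sketch below is illustrative only. The `Debt` class and `at_risk` function are hypothetical names, “any day now” is modeled as 0 months, and 1 day as 1/30 of a month. It flags risk; it does not rank what to work on.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Debt:
    name: str
    # Calendar months for one team to pay the principal; None if unknown.
    principal_months: Optional[float]
    # Months until a dated payoff event; None if there is no dated event.
    payoff_event_months: Optional[float]


# Figures transcribed from the portfolio table, under the assumptions above.
PORTFOLIO = [
    Debt("Ballooning Postgres Database", 3, 18),
    Debt("Asynchronous app on MongoDB", 12, None),
    Debt("Admin uses primary database", 1 / 30, 0),
    Debt("Admin reads from database directly", 24, 9),
    # Payoff events exist here (any audit, a staff hire) but are undated.
    Debt("Engineers Cloning Production", None, None),
    Debt("Messy core product modeling", None, None),
]


def at_risk(portfolio: List[Debt]) -> List[str]:
    """Names of debts whose payoff event precedes the earliest full payoff."""
    return [
        d.name
        for d in portfolio
        if d.principal_months is not None
        and d.payoff_event_months is not None
        and d.principal_months > d.payoff_event_months
    ]


print(at_risk(PORTFOLIO))
# ['Admin uses primary database', 'Admin reads from database directly']
```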

It won’t be clear and obvious which of these to work on and in what order unless we know what we’re trying to accomplish next. Well, perhaps moving the admin app to a database replica is a shoo-in for prioritization because of the low cost and immediate payoff. But prioritizing the rest depends on what features the company wants next.

Will your teams be working exclusively in the ‘File’ class? If so, better make a plan to either clean that up or somehow mitigate the worst parts of it.

Does the company critically need to improve customer NPS and lower customer service headcount growth? Perhaps a super-responsive and more powerful admin dashboard is a must-have.

Are you planning to double engineering headcount in the next year? If so, both the asynchronous MongoDB situation and the production cloning pattern might need to be remedied immediately.

The point of a technical debt portfolio isn’t to make a TODO list of debts to address, it’s to gain knowledge of the financial landscape around you so you can make winning investments.

An Executive View of Technical Debt

Debts let us move fast now at the cost of going slower later.

Investments give us value so we can survive and hire people to pay off debts.

Taking on low-interest debts while making high-ROI investments allows us, over time, to reach our maximum speed.

But what’s the timeframe? When can we afford to go slow and when must we go fast?

That is a conversation that can only exist between you and the CEO (or other leader) you report to. Only your lead can determine what the major milestones are and when they need to be met. Only you can provide the context necessary to reframe the conversation from lots of tiny sprints forward (which accumulate debt and forever slow development) toward a measured set of debts and investments over a long timeframe.

To make that more concrete, let’s examine scenarios of companies at different growth stages and the kinds of technical investment approaches that might be appropriate for each.

Taking the Executive View

Most of the time, leadership isn’t stepping out front and saying “This is the way, follow me!” but it does frequently involve giving a map and a route to your people. Even if you don’t know how to draw the map, as the leader you need to source it and provide it.

Remember we’re trying to minimize drag here, specifically cognitive drag: the energy-sapping confusion and conceptual graph complexity that an engineer feels when trying to be successful in a complex environment.

As an executive leading a modern software engineering organization you’re navigating your people through a landscape of sheer complexity. The territory in which your engineering teams work is literally interconnected concepts written down. That’s what software is. Leadership, here, needs to be some kind of guidance through that maze of concepts that allows your engineers to put only the necessary concepts in their head to navigate from where they are to where they need to go.

A technical debt portfolio is a map of conceptual territory along the dimension of time. It marks where there are impassable obstacles, where there are arduous hills to climb, and where there are smooth paved roads. While an architectural diagram can describe where you stand in that territory right now, a debt portfolio gives you insight into the safe roads for the journey ahead.
