Emmanuel Genard

The Code Works But The Code's Wrong

The most dangerous bug is not in the code but in the mind of the person writing the code. You can have code that works which hides a bug that exists in the design. This kind of bug can lay dormant for years. It’s only revealed when a benign change causes serious problems. This kind of bug can be prevented by treating the design of software as a thing apart and making sure that the code reflects the design as closely as possible. What follows is a story of how some code that worked turned out to be wrong.

The code was hard to decode.

It handled the confirmation of subscription renewals on the app stores(Apple/Google Play/etc…). When a customer first subscribes, we create a subscriber record which had the renewal date. When the renewal date came up we create a new subscriber record for that month, put it in a Pending state and then send a request to the app store to see if the customer renewed their subscription.

A subscriber was encoded in the following data structure:

type Subscriber = {
  state: 'Active' | 'Inactive' | 'Pending',
  invoice_month: number,
  invoice_year: number,
  // The above two fields answer questions like:
  // How many active subscribers did you have in January 2023?
  renewal_date: Date | nil,
  type: 'Regular' | 'FreeTrial'
  // Other fields that don't matter for this story are omitted
}

The specification for renewing a subscription went like this

RenewalService

but the code did more. The code handled free trial conversions. Converting a free trial is like a renewal, but not quite. A free trial has a date it ends like a regular renewal(we used renewal_date for this) but that date is not usually the next month. A free trial often ends within the same month it was started.

Adding the specification for the free trial conversions inside of RenewalService leads to following:

RenewalService

It also did a couple more things but they don’t factor into the rest of the story. It restored subscriptions(again something like a renewal, but not quite). Under certain situations that happened with some other code somewhere else in the system it would change a subscriber from monthly to yearly or the reverse. And it created or updated records specific to each app store.

The code had a bug

Some renewals were skipping months. For example, a subscription started in July, that was suppose to get renewed for August, would come up as being renewed for September. The code was costing money.

The bug was fixed

This is not about the bug fix, this is about what was done as part of the bug fix and what happened after.

The code was made easier to decode

I’m a solid dev so I single-responsiblitied the code. I created a FreeTrialConversionService , SubscriberRenewalService, and SubscriberRestorationService.

The specifications for the FreeTrialConversionService and the SubscriberRenewalService are outlined below. You’ll notice that it is just the RenewalService broken up into two separate services. The tests were unchanged and they all passed.

FreeTrialConversionService

SubscriberRenewalService

No behavior was changed.

The new code started making duplicate records

It was tens of thousands when someone on the support team brought it up on Slack. It was hundreds of thousands by the end of the day. I could not figure out how I fucked up so bad. This is much worse than the original bug. The duplicate records were breaking a lot more functionality across the system and costing a lot more money. But as far as I knew the new code did the same thing.

The code was reverted

To stop the bleeding and figure out what happened.

Accidental Grace

Let’s talk about the app store response. The responses from the app store were parsed and made to conform to a single interface

interface AppStoreResponse {
	active: boolean,
	expiration_date: Date,
}

The thing that really mattered was the expiration_date. An expiration_date in the past meant that the subscriber’s access to the product had expired so active would be false.

One of the app stores had a different rule.

This app store required us to offer a grace period. This meant that active would be true while the expiration_date was in the past.

This meant that the app store response had this specification:

AppStoreResponse

The specifications for renewing a subscriber and the app store response never matched. Subscribers who were in a grace periods were continually being picked up by the query that picked up subscribers up for renewal. In the initial implementation, since this line

If the invoice_month and invoice_year are the current month and year.

was true they were treated like free trial conversions. The code “worked” I put the word, worked, in quotes above because the code did a half-assed job. Grace periods were never dealt with so we did not keep track of them or have specific logic to deal with them. This left a significant hole in how we dealt with subscribers. We didn’t even know to ask certain questions or do certain things, for example: How many subscribers are currently in a grace period? Is there a huge jump in subscribers in grace periods after the holidays? We also could have only checked the status of subscribers in a grace period after the grace period was over which would avoid unnecessary requests to the app stores. by accident. The new implementation revealed the mismatch in specifications. The SubscriberRenewalService , made a new subscriber record each time. And each the record ended up in an Active state with a renewal_date in the past which were immediately picked up to go back into the SubscriberRenewalService. Which is what happened in the initial implementation(RenewalService) with the exception that new Subscriber records were not created.

The Error of Modular Reasoning

In his blog post, The Three Levels of Software: Why Code that Never goes Wrong Can Still Be Wrong James Koppel summarizes the three levels of software as:

The code worked at the runtime level. The code “worked” at the concrete level. The code did not work at the design level.

He puts it another way:

A program is wrong if the reasoning for why it should be correct is flawed.

You’ll notice that I mentioned the code but never show any. This is because this is isn’t really about a specific implementation(“the code”). It’s about the reasoning that produced the code. The code was written with the reasoning that it only had to deal with regular subscribers and free trials. It could have been implemented in many different ways. It would been wrong no matter how it was written.

The new implementation revealed the error of modular reasoning in the initial implementation.

So what?

Talk is cheap. Show me the code.

I started my programming career believing that code was king but code is getting cheaper to produce everyday. A talk with the person who initially wrote the code would have saved me from a lot of confusion, frustration, and about five hundred thousand duplicate records. So actually,

Code is cheap. Let’s talk about it.

The code comes at the end of a long series of steps. In rough outline the steps are, the request, the requirements, turning requirements to acceptance criteria, coming up with a plan, executing that plan. I’ve merged the last two parts in my programming career. I’ve started executing, expecting to come up with a plan as I go aided with the scaffolding of writing tests first. It’s served me reasonably well but in the last couple of years I’ve realized that the “coming up with a plan” part is its own process. The design of software is a thing apart.

What does it mean that the design of software is a thing apart? It means that the information about the design of piece of a software is not in the code. It means that there are assumptions the programmer makes about the inputs to the code and the outputs of the code that are not in the code. The initial implementation of the RenewalService always expected the AppStoreResponse to have an expiration_date in the future if the the subscription was still active. This was nowhere in the code. It was in the mind of the person writing the code.

After you have a design, the next thing to do is to make sure the code reflects the design as much as possible When it can’t be in the code, it can be in a comment. . If the initial implementation of RenewalService threw an error if the AppStoreResponse returned an active subscription with an expiration_date in the past, the design flaw would have been discovered much sooner.

I urge you to spend time designing your program and then make sure your code reflects the design.

Published: 2023-01-13

Last Edited: 2023-01-29