The Code Works But The Code's Wrong

The most dangerous bug is not in the code but in the mind of the person writing the code. You can have code that works which hides a bug that exists in the design. This kind of bug can lay dormant for years. It’s only revealed when a benign change causes serious problems. This kind of bug can be prevented by treating the design of software as a thing apart and making sure that the code reflects the design as closely as possible. What follows is a story of how some code that worked turned out to be wrong.

The code was hard to decode.
The code had a bug
The bug was fixed
The code was made easier to decode
The new code started making duplicate records
The code was reverted
Accidental Grace
The Error of Modular Reasoning
So what?

The code was hard to decode.

It handled the confirmation of subscription renewals on the app stores(Apple/Google Play/etc…). When a customer first subscribes, we create a subscriber record which had the renewal date. When the renewal date came up we create a new subscriber record for that month, put it in a Pending state and then send a request to the app store to see if the customer renewed their subscription.

A subscriber was encoded in the following data structure:

type Subscriber = {
  state: 'Active' | 'Inactive' | 'Pending',
  invoice_month: number,
  invoice_year: number,
  // The above two fields answer questions like:
  // How many active subscribers did you have in January 2023?
  renewal_date: Date | nil,
  type: 'Regular' | 'FreeTrial'
  // Other fields that don't matter for this story are omitted
}

The specification for renewing a subscription went like this

`RenewalService`

Given a Subscriber in the Active state with a renewal_date in the past
- Create a new Subscriber with a Pending state and the invoice_month and invoice_year of the current month and year.
- Send a request to the app store get the state of the subscriber on their end
  - If the response comes back saying the subscriber is current
    - Update the Subscriber record to Active
    - Update the renewal_date to match the expiration_date in the app store response
  - If the response comes back saying the subscriber is not current
    - Update the state to Inactive

but the code did more. The code handled free trial conversions. Converting a free trial is like a renewal, but not quite. A free trial has a date it ends like a regular renewal(we used renewal_date for this) but that date is not usually the next month. A free trial often ends within the same month it was started.

Adding the specification for the free trial conversions inside of RenewalService leads to following:

`RenewalService`

Given a Subscriber in the Active state with a renewal_date in the past
- If the invoice_month and invoice_year are the current month and year This is the check for free trial ending within the same month it started :
  - Change the state of the Subscriber to Pending while keeping the same invoice_month and invoice_year
    - Send a request to the app store to find out if the customer converted
      - If the response comes back saying the subscriber is current
        
        Update the state to Active
        
        Update the type to Regular
        
        Update the renewal_date to match the expiration_date in the app store response
      - If the response comes back saying the subscriber is not current
        
        Update the state to Inactive
- If the invoice_month and invoice_year are not the current month and year This means it's a regular renewal or a free trial that ended the month after it started :
  - Create a new Subscriber with a Pending state and the invoice_month and invoice_year of the current month and year.
  - Send a request to the app store to get the state of the subscriber on their end
    - If the response comes back saying the subscriber is current
      - Update the state to Active
      - Update the renewal_date to match the expiration_date in the app store response
    - If the response comes back saying the subscriber is not current
      - Update the state to Inactive

It also did a couple more things but they don’t factor into the rest of the story. It restored subscriptions(again something like a renewal, but not quite). Under certain situations that happened with some other code somewhere else in the system it would change a subscriber from monthly to yearly or the reverse. And it created or updated records specific to each app store.

The code had a bug

Some renewals were skipping months. For example, a subscription started in July, that was suppose to get renewed for August, would come up as being renewed for September. The code was costing money.

The bug was fixed

This is not about the bug fix, this is about what was done as part of the bug fix and what happened after.

The code was made easier to decode

I’m a solid dev so I single-responsiblitied the code. I created a FreeTrialConversionService , SubscriberRenewalService, and SubscriberRestorationService.

The specifications for the FreeTrialConversionService and the SubscriberRenewalService are outlined below. You’ll notice that it is just the RenewalService broken up into two separate services. The tests were unchanged and they all passed.

`FreeTrialConversionService`

Given a Subscriber in the Active state with a renewal_date in the past and a type of FreeTrial
- If the invoice_month and invoice_year are the current month and year
  - Change the state of the Subscriber to Pending but keep the same invoice_month and invoice_year.
  - Send a request to the app store to find out if the customer converted
    - If the response comes back saying the subscriber is current
      - Update the state to Active
      - Update the type to Regular
      - Update the renewal_date to match the expiration_date in the app store response
    - If the response comes back saying the subscriber is not current
      - Update the state to Inactive
- If the invoice_month and invoice_year are not the current month and year
  - Pass the Subscriber to the SubscriberRenewalService

`SubscriberRenewalService`

Given a Subscriber in the Active state with a renewal_date in the past
- Create a new Subscriber with a Pending state and the invoice_month and invoice_year tied to the renewal_date This is what fixed the bug. The background job we use to run the renewal process could get backed up at the end of the month and spill over into the following month. This lead to subscribers who were actually renewed in March showing up as being renewed in April. Their invoice_month value should have been March but it was April. This was the source of skipped months. .
- Send a request to app store get the state of the subscriber on their end
  - If the response comes back as saying the subscriber is current
    - Update the Subscriber record to Active
    - Update the renewal_date to match the expiration_date in the app store response
  - If response comes back saying the subscriber is not current
    - Update the state to Inactive

No behavior was changed.

The new code started making duplicate records

It was tens of thousands when someone on the support team brought it up on Slack. It was hundreds of thousands by the end of the day. I could not figure out how I fucked up so bad. This is much worse than the original bug. The duplicate records were breaking a lot more functionality across the system and costing a lot more money. But as far as I knew the new code did the same thing.

The code was reverted

To stop the bleeding and figure out what happened.

Accidental Grace

Let’s talk about the app store response. The responses from the app store were parsed and made to conform to a single interface

interface AppStoreResponse {
	active: boolean,
	expiration_date: Date,
}

The thing that really mattered was the expiration_date. An expiration_date in the past meant that the subscriber’s access to the product had expired so active would be false.

One of the app stores had a different rule.

This app store required us to offer a grace period. This meant that active would be true while the expiration_date was in the past.

This meant that the app store response had this specification:

`AppStoreResponse`

Given a request for subscriber information
- If the subscriber is current
  - active is true
  - expiration_date is in the future
- If the subscriber is in a grace period
  - active is true
  - expiration_date is in the past
- If the subscriber is not current
  - active is false
  - expiration_date is in the past.

The specifications for renewing a subscriber and the app store response never matched. Subscribers who were in a grace periods were continually being picked up by the query that picked up subscribers up for renewal. In the initial implementation, since this line

If the invoice_month and invoice_year are the current month and year.

was true they were treated like free trial conversions. The code “worked” I put the word, worked, in quotes above because the code did a half-assed job. Grace periods were never dealt with so we did not keep track of them or have specific logic to deal with them. This left a significant hole in how we dealt with subscribers. We didn’t even know to ask certain questions or do certain things, for example: How many subscribers are currently in a grace period? Is there a huge jump in subscribers in grace periods after the holidays? We also could have only checked the status of subscribers in a grace period after the grace period was over which would avoid unnecessary requests to the app stores. by accident. The new implementation revealed the mismatch in specifications. The SubscriberRenewalService , made a new subscriber record each time. And each the record ended up in an Active state with a renewal_date in the past which were immediately picked up to go back into the SubscriberRenewalService. Which is what happened in the initial implementation(RenewalService) with the exception that new Subscriber records were not created.

The Error of Modular Reasoning

In his blog post, The Three Levels of Software: Why Code that Never goes Wrong Can Still Be Wrong James Koppel summarizes the three levels of software as:

Level 1: Runtime. The runtime level deals with specific values and a specific environment from a single execution of the program. A lot of debugging is done at the runtime level.

Level 2: Concrete Implementation/Code. At the level of the concrete implementation, we think about what the current implementation could do when given arbitrary inputs and an arbitrary environment. Behaviors that cannot happen are not considered, even if it requires global reasoning to rule them out. A lot of implementation work is done at the code level.

Level 3: Design/Logic: At the level of logic, we consider the abstract specification of each unit of a program. When using other units, we only consider the guarantees made by the spec, and assume they may be replaced at any time with a different implementation. Many programs which are correct when viewed at the concrete implementation level are not correct when viewed at the logical level, because they rely on behavior which is not guaranteed to hold in all future versions. We call this an error of modular reasoning, because functions with such errors lack a desirable property: the ability to argue that the function is correct only from the function’s code and from the contracts of the function’s dependencies, without need to even glance at the dependencies. Most software design is done at the level of logic.

The code worked at the runtime level. The code “worked” at the concrete level. The code did not work at the design level.

He puts it another way:

A program is wrong if the reasoning for why it should be correct is flawed.

You’ll notice that I mentioned the code but never show any. This is because this is isn’t really about a specific implementation(“the code”). It’s about the reasoning that produced the code. The code was written with the reasoning that it only had to deal with regular subscribers and free trials. It could have been implemented in many different ways. It would been wrong no matter how it was written.

The new implementation revealed the error of modular reasoning in the initial implementation.

So what?

Talk is cheap. Show me the code.

Linus Torvalds , Source

I started my programming career believing that code was king but code is getting cheaper to produce everyday. A talk with the person who initially wrote the code would have saved me from a lot of confusion, frustration, and about five hundred thousand duplicate records. So actually,

Code is cheap. Let’s talk about it.

The code comes at the end of a long series of steps. In rough outline the steps are, the request, the requirements, turning requirements to acceptance criteria, coming up with a plan, executing that plan. I’ve merged the last two parts in my programming career. I’ve started executing, expecting to come up with a plan as I go aided with the scaffolding of writing tests first. It’s served me reasonably well but in the last couple of years I’ve realized that the “coming up with a plan” part is its own process. The design of software is a thing apart.

What does it mean that the design of software is a thing apart? It means that the information about the design of piece of a software is not in the code. It means that there are assumptions the programmer makes about the inputs to the code and the outputs of the code that are not in the code. The initial implementation of the RenewalService always expected the AppStoreResponse to have an expiration_date in the future if the the subscription was still active. This was nowhere in the code. It was in the mind of the person writing the code.

After you have a design, the next thing to do is to make sure the code reflects the design as much as possible When it can’t be in the code, it can be in a comment. . If the initial implementation of RenewalService threw an error if the AppStoreResponse returned an active subscription with an expiration_date in the past, the design flaw would have been discovered much sooner.

I urge you to spend time designing your program and then make sure your code reflects the design.

Published: 2023-01-13

Last Edited: 2023-01-29