
Introduction
We should design and build systems so that any change can be easily reverted, even if the people who worked on it are not available at the time. Imagine, for example, that a change starts causing memory issues on a Saturday evening. You want to be confident that the support teams can revert to a previous version without downtime and without involving multiple teams. The practices below are easy to implement and will give you peace of mind.
API design
API endpoints and their request and response bodies should not be designed around a specific consumer (e.g. endpoints that return data filtered for one particular case), because they may soon need to change to support other consumers. They should also hide implementation details, so that even the underlying infrastructure can change (e.g. retrieving data from Redis or Elasticsearch instead of from a database) without impacting any consumer.
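As a minimal sketch of hiding implementation details, the handler below depends only on a small store interface, so the backing store could be swapped without changing the response body. All names here (Order, OrderStore, InMemoryOrderStore, the "sts" and "shard" columns) are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, asdict
from typing import Protocol


@dataclass(frozen=True)
class Order:
    # Public shape of the resource; it contains no storage-specific fields.
    order_id: str
    status: str


class OrderStore(Protocol):
    def get(self, order_id: str) -> Order: ...


class InMemoryOrderStore:
    """Hypothetical stand-in for a database-backed store."""

    def __init__(self, rows: dict[str, dict]) -> None:
        self._rows = rows

    def get(self, order_id: str) -> Order:
        row = self._rows[order_id]
        # Internal names such as "sts" or "shard" never reach consumers.
        return Order(order_id=order_id, status=row["sts"])


def get_order_response(store: OrderStore, order_id: str) -> dict:
    # The handler only knows the OrderStore interface, so a Redis- or
    # Elasticsearch-backed store could replace this one transparently.
    return asdict(store.get(order_id))


store = InMemoryOrderStore({"42": {"sts": "SHIPPED", "shard": 3}})
response = get_order_response(store, "42")
```

A different store only has to return the same Order shape for consumers to remain unaffected.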
Breaking API changes to avoid
API releases should not contain breaking changes such as adding new mandatory fields to an endpoint or changing the format of the response. Such changes may require updating multiple consumers, and releasing all of them at the same time may not be straightforward, especially if they are managed by different teams. If these types of changes are unavoidable, consider using versioning.
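One way to add a field without breaking existing consumers is to make it optional with a default. The sketch below assumes a hypothetical CreateUserRequest body with a new "locale" field; the field names and the parsing helper are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CreateUserRequest:
    name: str
    email: str
    # New in this release. Optional with a default, so requests from
    # consumers that have not been updated yet are still accepted.
    locale: Optional[str] = None


def parse_create_user(body: dict) -> CreateUserRequest:
    # Only the fields that were already mandatory are enforced.
    missing = [field for field in ("name", "email") if field not in body]
    if missing:
        raise ValueError(f"missing mandatory fields: {missing}")
    return CreateUserRequest(
        name=body["name"],
        email=body["email"],
        locale=body.get("locale"),
    )


old_client = parse_create_user({"name": "Ada", "email": "ada@example.com"})
new_client = parse_create_user(
    {"name": "Ada", "email": "ada@example.com", "locale": "en-GB"}
)
```

Rolling back this release is also safe, provided the previous version ignores fields it does not know about.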
Versioning
Versioning the URLs allows an old and a new version of the API to run in parallel until all consumers have had time to update. This can be done by duplicating endpoints inside the same API or by deploying separate instances for the old and new versions. The latter approach is usually easier to maintain, because code that contains two versions can get complicated, and it also makes it easier to release a change to either version, as they can evolve separately.
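The sketch below illustrates URL versioning with a plain routing table, assuming hypothetical /v1/users and /v2/users endpoints. It shows both versions inside one process for brevity; with separate deployments, the same prefix-to-handler mapping would typically live in a gateway or load balancer instead.

```python
from typing import Callable


def get_user_v1(user_id: str) -> dict:
    # Old response format, kept until every consumer has migrated.
    return {"id": user_id, "name": "Ada Lovelace"}


def get_user_v2(user_id: str) -> dict:
    # New response format; a breaking change relative to v1.
    return {"id": user_id, "first_name": "Ada", "last_name": "Lovelace"}


ROUTES: dict[str, Callable[[str], dict]] = {
    "/v1/users": get_user_v1,
    "/v2/users": get_user_v2,
}


def dispatch(path: str) -> dict:
    # Split "/v1/users/7" into the versioned prefix and the user id.
    prefix, _, user_id = path.rpartition("/")
    handler = ROUTES.get(prefix)
    if handler is None:
        raise LookupError(f"no route for {path}")
    return handler(user_id)
```

Once no consumer calls the /v1 prefix anymore, its route and handler can be removed.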
Contract testing
Contract tests detect breaking changes between APIs, so you are automatically notified if a change in a service impacts any of its consumers. This is very useful before and during releases, as these tests can run as part of the pipelines and abort them if they find a breaking change.
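To make the idea concrete, here is a deliberately simplified, hand-rolled check rather than a real contract-testing tool. It assumes the consumer publishes the fields and types it relies on (the hypothetical CONSUMER_CONTRACT) and that the provider's pipeline verifies its responses against them.

```python
# Fields and types a hypothetical consumer relies on.
CONSUMER_CONTRACT = {"id": str, "status": str}


def contract_violations(contract: dict, response: dict) -> list[str]:
    """Return a description of every way the response breaks the contract."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            actual = type(response[field]).__name__
            problems.append(
                f"wrong type for {field}: "
                f"expected {expected_type.__name__}, got {actual}"
            )
    return problems


# Extra fields are fine: the consumer does not depend on them.
compatible = {"id": "42", "status": "SHIPPED", "carrier": "DHL"}
# A renamed field and a changed type are both breaking changes.
breaking = {"id": 42, "state": "SHIPPED"}
```

A pipeline step could fail the release whenever the returned list is not empty.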
Database changes
Some database changes are difficult to roll back, e.g. deleting a column or changing its content or type.
Column deletion
Imagine deleting a column and having to roll back the following day. Not only would you have to recover the data from a backup, but you would also be missing the values from the last few hours. Instead, the deletion can be done in three steps:
1. Modify the code so that it keeps populating the column but no longer reads it. This way, the values will still be present if a rollback is needed.
2. Make the column optional and remove any remaining usage in the code so that it is no longer populated. This can help detect unexpected issues, e.g. an old report that nobody knows about and that still uses that table.
3. Do the actual column deletion.
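The first of these steps could look roughly like the sketch below, which uses an in-memory SQLite database. The customers table and its "fax" column are hypothetical, and load_customer_previous_release only simulates what the rolled-back code would do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, fax TEXT)"
)


def save_customer(name: str, fax: str) -> int:
    # Step 1: keep writing the column being retired, so a rollback to the
    # previous release still finds up-to-date values.
    cur = conn.execute(
        "INSERT INTO customers (name, fax) VALUES (?, ?)", (name, fax)
    )
    conn.commit()
    return cur.lastrowid


def load_customer(customer_id: int) -> dict:
    # Step 1: the new release no longer reads "fax".
    row = conn.execute(
        "SELECT id, name FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    return {"id": row[0], "name": row[1]}


def load_customer_previous_release(customer_id: int) -> dict:
    # Simulates the rolled-back code, which still expects "fax".
    row = conn.execute(
        "SELECT id, name, fax FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    return {"id": row[0], "name": row[1], "fax": row[2]}


new_id = save_customer("Ada", "555-0100")
```

Because the column is still populated, the previous release keeps working on rows written by the new one.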
Column change
Imagine you change an enum value or a date format and it breaks some critical month-end reports. It would be quite hard for the ops teams to figure out how to roll back. As in the previous case, these changes can be done in multiple steps:
1. Create a column with the new format, and modify the code to write to both the new and old columns and read only from the new one.
2. Make the old column optional and modify the code so that it no longer writes to it.
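As an illustration of the first step, the sketch below writes a date in both the old and the new format and reads only the new one. The invoices table, the column names and the DD/MM/YYYY legacy format are assumptions made for the example.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    "id INTEGER PRIMARY KEY, issued_on TEXT, issued_on_iso TEXT)"
)


def save_invoice(issued: date) -> int:
    # Dual write: the old column keeps its legacy format so that existing
    # readers and a rolled-back release continue to work.
    cur = conn.execute(
        "INSERT INTO invoices (issued_on, issued_on_iso) VALUES (?, ?)",
        (issued.strftime("%d/%m/%Y"), issued.isoformat()),
    )
    conn.commit()
    return cur.lastrowid


def load_issue_date(invoice_id: int) -> date:
    # The new code reads only from the new column.
    (value,) = conn.execute(
        "SELECT issued_on_iso FROM invoices WHERE id = ?", (invoice_id,)
    ).fetchone()
    return date.fromisoformat(value)


invoice_id = save_invoice(date(2024, 1, 31))
legacy_value = conn.execute(
    "SELECT issued_on FROM invoices WHERE id = ?", (invoice_id,)
).fetchone()[0]
```

Anything that still reads the old column, such as a month-end report, sees the format it has always seen.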
Queue/topic changes
Messages often have to be modified to add more fields, so consumer code should be configured to ignore fields it does not expect, which lets new and old consumers of the message coexist. Beyond this, if a field needs to change, a new field can be added and the code modified to read only from the new one. In this case, it may not be viable to delete old fields, as a topic may have several consumers that may want to reprocess all the messages, e.g. when Kafka topics are used as a single source of truth.
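A consumer that tolerates unknown fields might look like the sketch below. The message shape is hypothetical: "amount_cents" stands for the new field and "amount" for the old one. The fallback to the old field when reprocessing older messages is an assumption of this example, not something the approach above requires.

```python
import json
from decimal import Decimal

# Fields this consumer understands; anything else is ignored.
KNOWN_FIELDS = ("order_id", "amount_cents")


def parse_order_event(raw: str) -> dict:
    payload = json.loads(raw)
    # Tolerant reader: pick the known fields and silently skip the rest.
    event = {name: payload[name] for name in KNOWN_FIELDS if name in payload}
    if "amount_cents" not in event and "amount" in payload:
        # Assumption: messages published before the change carry only the
        # old "amount" field (a decimal string), so convert it on replay.
        event["amount_cents"] = int(Decimal(payload["amount"]) * 100)
    return event


new_message = (
    '{"order_id": "42", "amount": "12.50", '
    '"amount_cents": 1250, "coupon": "SPRING"}'
)
old_message = '{"order_id": "41", "amount": "12.50"}'
```

Keeping the old field in new messages means consumers that have not been updated can keep reading them.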
Summary
We have seen some practices that will make your deployments more robust and ensure that support teams can easily roll back and restore the system to a previous state without involving product teams outside normal working hours. Do you follow these or similar practices? If so, I would love to read about them in the comments.
