Zero Downtime Deployment Techniques – Canary Deployments
- Posted by Daitan Technology Council
- On February 10, 2022
- Canary Deployments, DevOps, Zero Downtime Deployment
Deploying a new version of a product to production is one of the most critical moments in the Software Development Lifecycle. It can go from sheer excitement to release the latest features, to a nightmare of cascading failures and outages. In this series of posts, we will explore deployment techniques that can be used to deploy a new version of an application without causing disruption to end users.
It is time to deploy a new version of the product. All the development work is done, the tests have passed and the stakeholders have approved it. Now we need to schedule a maintenance window in the middle of the night on a weekend, stop traffic to all servers, update the software following a lengthy procedure with lots of manual steps, restore traffic to the servers and hope everything works. Sounds risky, right? This has been the standard way of doing deployments for a long time. Fortunately, there are better ways. In this Zero Downtime Deployment Techniques series of posts, we highlight three of the most common ways of deploying software without downtime. In our two previous posts we covered the Rolling Updates and Blue-Green deployment techniques, and now we close the series with Canary deployments.
A Canary deployment is a zero-downtime deployment technique that allows a gradual transition of traffic from the current version of an application to a new version. The transition is usually orchestrated using a data-driven approach, with metrics dictating whether to continue or roll back. It is considered an improvement over Blue-Green deployments, which implement a complete cut-over of traffic to the new version.
How it Works
As in a Blue-Green deployment, the new version of an application is deployed and tested in a separate environment, while the original environment continues to handle production traffic. The difference is that after the new environment is validated, only a small fraction of the total traffic is redirected to the new version at first.
There are different strategies for choosing which traffic goes to the new version first. You can evenly distribute requests, choose “canary” users at random or select from a group of beta testers. This last approach might lower the marketing impact of failures, as users opting to be beta testers are usually more inclined to accept problems, and can even help developers to detect and find solutions for these problems.
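As a rough illustration of these strategies, the Python sketch below assigns each user to a version by hashing the user ID, so the same user keeps seeing the same version while the canary fraction grows. The function names, the salt value and the beta-tester set are hypothetical and are not taken from any specific load balancer or service mesh. In practice this decision is usually made by the routing layer.

```python
import hashlib

BUCKETS = 10_000  # resolution of the split: one bucket is 0.01% of users


def canary_bucket(user_id: str, salt: str = "example-release") -> int:
    """Map a user ID to a stable bucket in the range [0, BUCKETS)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def choose_version(user_id: str, canary_fraction: float,
                   beta_testers: frozenset = frozenset()) -> str:
    """Return "canary" or "stable" for a given user.

    Beta testers (an assumed opt-in group) always get the canary.
    Everyone else is split by bucket, so the choice is deterministic.
    """
    if user_id in beta_testers:
        return "canary"
    if canary_bucket(user_id) < canary_fraction * BUCKETS:
        return "canary"
    return "stable"
```

Because a user's bucket never changes for a given salt, raising canary_fraction only adds users to the canary group. Nobody is flipped back to the old version while the rollout progresses.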
The application is continually monitored for failures in the new version. Metrics like HTTP errors, latency, timeouts or application specific metrics can be used to decide if traffic should be increased on the new version or if it should be rolled back.
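The decision described above can be sketched as a small function that compares the canary against the current version. This is only a minimal example: the VersionMetrics type, the thresholds and the three possible outcomes are assumptions chosen for illustration, and real systems usually compare more signals over a longer observation window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionMetrics:
    requests: int
    errors: int
    p99_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def evaluate_canary(stable: VersionMetrics, canary: VersionMetrics,
                    min_requests: int = 500,
                    max_error_rate_increase: float = 0.01,
                    max_latency_ratio: float = 1.2) -> str:
    """Return "wait", "rollback" or "promote" (illustrative thresholds)."""
    # Not enough canary traffic yet to draw a conclusion.
    if canary.requests < min_requests:
        return "wait"
    # The canary fails noticeably more often than the current version.
    if canary.error_rate > stable.error_rate + max_error_rate_increase:
        return "rollback"
    # The canary is noticeably slower than the current version.
    if canary.p99_latency_ms > stable.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"
```

Comparing the canary against the current version, rather than against fixed limits, keeps the check meaningful when the whole system is having a bad day for unrelated reasons.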
After all the traffic is flowing to the new version, the deployment is complete and the old environment can be terminated or kept for some time in case a rollback is necessary. There are several ways to remove the routes to the old version, but the best practice is to let the current sessions on the old version finish gracefully instead of abruptly terminating them during the cutover.
By carefully controlling the traffic flow between the current and new version and monitoring for potential problems, the blast radius of a problem caused by the new version can be reduced. The reduced impact of failures can give the team more confidence to adopt techniques like Continuous Deployment.
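Put together, the rollout is a loop over increasing traffic weights with a health check after each step. The sketch below assumes two hypothetical callbacks: set_canary_weight, which would talk to your load balancer or service mesh, and is_healthy, which would observe the metrics for a while before answering. Neither corresponds to a real API.

```python
from typing import Callable, Sequence


def run_rollout(steps: Sequence[int],
                set_canary_weight: Callable[[int], None],
                is_healthy: Callable[[int], bool]) -> bool:
    """Shift traffic to the canary step by step.

    steps holds the canary traffic percentages to apply in order,
    for example [1, 5, 25, 50, 100]. Returns True if the rollout
    reached the last step, False if it was rolled back.
    """
    for weight in steps:
        set_canary_weight(weight)
        if not is_healthy(weight):
            # Send all traffic back to the current version and stop.
            set_canary_weight(0)
            return False
    return True
```

With steps of 1, 5, 25, 50 and 100 percent, a defect that only shows up under real traffic is likely to be caught while it affects a small share of users, which is what keeps the blast radius small.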
When implementing this technique, there can be some challenging situations depending on the nature of your workload. In this section, we will list common problems and possible solutions. Most of these challenges are also present in the Rolling Updates and Blue-Green Deployment techniques.
As with a Blue-Green deployment, you will need to keep two separate instances of your infrastructure up at the same time, at least for the duration of the deployment and possibly longer, to allow old sessions to drain or to enable a quick rollback to the previous version. With Canary deployments, however, you can use auto-scaling policies to gradually scale up the new version and scale down the current version.
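To give a feeling for how capacity can follow the traffic split, here is a small sketch that sizes both environments from the canary weight. The function and its min_per_env parameter are hypothetical. A real auto-scaling policy would normally react to load metrics such as CPU or request rate rather than to a fixed formula.

```python
from typing import Tuple


def desired_instances(total_instances: int, canary_weight_percent: int,
                      min_per_env: int = 1) -> Tuple[int, int]:
    """Return (current_version_instances, canary_instances).

    Each environment gets a share of total_instances proportional to
    its traffic weight, rounded up, and never less than min_per_env.
    """
    def share(percent: int) -> int:
        # Integer ceiling of total_instances * percent / 100.
        return (total_instances * percent + 99) // 100

    canary = max(min_per_env, share(canary_weight_percent))
    current = max(min_per_env, share(100 - canary_weight_percent))
    return current, canary
```

Keeping min_per_env at 1 leaves one instance of the current version running even at 100% canary traffic, which trades a small cost for a faster rollback.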
If this technique is used to deploy stateful applications, transient information that is stored in the instance (like user sessions, cached files, etc.) might be lost when traffic switches to the new instances.
If it is not feasible to store this information outside of the instance, a possible solution is to keep the current environment up until it finishes processing ongoing requests or sessions. New requests and sessions will be routed to the new environment.
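The draining approach can be sketched as a router that remembers which environment owns each session. The DrainingRouter class below is a hypothetical in-memory example. Real load balancers typically achieve the same effect with session affinity (sticky sessions) and connection draining settings.

```python
class DrainingRouter:
    """Keep existing sessions on the old environment while it drains."""

    def __init__(self) -> None:
        self._sessions = {}  # session ID -> "old" or "new"
        self.draining = False

    def start_draining(self) -> None:
        """From now on, sessions not seen before go to the new environment."""
        self.draining = True

    def route(self, session_id: str) -> str:
        # A known session stays where its state lives.
        if session_id in self._sessions:
            return self._sessions[session_id]
        environment = "new" if self.draining else "old"
        self._sessions[session_id] = environment
        return environment

    def end_session(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)

    def old_environment_idle(self) -> bool:
        """True once draining has started and no old sessions remain."""
        return self.draining and "old" not in self._sessions.values()
```

Once old_environment_idle() returns True, the old environment no longer holds any session state and can be shut down without users noticing.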
Database changes must be handled with extra care to ensure that they work with the current and new versions of the application. It is very important to test the new schema with both versions because they will be running concurrently. We will have a separate post addressing techniques to ensure database schema backward and forward compatibility.
In order to support both versions running at the same time, the current version of the application must be able to work with data created by the new version. This usually means that the current version must be able to handle extra fields in a database table or event schema without crashing. For instance, if an API changes and includes an extra field, it is important to ensure that the previous version will still work with the new API format. Also consider using feature flags to decouple the release of a feature from the deployment time.
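One common way to get this behavior is for the current version to read only the fields it knows and ignore the rest, an approach often called the tolerant reader pattern. The sketch below shows the idea with a hypothetical Order record. The field names are invented for the example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Order:
    """Fields known to the current version of the application."""
    order_id: str
    amount_cents: int


def parse_order(payload: dict) -> Order:
    """Build an Order from a payload, ignoring fields we do not know.

    A payload written by the new version may carry extra keys (for
    example a "currency" field). They are dropped instead of causing
    an error. Missing required fields still raise a TypeError.
    """
    known = {f.name for f in fields(Order)}
    return Order(**{key: value for key, value in payload.items() if key in known})


if __name__ == "__main__":
    new_version_payload = {"order_id": "A1", "amount_cents": 500, "currency": "USD"}
    print(parse_order(new_version_payload))
```

A test that feeds payloads produced by the new version into the current version's parser is a cheap way to check forward compatibility before any traffic is shifted.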
Most of the value of using this technique comes from the ability to decide to move forward with the deployment based on real time metrics for the new version. This requires the ability to separate metrics for the current and new versions of the application. Reaching this level of observability maturity may be a challenge in some cases.
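At a minimum, separating metrics means tagging every measurement with the version that produced it. The sketch below keeps the counters in memory to show the idea. The VersionedCounters class is hypothetical. With a real monitoring system you would typically attach the version as a label or tag on each metric instead.

```python
from collections import Counter


class VersionedCounters:
    """Count requests and server errors separately for each version."""

    def __init__(self) -> None:
        self._counts = Counter()

    def record(self, version: str, status_code: int) -> None:
        self._counts[(version, "requests")] += 1
        # Treat HTTP 5xx responses as failures of the application.
        if status_code >= 500:
            self._counts[(version, "errors")] += 1

    def error_rate(self, version: str) -> float:
        requests = self._counts[(version, "requests")]
        if requests == 0:
            return 0.0
        return self._counts[(version, "errors")] / requests
```

If both versions wrote to the same counter, a canary receiving 1% of the traffic could fail on every request and still move the overall error rate by only one percentage point, which is easy to miss.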
When to Use
The Canary Deployment technique is considered a step forward from the Blue-Green Deployment and Rolling Updates techniques and should be used when you have a high level of automation and observability of the application.
This technique provides an additional layer of resilience to your deployments by automatically monitoring the failure rate of the new version under real traffic and automatically rolling back if necessary. It should not be used if your monitoring is not mature enough to identify when your application is misbehaving, or if you do not trust the automation to roll back a version automatically.
Given the routing and data flow challenges, the adoption of this technique must be carefully considered and planned. A gradual migration of individual services is recommended. It is usually better to start with a Blue-Green Deployment, improve monitoring and observability and then implement the Canary routing strategies.
The Canary deployment strategy is an advanced technique that provides zero downtime deployment and a safe way to gradually deploy a new version, making decisions based on metrics. A high level of automation and observability is necessary to support the real time decisions. While implementing this technique, keep in mind these key points:
- Introduce tests to ensure the application is forward compatible. It may be necessary to change code to support extra parameters on database tables and event schemas.
- Make sure your monitoring can segregate metrics for the new version and detect failures on it.
- Test your deployment logic by introducing synthetic failures during a deployment and verifying that the system reacts accordingly.
- Use auto-scaling to minimize the costs of having two environments up during the deployment.
This is the last article in a series written by Isac Sacchi e Souza, Principal DevOps Specialist, Systems Architect & member of the Daitan Technology Council. Thanks to João Augusto Caleffi, João Sávio Ceregatti Longo and the SRE/DevOps Community of Practice for reviews and insights.