A system never goes down without a "change" of some kind, e.g. a code release, a traffic overload, or an external dependency going down.
Of all these kinds of change, human-made changes are responsible for over 80% of incidents, because humans are not machines and make mistakes all the time :)
So when planning a change request for a production operation (without a perfect, automated pipeline), what strategies can we use to minimize the risk and the impact on our customers?
Here are three tips for planning a change request:
- Change gradually, for example: internal users -> 1% of customers -> 5% of customers -> ...
- Verify the change result through dashboards or logs
- Ensure that the change can be swiftly rolled back
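The three tips above can be sketched as a small loop: apply the change to one audience at a time, check its health, and roll back on the first failed check. This is only a minimal illustration, not a real deployment tool. The function `staged_rollout` and its callbacks (`apply_stage`, `is_healthy`, `rollback`) are hypothetical names. In practice they would wrap whatever your platform offers for releasing, reading dashboards or logs, and reverting.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[str],
    apply_stage: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[], None],
) -> bool:
    """Apply a change stage by stage; roll back on the first failed check.

    All callbacks are hypothetical hooks supplied by the caller.
    Returns True if every stage passed, False if the change was rolled back.
    """
    for stage in stages:
        apply_stage(stage)         # tip 1: expose the change to this audience only
        if not is_healthy(stage):  # tip 2: verify via dashboards or logs
            rollback()             # tip 3: revert quickly
            return False
    return True


# Usage with in-memory stand-ins; later stages are left out, as in the list above.
log: List[str] = []
ok = staged_rollout(
    stages=["internal users", "1% of customers", "5% of customers"],
    apply_stage=lambda s: log.append(f"apply: {s}"),
    is_healthy=lambda s: s != "5% of customers",  # simulate a failed check at 5%
    rollback=lambda: log.append("rollback"),
)
```

In this run the simulated check fails at the 5% stage, so the rollback fires before any wider audience sees the change.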
Change Request Risk-Minimization Model:
P.S. For different services, we can adjust the pace of change based on the level of risk we are willing to accept.
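One way to make that pace explicit is a small lookup from risk acceptance to a rollout plan. The sketch below is purely illustrative. The tier names, the percentages, and the soak times are placeholder assumptions, not recommended values.

```python
# Hypothetical plans keyed by how much risk a service can accept.
# Every number here is a placeholder chosen for illustration only.
PACE_BY_RISK_ACCEPTANCE = {
    "high": {"stages_percent": [5, 50, 100], "soak_minutes": 10},
    "low": {"stages_percent": [1, 5, 25, 50, 100], "soak_minutes": 60},
}


def plan_for(risk_acceptance: str) -> dict:
    """Return the plan for a tier, falling back to the most cautious one."""
    return PACE_BY_RISK_ACCEPTANCE.get(
        risk_acceptance, PACE_BY_RISK_ACCEPTANCE["low"]
    )


def total_soak_minutes(plan: dict) -> int:
    """Minimum rollout duration if every stage soaks before the next begins."""
    return plan["soak_minutes"] * len(plan["stages_percent"])
```

Defaulting unknown tiers to the slowest plan keeps a misconfigured service on the cautious path.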