On the 19th of September, a deploy of the service that stores and fetches documents for NitroSign caused a brief disruption during which some customers were unable to download their documents.
A new version of the document-handling service was deployed that stored document metadata with a breaking change to its schema.
After evaluating the changes, the team decided to skip the remainder of the canary phase and roll out the new version to all nodes.
A breaking change in the JSON used to store metadata in DynamoDB. Data written by the newly deployed version (in canary) broke the old version, which was still taking much of the customer traffic. In particular, the problem manifested when a user created a document using the new version.
For ~50 minutes, some users failed to load their newly signed documents.
What Went Well?
- Quickly pinpointed the root cause
What Didn't Go So Well?
- Breaking change to a data schema.
- The team will figure out a way to version data written to our persistent store so that schema changes do not break readers.
- @ 11:58 A new version of document-files was deployed that changed how we store document metadata in DynamoDB
- @ 11:59 As soon as the canary was out, the synthetic: ui-sign-sla prod started to fail
- @ 12:02 The AllOps engineer was paged and immediately ACKed the alert
- @ 12:12 The AllOps engineer identified the culprit and contacted the team responsible for the failing component
- @ 12:40 The team recognised that an ongoing canary deploy of document-files was the probable issue. Somehow the old nodes were not able to read data written by the canary (still to confirm).
- @ 12:42 The team decided to speed up the canary process and promote to all nodes
- @ 12:45 The deployment finished and the functionality was restored
- @ 12:50 Incident resolved
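
The action item on versioning data written to the persistent store could look roughly like the sketch below. It is only an illustration under stated assumptions: the field names (`schema_version`, `name`, `size`, `file`, `size_bytes`) and both layouts are hypothetical, since the actual document-files metadata schema is not described in this report. The idea is that every record carries an explicit version tag and the reader dispatches on it, so old and new layouts can coexist in the store during a canary.

```python
import json

# Hypothetical version number for the newest metadata layout.
CURRENT_SCHEMA_VERSION = 2


def encode_metadata(name: str, size_bytes: int) -> str:
    """Serialize metadata in the newest (hypothetical) layout, tagged with its version."""
    return json.dumps({
        "schema_version": CURRENT_SCHEMA_VERSION,
        "file": {"name": name, "size_bytes": size_bytes},
    })


def decode_metadata(raw: str) -> dict:
    """Parse a stored record of any known version into one in-memory shape."""
    record = json.loads(raw)
    # Assumption: records written before versioning have no tag and count as v1.
    version = record.get("schema_version", 1)
    if version == 1:
        # Hypothetical legacy layout: flat fields.
        return {"name": record["name"], "size_bytes": record["size"]}
    if version == 2:
        # Hypothetical new layout: fields nested under "file".
        nested = record["file"]
        return {"name": nested["name"], "size_bytes": nested["size_bytes"]}
    # Fail with a clear error instead of misreading an unknown layout.
    raise ValueError(f"unsupported metadata schema version: {version}")
```

With a scheme like this, a reader that understands the new version would need to be running on all nodes before any writer starts emitting it; otherwise old nodes would still reject canary-written records, only with a clearer error.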