Investigating Issue
Incident Report for Nitro Sign
Postmortem

Overview

On 19 September, a deploy of the service that stores and fetches documents for Nitro Sign caused a brief disruption in which some customers were unable to download their documents.

What Happened

A new version of the document-handling service was deployed. It stored document metadata using a schema that contained a breaking change.

Resolution

After evaluating the changes, the team decided to skip the remainder of the canary phase and roll out the new version to all nodes.

Root Causes

The JSON used to store metadata in DynamoDB was changed in a backward-incompatible way. As a result, data written by the newly deployed version (in canary) could not be read by the old version, which was still serving most of the customer traffic. In particular, the problem occurred when a user created a document through the new version.
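
The report does not state which fields changed, so the following is only a hypothetical sketch of this kind of failure. The field names (`document_id`, `file_name`, `file`) and both functions are assumptions made for illustration, not the actual document-files schema or code.

```python
import json


def read_metadata_old(raw: str) -> dict:
    """Hypothetical OLD reader: expects a flat "file_name" key."""
    item = json.loads(raw)
    return {"document_id": item["document_id"], "file_name": item["file_name"]}


def write_metadata_new(document_id: str, file_name: str) -> str:
    """Hypothetical NEW writer: moves the same value under a nested "file" key."""
    return json.dumps({"document_id": document_id, "file": {"name": file_name}})


def old_node_can_read(raw: str) -> bool:
    """Return True if the old reader can parse the stored record."""
    try:
        read_metadata_old(raw)
        return True
    except KeyError:
        # The key the old reader depends on is no longer present.
        return False
```

In this sketch, a record written by the new writer makes `old_node_can_read` return `False`, while a record in the old flat layout is still readable. That mirrors the situation where canary nodes write data that the rest of the fleet cannot load.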

Impact

For about 50 minutes, some users were unable to load their newly signed documents.

What Went Well?

  • Quickly pinpointed the root cause.

What Didn't Go So Well?

  • Breaking change to a data schema.

Action Items

  • The team will find a way to version the data written to our persistent store so that schema changes no longer break existing readers.
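
One possible shape for this action item, offered as a minimal sketch and not as the team's chosen design: each stored record carries an explicit schema version, and readers select a decoder by version. The `schema_version` attribute, the field names, and every function below are hypothetical.

```python
import json

# Version that writers in this sketch stamp on every new record (assumption).
CURRENT_SCHEMA_VERSION = 2


class UnsupportedSchemaVersion(Exception):
    """Raised when a record uses a version this reader does not know."""


def encode_metadata(document_id: str, file_name: str) -> str:
    """Serialize metadata in the current layout, tagged with its version."""
    return json.dumps({
        "schema_version": CURRENT_SCHEMA_VERSION,
        "document_id": document_id,
        "file": {"name": file_name},
    })


def _from_v1(item: dict) -> dict:
    # Hypothetical v1 layout: flat "file_name" key.
    return {"document_id": item["document_id"], "file_name": item["file_name"]}


def _from_v2(item: dict) -> dict:
    # Hypothetical v2 layout: name nested under "file".
    return {"document_id": item["document_id"], "file_name": item["file"]["name"]}


DECODERS = {1: _from_v1, 2: _from_v2}


def decode_metadata(raw: str) -> dict:
    """Decode a stored record into one in-memory shape, whatever its version."""
    item = json.loads(raw)
    # Records written before versioning existed are treated as version 1.
    version = item.get("schema_version", 1)
    decoder = DECODERS.get(version)
    if decoder is None:
        raise UnsupportedSchemaVersion(f"unknown schema_version: {version!r}")
    return decoder(item)
```

A versioned reader only helps if it reaches every node before any writer starts emitting the new version. A two-step rollout (first ship readers that understand both versions, then switch the writers) would keep a canary from writing records the rest of the fleet cannot load.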

Timeline

  • @ 11:58 A new version of document-files is deployed that changes how we store document metadata in DynamoDB
  • @ 11:59 As soon as the canary was out, the synthetic check ui-sign-sla prod started to fail
  • @ 12:02 The AllOps engineer gets paged and immediately ACKs the alert
  • @ 12:12 The AllOps engineer identified the failing component and contacted the team responsible for it
  • @ 12:40 The team recognised that an ongoing canary deploy of document-files was the probable cause. The old nodes appeared unable to read data written by the canary (still to be confirmed).
  • @ 12:42 The team decided to speed up the canary process and promote the new version to all nodes
  • @ 12:45 The deployment finished and the functionality was restored
  • @ 12:50 Incident resolved
Posted Sep 19, 2022 - 12:25 UTC

Resolved
The investigation is complete and service is fully available.
Posted Sep 19, 2022 - 11:45 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Sep 19, 2022 - 11:32 UTC
Investigating
We are investigating a possible service disruption with Nitro Sign. Stay tuned for further updates.
Posted Sep 19, 2022 - 11:02 UTC