US1 service degradation

Incident Report for Storj DCS

Postmortem

Overview

On May 20, 2024, at approximately 14:56 UTC, the Storj US1 satellite experienced a service disruption caused by a database transaction that was left open during routine maintenance. Only the US1 satellite was affected; the AP1 and EU1 satellites remained fully operational throughout the event.

Root Cause

During a routine, manual database maintenance procedure, a transaction was inadvertently left open while other work was being performed. Open database transactions can cause issues because they hold locks on the affected data, preventing other operations from accessing or modifying that data until the transactions are committed or rolled back. This open transaction eventually caused write operations to fail, and as the situation progressed, read operations were also impacted, leading to a general service disruption.
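
The locking behavior described above can be reproduced on a small scale. The following is a minimal sketch, not a model of the US1 database: this report does not name the database involved, so the example uses SQLite from the Python standard library, whose write lock covers the whole database file rather than individual rows. The table name `objects` and the function name are hypothetical.

```python
import os
import sqlite3
import tempfile


def demo_open_transaction_blocks_writer():
    """Return (blocked_while_open, succeeded_after_commit)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")

        # isolation_level=None turns off implicit transactions, so every
        # BEGIN and COMMIT below is explicit, as in a manual session.
        maintenance = sqlite3.connect(path, isolation_level=None)
        # A short timeout makes the second writer give up quickly
        # instead of waiting for the lock.
        app = sqlite3.connect(path, isolation_level=None, timeout=0.1)
        try:
            maintenance.execute(
                "CREATE TABLE objects (id INTEGER PRIMARY KEY, name TEXT)"
            )

            # The maintenance session opens a write transaction and
            # leaves it open while "other work" happens.
            maintenance.execute("BEGIN IMMEDIATE")
            maintenance.execute(
                "INSERT INTO objects (name) VALUES ('maintenance')"
            )

            # A second writer cannot obtain the write lock.
            try:
                app.execute("INSERT INTO objects (name) VALUES ('upload')")
                blocked = False
            except sqlite3.OperationalError:
                blocked = True

            # Closing the transaction releases the lock; the same write
            # now succeeds.
            maintenance.execute("COMMIT")
            app.execute("INSERT INTO objects (name) VALUES ('upload')")
            count = app.execute("SELECT COUNT(*) FROM objects").fetchone()[0]
            return blocked, count == 2
        finally:
            app.close()
            maintenance.close()


if __name__ == "__main__":
    print(demo_open_transaction_blocks_writer())
```

In this sketch the second writer fails only while the first transaction is open, and recovers as soon as it is committed, which mirrors the recovery seen once the open transaction was closed.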

Impact

The incident affected customers and applications relying on the Storj US1 satellite for storage and retrieval operations. Uploads were the first to be impacted, followed by downloads. The AP1 and EU1 satellites were not affected by this incident and continued to operate normally.

Timeline

14:55 UTC: Routine database maintenance began on the US1 satellite.

14:55 UTC: A transaction was inadvertently left open during the maintenance process.

14:56 UTC: Write operations began to fail due to the open transaction.

14:59 UTC: The on-call team received a page and started investigating the issue.

15:22 UTC: The open transaction was closed, and the team started the recovery process.

15:23 UTC: Operations were restored.

Remediation and Prevention

To address the issue, the open transaction was identified and closed, allowing the satellite to recover and resume normal operation. To prevent similar incidents in the future, we will implement the following measures:

Reviewing and updating our database maintenance procedures and training to ensure that all transactions are properly closed before moving on to other tasks.
Implementing additional monitoring and alerting mechanisms to detect and notify the team of any open transactions that exceed a predetermined duration.
Conducting a thorough postmortem analysis to identify any other potential improvements to our processes and systems, including building tools that ease or eliminate the need for manual maintenance.
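
As an illustration of the second measure, the check for transactions that stay open past a predetermined duration might look like the sketch below. Everything in it is an assumption made for illustration: the `OpenTransaction` record, the function names, and the 30-second threshold are hypothetical, and collecting the list of open transactions depends on the database in use, which this report does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class OpenTransaction:
    """Hypothetical snapshot of one transaction that is still open."""
    txn_id: str
    started_at: datetime
    owner: str


def find_stale_transactions(transactions, now, max_age):
    """Return the transactions open longer than max_age, oldest first."""
    stale = [t for t in transactions if now - t.started_at > max_age]
    return sorted(stale, key=lambda t: t.started_at)


def format_alert(txn, now):
    """Build a one-line alert message for a stale transaction."""
    age_s = int((now - txn.started_at).total_seconds())
    return f"transaction {txn.txn_id} (owner={txn.owner}) open for {age_s}s"


# Example snapshot: one transaction opened at 14:55 UTC and checked at
# 14:56 UTC, plus one short-lived transaction that should not alert.
now = datetime(2024, 5, 20, 14, 56, tzinfo=timezone.utc)
snapshot = [
    OpenTransaction(
        "txn-1",
        datetime(2024, 5, 20, 14, 55, tzinfo=timezone.utc),
        "maintenance",
    ),
    OpenTransaction("txn-2", now - timedelta(seconds=2), "api"),
]

# Assumed threshold of 30 seconds; a real value would be tuned to the workload.
for txn in find_stale_transactions(snapshot, now, timedelta(seconds=30)):
    print(format_alert(txn, now))
```

Keeping the threshold check separate from data collection means the same logic can be exercised against fixed snapshots, independent of how the open transactions are listed in practice.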

We apologize for any impact this service disruption may have caused our customers and users. We are committed to learning from this incident and continuing to improve the reliability and resilience of our platform.

Posted May 23, 2024 - 19:45 UTC

Resolved

This incident has been resolved.

Posted May 20, 2024 - 15:59 UTC

Update

We are continuing to investigate this issue.

Posted May 20, 2024 - 15:31 UTC

Update

We are continuing to investigate this issue.

Posted May 20, 2024 - 15:24 UTC

Investigating

We are currently investigating this issue.

Posted May 20, 2024 - 15:11 UTC

This incident affected: US1 (US1 - API, US1 - Linksharing, US1 - Gateway, US1 - Select).