US1 - Upload and Download Request Failures

Incident Report for Storj DCS

Postmortem

Summary of the Storj US1 Service Disruption on August 7, 2025

Overview:

On August 7, 2025, at approximately 21:04 UTC, the Storj US1 satellite experienced a performance degradation due to an unusually high volume of concurrent uploads to the same objects. The incident affected only the US1 satellite, while the AP1 and EU1 satellites remained fully operational throughout the event.

Root Cause:

The primary root cause was an exceptionally high volume of simultaneous uploads targeting the same objects, which created a bottleneck in the database due to transaction contention. This led to a cascade of issues, including database locking, request timeouts, and connection pool exhaustion. Database transactions hold locks on the data they touch, preventing other operations from accessing or modifying that data until the transaction commits or rolls back. As long-running transactions accumulated, requests began to time out.

Impact:

The incident affected customers and applications relying on the Storj US1 satellite for storage and retrieval operations. The AP1 and EU1 satellites were not affected by this incident and continued to operate normally.

Timeline:

21:04 UTC: Database transaction locks started to increase.

21:14 UTC: The on-call team received a page and started investigating the issue.

21:55 UTC: A fix was implemented and the on-call team started monitoring the results.

22:31 UTC: The on-call team started investigating another increase in error rates.

22:45 UTC: Error rates trended back down to normal and the on-call team continued to monitor for further issues.

23:05 UTC: Operations were fully restored.

Remediation and Prevention:

To address the issue, we periodically reset the state of connections, thus avoiding a cascading growth of contention-related errors. To prevent similar incidents from occurring in the future, we implemented the following measures:

  1. Identified and resolved sources of contention between database transactions by implementing appropriate code updates.
  2. Implemented additional monitoring and alerting mechanisms to detect and notify the team of increased levels of database locks.
  3. Conducted a thorough post-mortem analysis to identify other potential improvements to our processes and systems, including building tools that ease or eliminate the need for manual maintenance.
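
The periodic connection reset described above can be sketched as a pool that retires connections after a maximum lifetime, so a connection dragging stale lock state cannot linger indefinitely. This is a minimal illustration of the general technique, not Storj's implementation; the class and parameter names are hypothetical:

```python
import time
from collections import deque

class RecyclingPool:
    """Hypothetical pool that replaces connections past a maximum lifetime."""

    def __init__(self, factory, max_lifetime=300.0):
        self.factory = factory          # callable that opens a new connection
        self.max_lifetime = max_lifetime
        self.idle = deque()             # (connection, created_at) pairs

    def acquire(self):
        # Reuse an idle connection only while it is younger than max_lifetime;
        # otherwise close it and open a fresh one, resetting its state.
        while self.idle:
            conn, created = self.idle.popleft()
            if time.monotonic() - created < self.max_lifetime:
                return conn, created
            conn.close()
        return self.factory(), time.monotonic()

    def release(self, conn, created):
        self.idle.append((conn, created))

# Demonstration with a stand-in connection type.
class FakeConn:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

pool = RecyclingPool(FakeConn, max_lifetime=0.0)  # recycle immediately, for demo
c1, t1 = pool.acquire()
pool.release(c1, t1)
c2, _ = pool.acquire()  # c1 exceeded its lifetime and was replaced
print(c2 is c1)         # False
print(c1.closed)        # True
```

In production the lifetime would be minutes rather than zero; the point is that a periodic reset bounds how long any one connection can accumulate contention-related state.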

We apologize for any impact this service disruption may have caused our customers and users. We are committed to learning from this incident and continuing to improve the reliability and resilience of our platform.

Posted Aug 22, 2025 - 19:25 UTC

Resolved

This incident has been resolved.
Posted Aug 07, 2025 - 23:05 UTC

Update

Error rates are normal and we are continuing to monitor for further issues.
Posted Aug 07, 2025 - 22:45 UTC

Update

We are investigating another increase in error rates.
Posted Aug 07, 2025 - 22:31 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Aug 07, 2025 - 21:55 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Aug 07, 2025 - 21:30 UTC

Investigating

We are investigating upload/download failures on US1.
Posted Aug 07, 2025 - 21:25 UTC
This incident affected: US1 (US1 - Linksharing, US1 - Gateway, US1 - Select).