Summary of events (times in PDT):
Rationale for the database operation:
The motivation for this database operation was to add support for contactless deliveries. We will ship this feature once we have a solid plan in place for performing the DB operation safely.
Understanding the follow-on effects:
Recovery of database performance was hindered by a series of suboptimal query patterns, which poisoned the cache and increased disk load on the primary. These queries came primarily from API clients using the GET /tasks endpoint. The requests could time out for the client while the underlying queries kept running on the database.
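To illustrate the gap between a request timing out and its query continuing to run, here is a minimal sketch of tying a query to the request's time budget. It is not our production code. It uses Python's built-in sqlite3 module purely as a stand-in database, and the names RequestDeadline, run_with_deadline, and SLOW_QUERY are hypothetical helpers invented for this example.

```python
import sqlite3
import time


class RequestDeadline:
    """Hypothetical helper: tracks when a request's time budget runs out."""

    def __init__(self, timeout_seconds):
        self.expires_at = time.monotonic() + timeout_seconds

    def expired(self):
        return time.monotonic() >= self.expires_at


def run_with_deadline(conn, sql, deadline, params=()):
    """Run a query, aborting it once the request deadline has passed.

    SQLite invokes the progress handler every 1000 virtual machine
    instructions; a nonzero return value aborts the running query and
    raises sqlite3.OperationalError.
    """
    conn.set_progress_handler(lambda: 1 if deadline.expired() else 0, 1000)
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        # Clear the handler so later queries on this connection are unaffected.
        conn.set_progress_handler(None, 0)


# A deliberately expensive query standing in for an unpaginated listing.
SLOW_QUERY = """
WITH RECURSIVE counter(x) AS (
    SELECT 1
    UNION ALL
    SELECT x + 1 FROM counter WHERE x < 100000000
)
SELECT count(*) FROM counter
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    try:
        run_with_deadline(conn, SLOW_QUERY, RequestDeadline(0.05))
    except sqlite3.OperationalError as exc:
        print("query aborted with the request:", exc)
```

The point of the sketch is that the database work is cancelled on the same timescale as the request, rather than continuing after the client has given up. Other databases offer their own mechanisms for this, such as server-side statement timeouts.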
What could have been improved:
What we'll be changing in the future:
Conclusion:
The database operations we performed should have been planned and modeled better to reduce risk. In response to the COVID-19 crisis, it has been a priority of ours to support our customers and ongoing community efforts. We shipped this change too casually, and we will make sure this operation can be done safely before attempting it again.
We will be deprecating the GET /tasks endpoint for the API: because it is not paginated, it cannot be made efficient. The GET /tasks/all endpoint is the preferred choice, as it is more efficient, is paginated, and has a broader feature set. We will coordinate with customers to ensure a smooth transition for integrators.
We will also be shipping code to ensure that long-running database operations cannot extend beyond the request lifecycle (i.e. they time out on the same timescale as the request at the load balancer).
We apologize for this blip in performance and want you to know that we can handle increased load; this event was not related to limitations in infrastructure.