Degraded database performance
Incident Report for Onfleet
Postmortem

Summary of events (times in PDT):

  • 20:37 - A database indexing operation was started to support contactless signatures in an upcoming release
  • 20:45 - As a result of the indexing operation, disk IO times began to spike, causing database operations to queue up and slow down
  • 21:00 - The indexing operation was cancelled on the primary DB node, but follow-on effects had already begun to put increased IO pressure on the primary
  • 21:20 - In an effort to restore normal functionality in the most expedient manner, we failed over to our secondary DB node
  • 21:55 - The new primary node warmed up sufficiently to handle production traffic at moderately degraded levels
  • 22:00 - We began to terminate long-running database queries to take pressure off the DB
  • 22:20 - Metrics returned to nominal levels

Rationale for the database operation:

The motivation for this database operation was to add support for contactless deliveries. We'll be shipping this feature soon once we have a solid plan in place for how to perform the DB operation.

Understanding the follow-on effects:

Recovery of database performance was hindered by a series of suboptimal query patterns that poisoned the cache and put additional load on the primary's disk. These queries came primarily from API clients using the GET /tasks endpoint; the requests could time out externally while the underlying queries remained long-running on the database.
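
To illustrate the failure mode, here is a minimal sketch in TypeScript, assuming a Node.js service using the official MongoDB driver behind an Express-style handler (the stack, names, and timeout values are placeholders, not our production code). It bounds a query's server-side runtime to the same budget the load balancer gives the request, so a query cannot outlive the request that issued it:

```typescript
// Sketch only: bound a MongoDB query's server-side runtime to the same budget
// the load balancer gives the HTTP request, so a query cannot keep running
// after its request has already timed out. All names and values are placeholders.
import express from "express";
import { MongoClient } from "mongodb";

const LB_TIMEOUT_MS = 30_000; // assumed load-balancer request timeout

const client = new MongoClient(process.env.MONGO_URI ?? "mongodb://localhost:27017");
const app = express();

app.get("/tasks", async (req, res) => {
  const deadline = Date.now() + LB_TIMEOUT_MS; // when the load balancer will give up

  // ... auth, validation, etc. would consume part of the budget here ...

  const remainingMs = Math.max(1_000, deadline - Date.now());
  try {
    const tasks = await client
      .db("example")           // placeholder database name
      .collection("tasks")     // placeholder collection name
      .find({ organization: String(req.query.org ?? "") }, { maxTimeMS: remainingMs })
      .limit(500)              // also cap the result size
      .toArray();
    res.json(tasks);
  } catch (err) {
    // A MaxTimeMSExpired-style error surfaces here instead of the query running on.
    res.status(504).json({ error: "query exceeded the request's time budget" });
  }
});

client.connect().then(() => app.listen(3000));
```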

What could have been improved:

  • The performance characteristics of the indexing operation should have been better understood ahead of time
  • We should have waited until closer to the bottom of our daily load curve (~120 minutes later) to perform this DB operation
  • Long-running database queries should never have been allowed to continue beyond the request lifecycle
  • We should have begun terminating these long-running queries earlier (see the sketch after this list)
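
As a rough sketch of the last two points, the following TypeScript snippet (again assuming a MongoDB-style deployment; the 30-second threshold and the operation filter are placeholders) finds operations that have been running longer than a threshold and terminates them with killOp:

```typescript
// Sketch only: find and terminate database operations that have run longer
// than a threshold. The threshold and the operation filter are illustrative.
import { MongoClient, Document } from "mongodb";

const MAX_SECS_RUNNING = 30; // placeholder threshold

async function killLongRunningOps(client: MongoClient): Promise<void> {
  const admin = client.db("admin");

  // currentOp with a filter: only active read operations older than the threshold.
  const { inprog } = (await admin.command({
    currentOp: true,
    active: true,
    op: "query",
    secs_running: { $gte: MAX_SECS_RUNNING },
  })) as { inprog: Document[] };

  for (const op of inprog) {
    console.warn(`killing op ${op.opid} on ${op.ns} after ${op.secs_running}s`);
    await admin.command({ killOp: 1, op: op.opid });
  }
}

// Example: run periodically from an operational worker rather than ad hoc.
async function main() {
  const client = new MongoClient(process.env.MONGO_URI ?? "mongodb://localhost:27017");
  await client.connect();
  setInterval(() => killLongRunningOps(client).catch(console.error), 10_000);
}

main().catch(console.error);
```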

What we'll be changing in the future:

  • We will plan indexing updates like this more thoroughly
  • These production events will be planned further in advance and made publicly visible on our status page
  • We will ensure database operations cannot continue beyond the request lifecycle
  • We will put additional monitoring and logging in place so that these kinds of queries are detected before they cause problems (see the sketch after this list)
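
One low-effort option for the monitoring item above, sketched below in TypeScript, is to have the database itself log any operation slower than a threshold so the existing log and alerting pipeline can flag it. The profile command shown is MongoDB-specific and the threshold and database name are placeholders:

```typescript
// Sketch only: enable database-level slow-operation logging so long-running
// queries show up in logs/alerts before they become an incident.
import { MongoClient } from "mongodb";

const SLOW_MS = 250; // placeholder: log any operation slower than this

async function enableSlowOpLogging(uri: string, dbName: string): Promise<void> {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    // Profile level 0 keeps the profiler off, but operations slower than
    // `slowms` are still written to the server log for alerting to pick up.
    const previous = await client.db(dbName).command({ profile: 0, slowms: SLOW_MS });
    console.log("previous profiling settings:", previous);
  } finally {
    await client.close();
  }
}

enableSlowOpLogging(process.env.MONGO_URI ?? "mongodb://localhost:27017", "example")
  .catch(console.error);
```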

Conclusion:

The database operations we performed should have been planned and modeled better to reduce risk. In response to the COVID-19 crisis, it has been a priority of ours to support our customers and ongoing community efforts; we shipped this change too casually, and we will ensure the operation can be repeated safely.

We will be deprecating the GET /tasks API endpoint: because it is not paginated, it cannot be made efficient. The GET /tasks/all endpoint is the preferred replacement, as it is more efficient, is paginated, and has a broader feature set. We will coordinate with customers to ensure a smooth transition for integrators.

We will also be shipping code to ensure that long-running database queries cannot extend beyond the request lifecycle (i.e. they time out on the same timescale as the request does at the load balancer). We apologize for this blip in performance and want you to know that we can handle increased load; this event was not related to limitations in our infrastructure.
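
For integrators planning the move off GET /tasks, here is a minimal sketch in TypeScript (using Node's built-in fetch) of walking the paginated GET /tasks/all endpoint page by page. The from/lastId parameter names, the tasks/lastId response fields, and the Basic-auth scheme follow the pattern described in the API documentation and should be verified there; the API key is a placeholder:

```typescript
// Sketch only: page through tasks via GET /tasks/all instead of pulling
// everything in one long-running query. Verify parameter and field names
// (from, lastId, tasks) against the current API reference.
const API_KEY = process.env.ONFLEET_API_KEY ?? ""; // placeholder credential
const BASE_URL = "https://onfleet.com/api/v2";

interface TaskPage {
  lastId?: string;
  tasks: Array<Record<string, unknown>>;
}

async function* listTasks(fromMs: number): AsyncGenerator<Record<string, unknown>> {
  let lastId: string | undefined;
  do {
    const url = new URL(`${BASE_URL}/tasks/all`);
    url.searchParams.set("from", String(fromMs));
    if (lastId) url.searchParams.set("lastId", lastId);

    const res = await fetch(url, {
      headers: {
        // HTTP Basic auth with the API key as the username.
        Authorization: "Basic " + Buffer.from(`${API_KEY}:`).toString("base64"),
      },
    });
    if (!res.ok) throw new Error(`tasks request failed: ${res.status}`);

    const page = (await res.json()) as TaskPage;
    yield* page.tasks;
    lastId = page.lastId;
  } while (lastId); // stop when the API reports no further pages
}

// Example usage: iterate over the last 24 hours of tasks one page at a time.
async function main() {
  const from = Date.now() - 24 * 60 * 60 * 1000;
  for await (const task of listTasks(from)) {
    console.log(task["shortId"] ?? task["id"]);
  }
}

main().catch(console.error);
```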

Posted May 01, 2020 - 12:43 PDT

Resolved
This issue has now been fully resolved. Thank you for your patience. A new database index required for upcoming functionality to help with COVID-19 efforts appears to have been created incorrectly, which resulted in severe degradation of database performance. We will be following up with a full post-mortem tomorrow. If you have any questions in the meantime, please email us at support@onfleet.com.
Posted Apr 30, 2020 - 22:23 PDT
Update
We are expecting to be fully back to baseline performance metrics in the next 5-10 mins.
Posted Apr 30, 2020 - 21:54 PDT
Update
The underlying issue has been resolved and we are now just waiting for the primary database node to complete its warm-up sequence.
Posted Apr 30, 2020 - 21:39 PDT
Monitoring
The failover operation has been successful and we are starting to see recovering response times.
Posted Apr 30, 2020 - 21:23 PDT
Update
We are stepping down our primary node in order to provide relief as quickly as possible.
Posted Apr 30, 2020 - 21:13 PDT
Identified
We are currently investigating degraded performance in our main database cluster.
Posted Apr 30, 2020 - 21:08 PDT
This incident affected: Dashboard and API.