API failures on 2021/05/20
Incident Report for Onfleet
Resolved
At 17:37 PDT on May 20, 2021, we began a deploy to production as part of our regular release process. This release included a change which caused an API operation to fail. As these failures happened, they caused the workers servicing the related API requests to crash. These failures caused clients to retry these operations, which created a cycle of increased occurrences of this request and more crashes. These crashes caused a large percentage of API requests to fail which may have caused a significantly degraded experience for users of all of our applications and may have prevented users from using our product entirely.

We deployed a fix for this issue at 17:50 PDT.

Users were likely most affected between 17:42 and 17:50 PDT. Subscribers of our status page would have received notifications about this incident.

Root Cause: The error we introduced was not covered by a test case. Our releases for the API service are tested internally with a unit test suite and with an integration test suite. In this instance, the particular API operation was not fully exercised before the release was made to production.

Improving our test coverage is a key focus of our engineering team. We are continuing to improve our test coverage for existing code and implementing more strict guidelines internally about what kinds of tests are required for new code to be approved.

We apologize to any customers who were affected during this period; we know that you depend on our product to be available and to be stable and we take this responsibility seriously.
Posted May 20, 2021 - 17:30 PDT