High latency
Incident Report for Onfleet
The high latency incident resulting from a problematic database cluster version upgrade has now been resolved, and all system metrics have returned to baseline levels.

We are now verifying that all container state is correct, to ensure that assignments and task completions made during this incident are properly synchronized across all database resources.

Now that system availability and performance have been restored, we are hard at work internally and with our database hosting partner to understand why this previously tested update had such an adverse effect on database performance.

Although we tried everything we could to bring performance on the new version to acceptable levels, we were unable to do so in a reasonable time and had to roll back to the previous version. After the rollback completed, our primary node was slow to refill its cache because of excessive disk I/O, so we replaced it with another node in order to load all cache resources more quickly.
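The cache behavior described above can be sketched in the abstract. The following is a minimal illustration (not Onfleet's actual architecture; the `Node` class, key counts, and data shapes are all hypothetical) of why a freshly promoted node is slow until its cache is warm, and how pre-loading hot keys pays the disk-read cost before the node takes traffic rather than during requests.

```python
class Node:
    """A toy database node: an in-memory cache in front of slow disk storage."""

    def __init__(self, disk):
        self.disk = disk        # key -> value, the authoritative store
        self.cache = {}         # a fresh node starts with a cold cache
        self.disk_reads = 0     # counts expensive reads from disk

    def get(self, key):
        if key not in self.cache:       # cache miss: fall through to disk
            self.disk_reads += 1
            self.cache[key] = self.disk[key]
        return self.cache[key]          # cache hit: served from memory

    def warm(self, hot_keys):
        """Pre-load frequently accessed keys before serving traffic."""
        for key in hot_keys:
            self.get(key)


disk = {i: f"record-{i}" for i in range(100)}
hot_keys = list(range(20))

# Cold node: every first request pays a disk read while users wait.
cold_node = Node(disk)
for k in hot_keys:
    cold_node.get(k)

# Warmed node: the same reads happen up front, before taking traffic.
warm_node = Node(disk)
warm_node.warm(hot_keys)
reads_before_traffic = warm_node.disk_reads
for k in hot_keys:
    warm_node.get(k)          # all cache hits, no additional disk reads
```

Replacing a node whose cache must fill organically with one that can be warmed up front trades a brief, controlled burst of disk I/O for much better request latency once traffic resumes.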

At this time, it appears that a single endpoint was responsible for the severe performance degradation experienced by all users once the cluster update was applied. We did not catch this in our testing on non-production environments, and we are investigating why our evaluation process missed this important issue. Specifically, it looks like index handling for a single database record property changed in such a way that every time a dashboard issued the problematic request, our primary database node was forced to read data from disk instead of memory, which quickly exhausted the burst credit allocation that AWS EBS provides for these situations.
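The class of problem described above, where a query silently stops using an index and falls back to scanning every record, can be demonstrated with any database engine. This sketch uses SQLite purely for illustration; the `tasks` table, `status` column, and index name are hypothetical and do not reflect Onfleet's actual schema or database technology.

```python
import sqlite3

# In-memory database with a hypothetical tasks table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO tasks (status, payload) VALUES (?, ?)",
    [("open" if i % 2 else "done", "x" * 10) for i in range(1000)],
)

def plan(query):
    """Ask the engine how it intends to execute the query."""
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query))

query = "SELECT * FROM tasks WHERE status = 'open'"

# Without an index on status, every row (and so every page) must be read.
plan_before = plan(query)

conn.execute("CREATE INDEX idx_status ON tasks (status)")

# With the index, only the matching entries are touched.
plan_after = plan(query)
```

Inspecting the query plan before and after a schema or engine change is exactly the kind of check that catches this failure mode: a plan that flips from an index search to a full scan turns every request into disk I/O once the working set no longer fits in memory.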

If you want to learn more about this issue, please do not hesitate to email us at support@onfleet.com.

Thank you for your patience while our team worked on resolving this incident.
Posted Nov 28, 2018 - 16:45 PST
The underlying database problem has now been fully rolled back and we are seeing much improved response times. Most dashboards should now load within a few seconds. We will soon begin verifying container consistency and other operations in order to restore correct state as quickly as possible.
Posted Nov 28, 2018 - 15:34 PST
We have rolled back this change as we were not able to determine the cause of the problem quickly enough. We now expect response times to return to normal levels within the next 15 minutes.
Posted Nov 28, 2018 - 14:52 PST
We are currently working to resolve a high latency situation affecting a recently upgraded database cluster.

The cache is taking longer to fill than expected.
Posted Nov 28, 2018 - 14:17 PST
This incident affected: Dashboard and API.