The high-latency incident caused by a problematic database cluster version upgrade has been resolved, and all system metrics have returned to baseline levels.
We are now verifying that all container state is correct, so that assignments and task completions made during the incident are properly synchronized across all database resources.
Now that system availability and performance have been restored, we are working internally and with our database hosting partner to understand why this previously tested update had such an adverse effect on database performance.
Although we tried everything we could to bring performance on the new version up to acceptable levels, we could not do so in a reasonable time and had to roll back to the previous version. After the rollback, our primary node was slow to refill its cache because of excessive disk I/O, so we replaced it with a fresh node that could load the cache more quickly.
At this time, a single endpoint appears to be responsible for the severe performance degradation experienced by all users once the cluster update was applied. We did not catch this in testing on non-production environments, and we are investigating why our evaluation processes missed such an important issue. Specifically, it looks like index handling for a single database record property changed in such a way that every time a dashboard issued the problematic request, our primary node was forced to read data from disk instead of memory, which quickly exhausted the burst credit allocation that AWS EBS provides for these situations.
If you would like to learn more about this issue, please do not hesitate to email us at email@example.com.
Thank you for your patience while our team worked on resolving this incident.