APM - Application Node app-11 performance issues
Incident Report for AppNeta Services
Postmortem

Summary:

A subset of customers (those using the app-11 application node) were affected by a data processing issue, which lasted from 2:24pm EDT to 6:39pm EDT on August 31st, 2021. During this time, many Delivery path states were marked as invalid. We will address the critical gap in monitoring that increased the duration of the issue and widened its customer impact.


On Tuesday, August 31st, at 2:24pm EDT, the messaging bus servicing the app-11 application node experienced a fault which increased load on both the application node and the database. Our engineering team was notified of this increased load at 2:53pm EDT. Unable to identify any user-facing impact, we continued to monitor the system. At 4:41pm EDT, we received another alert pertaining to the ingestion of traceroute data, and we began investigating delays in processing inbound traceroute data. At 4:54pm EDT, we received our first report of customer-facing impact, reported as a large portion of Delivery paths in a failed state. We posted a status page at 5:04pm EDT for the issues with the app-11 application node.

Our engineering team continued investigating the issue and initiated a restart of the app-11 application node at 6:03pm EDT. The restart was completed by 6:09pm EDT, and both the messaging bus and database were confirmed stable at 6:39pm EDT.

We quickly found that the APM service restart addressed many of the invalid Delivery path statuses, but not all. The following morning, AppNeta Engineering began forcing a reset of the Delivery path states via the API, an action which was completed by 2:17pm EDT and represented the final resolution of the issue.

AppNeta Engineering has conducted a post-mortem and root cause analysis of the incident and has identified three issues to address:

First is the fault in the messaging bus. We are still investigating the nature of this fault, and once it is understood, we will put a fix in place.

Second is the delay in addressing invalid Delivery path statuses. APM service restarts are expected to force a reset of all Delivery path statuses as Monitoring Points reconnect and new data comes in. As mentioned above, we found that this was not the case during this incident. AppNeta Engineering is still investigating the reason for this and will address it.

Finally, this incident has identified a gap in our monitoring and alerting coverage. The most obvious user-facing impact of this incident was the set of Delivery paths being put into a failed state, a data point which we are not currently monitoring. As an output of our post-mortem, our Engineering team will move quickly to fill this gap. This will ensure that should we ever experience a similar issue, we will identify and mitigate it much more quickly.
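
As an illustration only, a check of the kind described above could look like the following sketch. It assumes a hypothetical list of Delivery path states (the state name "failed" and the 20% threshold are assumptions for this example, not AppNeta's actual data model or alerting configuration) and flags when the share of failed paths crosses the threshold.

```python
from collections import Counter
from typing import Iterable


def failed_path_ratio(states: Iterable[str]) -> float:
    """Return the fraction of paths whose state is 'failed' (hypothetical state name)."""
    counts = Counter(states)
    total = sum(counts.values())
    if total == 0:
        # No paths reported: treat as no failures rather than dividing by zero.
        return 0.0
    return counts["failed"] / total


def should_alert(states: Iterable[str], threshold: float = 0.2) -> bool:
    """Return True when the failed-path ratio meets or exceeds the threshold.

    The default threshold of 0.2 is an assumption chosen for illustration.
    """
    return failed_path_ratio(states) >= threshold
```

A monitoring job could evaluate should_alert on each polling interval and notify the on-call engineer when it returns True, so that a sudden rise in failed Delivery paths is noticed before customers report it.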

Posted Sep 15, 2021 - 18:37 EDT

Resolved
The remaining paths on app-11 have been corrected. This issue is now resolved.

AppNeta Engineering is investigating the root cause of this issue and will post a post-mortem once our investigation is complete.
Posted Sep 02, 2021 - 14:36 EDT
Update
The majority of paths on app-11 with an incorrect status have been corrected. We are continuing to work towards recovery of a small subset of remaining paths.
Posted Sep 01, 2021 - 17:30 EDT
Update
On the app-11 application node, we experienced performance issues beginning at 14:26 EDT. A fix was put in place and recovery began at 18:40 EDT.
A subset of path statuses has not been updated and our team is working towards recovery. We’ll share updates as we progress.
AppNeta Engineering is investigating the root cause of this issue and will post a post-mortem once our investigation is complete.
Posted Sep 01, 2021 - 11:45 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 31, 2021 - 19:06 EDT
Identified
We’ve identified performance issues for users logging in to the app-11 application node.
Our engineers are investigating the cause. While we work on the resolution, services may become unavailable for a short period of time.
Posted Aug 31, 2021 - 17:04 EDT
This incident affected: AppNeta Performance Manager - App, AppNeta Performance Manager - Login, Enterprise Monitoring Points, Global Monitoring Points, Experience Monitoring, Delivery Monitoring, Usage Monitoring, AppNeta APIs and Help & Resources (AppNeta Help Desk).