A subset of customers (those using the app-11 application node) were affected by a data processing issue lasting from 2:24pm to 6:39pm EST on August 31st, 2021. During this time, many Delivery path states were marked as invalid. The critical monitoring gap that extended the duration and widened the customer impact of this issue will be addressed.
On Tuesday, August 31st, at 2:24pm EST, the messaging bus servicing the app-11 application node experienced a fault that increased load on both the application node and the database. Our engineering team was notified of this increased load at 2:53pm EST. Unable to identify any user-facing impact, we continued to monitor the system. At 4:41pm EST, we received another alert pertaining to the ingestion of traceroute data, and we began investigating delays in processing inbound traceroute data. At 4:54pm EST, we received our first report of customer-facing impact: a large portion of Delivery paths in a failed state. We posted a status page update at 5:04pm EST for the issues with the app-11 application node.

Our engineering team continued investigating and initiated a restart of the app-11 application node at 6:03pm EST. The restart was completed by 6:09pm EST, and both the messaging bus and database were confirmed stable at 6:39pm EST. We quickly found that the APM service restart cleared many of the invalid Delivery path statuses, but not all. The following morning, AppNeta Engineering began forcing a reset of the remaining Delivery path states via the API, an action completed by 2:17pm EST that represented the final resolution of the issue.
AppNeta Engineering has conducted a post-mortem and root cause analysis of the incident and has identified three issues to address:
First is the fault in the messaging bus itself. We are still investigating the nature of this fault; once it is understood, a fix will be put in place.
Second is the delay in correcting invalid Delivery path statuses. APM service restarts are expected to force a reset of all Delivery path statuses as Monitoring Points reconnect and new data arrives. As noted above, this did not happen during this incident. AppNeta Engineering is still investigating why and will address it.
Finally, this incident identified a gap in our monitoring and alerting coverage. The most obvious user-facing impact of this incident was the set of Delivery paths placed in a failed state, a data point we do not currently monitor. As an output of our post-mortem, our Engineering team will move quickly to fill this gap, ensuring that should a similar issue ever occur, we will identify and mitigate it much more quickly.