Traceroute Data Processing Delays
Incident Report for AppNeta Services
Postmortem

Summary: All APM customers experienced delays in traceroute processing for several hours on August 31st, 2021, caused by a fault in the messaging bus of one of our application nodes. During the incident, new and recent traceroutes took up to an hour to appear in APM. The gap in monitoring that led to the increased duration and wider customer impact of the issue will be addressed.


On Tuesday, August 31st, at 2:24pm EDT, the messaging bus servicing the app-11 application node experienced a fault that destabilized the Monitoring Point connections for that node. Our engineering team was notified of an increase in load against the database at 2:53pm EDT. Unable to identify any user-facing impact, we continued to monitor the system. At 4:41pm EDT, we received another alert, this one pertaining to the ingestion of traceroute data, and we began investigating. After confirming that a large influx of new data was delaying the processing of inbound traceroute data, we posted a status page update for this incident at 5:19pm EDT. Our engineering team continued investigating the issue and initiated a restart of the app-11 application node at 6:03pm EDT; the restart was completed by 6:09pm EDT. The restart mitigated the fault in the messaging bus, which in turn mitigated the influx of traceroute data that was causing the delays, and we could see the issue subsiding by 7:21pm EDT. We monitored the issue through the night and considered it resolved the following morning at 8:38am EDT.

AppNeta Engineering has conducted a post-mortem and root cause analysis of the incident and has identified the following issues to address:

First, the fault in the messaging bus itself needs to be addressed. We are still investigating the nature of this fault; once it is understood, a fix will be put in place.

Secondly, this incident has identified a gap in our monitoring and alerting, specifically where Monitoring Point stability is concerned. This instability, which was caused by the increased load on the database, prompted the influx of traceroute data which drove up the latency in processing inbound traceroute data for all APM customers. During this time, new and recent traceroutes were unavailable in APM for up to an hour, either on the Routes chart or in the individual Path Details Timeline. While our monitoring did detect the increased latency, identification of Monitoring Point connection instability would have led to a faster resolution of this issue.

Posted Sep 15, 2021 - 18:55 EDT

Resolved
This incident has been resolved.
Traceroute data processing has fully recovered and new traceroute data is no longer being delayed before becoming available in the AppNeta Performance Manager.
Posted Sep 01, 2021 - 17:34 EDT
Monitoring
Traceroute data processing has fully recovered and new traceroute data is no longer being delayed before becoming available in the AppNeta Performance Manager.
Posted Sep 01, 2021 - 11:38 EDT
Identified
Traceroute data processing is still impaired but is starting to see recovery. New traceroute data continues to be delayed by approximately 33 minutes before becoming available in the AppNeta Performance Manager.
Posted Aug 31, 2021 - 20:01 EDT
Investigating
We are currently experiencing delays processing traceroute data. New traceroute data is being delayed by approximately fifteen minutes before becoming available in the AppNeta Performance Manager. Users are not able to see the most recent traceroutes on their Network Paths and Routes panel, and newly created paths won’t display Routes. This is affecting all AppNeta Performance Manager users.
Our engineers are investigating the cause.
Posted Aug 31, 2021 - 17:19 EDT
This incident affected: Delivery Monitoring.