Summary: All APM customers experienced delays in traceroute processing for several hours on August 31st, 2021, caused by a fault in the messaging bus of one of our application nodes. During the incident, new and recent traceroutes took up to an hour to appear in APM. We will address the gap in monitoring that lengthened the incident and widened its customer impact.
On Tuesday, August 31st, at 2:24pm EST, the messaging bus servicing the app-11 application node experienced a fault which destabilized the Monitoring Point connections for that node. Our engineering team was notified of an increase in load against the database at 2:53pm EST. Unable to identify any user-facing impact, we continued to monitor the system. At 4:41pm EST, we received another alert, this time pertaining to the ingestion of traceroute data, and we began investigating delays in processing inbound traceroute data. After confirming that the delays were caused by a large influx of new data, we posted a status page for this incident at 5:19pm EST. Our engineering team continued investigating and initiated a restart of the app-11 application node at 6:03pm EST. The restart was completed by 6:09pm EST. It mitigated the fault in the messaging bus, which in turn mitigated the influx of traceroute data that was causing the delays, and we could see the issue subsiding by 7:21pm EST. We monitored the issue through the night and considered it resolved the following morning at 8:38am EST.
AppNeta Engineering has conducted a post-mortem and root cause analysis of the incident and has identified two issues to address:
First, we need to address the fault in the messaging bus itself. We are still investigating the nature of this fault and will put a fix in place once it is understood.
Second, this incident identified a gap in our monitoring and alerting, specifically where Monitoring Point stability is concerned. The connection instability, which stemmed from the messaging bus fault and coincided with the increased load on the database, prompted the influx of traceroute data that drove up the latency in processing inbound traceroute data for all APM customers. During this time, new and recent traceroutes were unavailable in APM for up to an hour, both on the Routes chart and in the individual Path Details Timeline. While our monitoring did detect the increased latency, identifying the Monitoring Point connection instability would have led to a faster resolution of this issue.
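A check for Monitoring Point connection instability could, for illustration, count reconnect events per application node over a sliding time window and raise an alert when the count crosses a threshold. The sketch below is hypothetical and is not a description of AppNeta's actual monitoring: the class name, the idea of a "reconnect event", and the window and threshold values are all assumptions made for the example.

```python
from collections import defaultdict, deque


class ReconnectRateMonitor:
    """Hypothetical sliding-window counter of reconnects per application node."""

    def __init__(self, window_seconds=300, threshold=50):
        # Both values are illustrative assumptions, not tuned numbers.
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = defaultdict(deque)  # node name -> reconnect timestamps

    def record_reconnect(self, node, timestamp):
        """Record one reconnect; return True if the node should alert.

        Timestamps are seconds and are assumed to arrive in
        non-decreasing order for each node.
        """
        events = self._events[node]
        events.append(timestamp)
        # Drop events that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events) >= self.threshold


monitor = ReconnectRateMonitor(window_seconds=60, threshold=3)
for second in (0, 10, 20):
    should_alert = monitor.record_reconnect("app-11", second)
print(should_alert)  # True: three reconnects within 60 seconds
```

Keying the counter by node means a burst of reconnects on a single node stands out on its own rather than being averaged across the fleet.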