RIPE Atlas software probes degradation

Incident Report for RIPE NCC

Resolved

This incident has been resolved.

The root cause was a backend that started answering more slowly than expected, which caused many internal requests to take too long and clogged up the processing pipelines. These requests are made when probes reconnect and are handled by our event processor. The same event processor is also responsible for handling liveness signals from the controllers that manage the probes. Since these signals were delayed as well, the system eventually determined that the controllers were not healthy and stopped sending probes to them. The (software) probes that stayed connected never saw this problem, but the ones that tried to reconnect accumulated over time, causing the degradation.

We remediated the issue by adding low timeouts to the non-critical parts of the pipeline, which allowed the backlog to be processed quickly. As a follow-up, we'll add more asynchronous processing to our event handling to prevent this type of issue from appearing again.
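
As an illustration only, the effect of a low timeout on a non-critical step can be sketched as follows. This is not our production code: the function names, the event shapes and the timeout values are all hypothetical, and the sketch assumes a single pipeline that processes reconnect events and controller liveness signals one after another.

```python
import asyncio


async def slow_backend_lookup(probe_id: int, delay: float) -> str:
    """Hypothetical stand-in for the backend request made when a probe reconnects."""
    await asyncio.sleep(delay)
    return f"metadata-for-{probe_id}"


async def handle_reconnect(probe_id: int, backend_delay: float, timeout: float) -> dict:
    """Handle one reconnect event, treating the backend lookup as non-critical."""
    try:
        # Give up on the lookup after `timeout` seconds instead of blocking the pipeline.
        metadata = await asyncio.wait_for(
            slow_backend_lookup(probe_id, backend_delay), timeout
        )
    except asyncio.TimeoutError:
        metadata = None  # assumption: the reconnect can proceed without this data
    return {"probe_id": probe_id, "metadata": metadata}


async def process_events(events, backend_delay: float, timeout: float) -> list:
    """Process events strictly in order, as a single shared pipeline would."""
    results = []
    for kind, ident in events:
        if kind == "reconnect":
            results.append(await handle_reconnect(ident, backend_delay, timeout))
        else:
            # Liveness signal from a controller: cheap, but it waits behind earlier events.
            results.append({"controller_id": ident, "alive": True})
    return results
```

Calling `asyncio.run(process_events([("reconnect", 1), ("heartbeat", 7)], backend_delay=5.0, timeout=0.05))` reaches the liveness signal after roughly the timeout rather than after the full backend delay, which is the behaviour the low timeouts were meant to restore.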

Posted Jun 08, 2026 - 16:43 CEST

Monitoring

We have implemented the necessary improvements and are monitoring their effect.

Posted Jun 08, 2026 - 15:06 CEST

Identified

Over the weekend of 6-7 June we started experiencing an issue where software probes that disconnected for any reason were not allowed to connect again. Over time this caused a gradual decrease in the number of connected probes, reaching up to 20% of software probes, or about 10% of the total probe population.

We identified the root cause as a delay in the processing of internal control messages. We applied ad hoc measures that allowed most probes to reconnect temporarily, and we are currently working on a proper solution.

Posted Jun 08, 2026 - 09:00 CEST

This incident affected: Non-Critical Services (RIPE Atlas).