Emergency Maintenance - IBX SP4
Scheduled Maintenance Report for Huge Networks
Postmortem

São Paulo, 22/07/2023

On 21/07/2023 (DD/MM/YYYY) we saw a significant increase in memory consumption on our Juniper MX routers, specifically at our PoP in the Equinix SP4 datacenter.

While our traffic growth is impressive, we are a long way from the capacity of our network equipment, especially our edge routers. However, on 21/07/2023 at approximately 10:09:00 AM, we observed a severe increase in RAM consumption, followed by a crash and restart of the routing engine (fpc/pfe).

Our team immediately tried to gain access through our OOB links, which are served by two different providers, but was unsuccessful. The equipment did not respond to any remote access attempt.

At the same time, we called the on-site team, which reached the router in less than 5 minutes and gained access via the console port. However, the routing engine had come back on its own a few seconds earlier.

Once the routing engine was back, our team immediately identified the cause of the problem and proceeded with the troubleshooting recommended by the manufacturer.

Throughout the day, we performed several tests to identify the reason for the abnormal RAM consumption by the Routing Engine, and the failures we found all pointed to a single possibility: a memory leak.

We turned off several route reflectors, routing instances, and everything else that could cause excessive RAM consumption, since the only process consuming an excessive amount of memory was "rpd", which is mainly responsible for routing.

Unfortunately, this did not solve the problem, but it gave us answers.

On all Junos OS and Junos OS Evolved platforms with rib-sharding enabled, a memory leak is observed in rpd when a routing restart is performed. If the system has been up for a long time and routing has been restarted multiple times, the leak can exhaust system memory. This can cause processes to crash or configuration changes not to take effect for lack of memory, which may in turn impact traffic.
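To illustrate how a leak of this kind could be spotted before memory runs out, the following is a minimal sketch in Python. It fits a straight line to periodic samples of rpd memory usage and projects when a given limit would be reached. The function names, the sample values, and the 8000 MB limit are hypothetical and chosen only for illustration; they are not measurements from our routers, and the sketch does not cover how the samples would be collected.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (seconds since first sample, memory used in MB)


def fit_line(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope in MB/s, intercept in MB)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    slope = cov / var_t
    return slope, mean_m - slope * mean_t


def seconds_until_exhaustion(
    samples: Sequence[Sample], limit_mb: float
) -> Optional[float]:
    """Project how long until usage reaches limit_mb, measured from the last sample.

    Returns None when the fitted trend is flat or decreasing.
    """
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    last_t = samples[-1][0]
    fitted_now = slope * last_t + intercept
    return max((limit_mb - fitted_now) / slope, 0.0)


# Hypothetical samples taken every 10 minutes (illustrative values only).
rpd_samples = [(0.0, 4000.0), (600.0, 4100.0), (1200.0, 4200.0), (1800.0, 4300.0)]
eta = seconds_until_exhaustion(rpd_samples, limit_mb=8000.0)
print(f"Projected time to limit: {eta:.0f} s" if eta is not None else "No growth trend")
```

A linear fit is a deliberately simple choice: a leak that grows in steps, such as one triggered by each routing restart, will not follow a straight line exactly, so the projection should be read as a rough early warning rather than a precise forecast.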

Although the problem has been overcome, we have chosen to invest in Nokia 7750 routers, with the aim of having an integrated, solid architecture working in parallel with our Juniper backbone.

Problem Report Search (from Juniper Networks):

PR1716431

PR1662239

Posted Jul 22, 2023 - 19:05 GMT-03:00

Completed
The scheduled maintenance has been completed.
Posted Jul 22, 2023 - 02:01 GMT-03:00
In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted Jul 22, 2023 - 01:01 GMT-03:00
Scheduled
We will be rebooting our main edge routers on IBX SP4 in order to resolve a memory leak firmware bug on Junos OS routers. During this reboot, your BGP session may be unavailable for up to 10 minutes.
Posted Jul 21, 2023 - 15:54 GMT-03:00
This scheduled maintenance affected: South America (São Paulo).