How an MSSP Transformed an Upgrade Issue into a Confidence Builder for Customers

Tony Dalton

May 15, 2024

This blog series spotlights real-world moments in which network engineers use BackBox security-centric automation to save their organizations from costly downtime and surprises.

Synopsis

During a planned maintenance window for firewall upgrades, a BackBox Managed Security Service Provider (MSSP) customer encountered an issue that required considerable time to troubleshoot and resolve. This included escalating to senior engineers, on-site physical intervention, and hardware replacement.

The lessons learned were quickly applied to develop a BackBox automation that audits the remaining firewall infrastructure, identifies devices susceptible to this issue, and enables planned remediation before future upgrades. These measures helped make upcoming upgrades smoother and avoid further costly interventions.

Problem

The MSSP works with numerous enterprise and government customers. While it was onboarding a new customer, that customer's Palo Alto firewall infrastructure needed an upgrade to reach an appropriate software level. The MSSP team uses the BackBox Network Cyber Resilience Platform for onboarding activities and had developed an automation that guides each firewall through the necessary multi-step upgrade from PAN-OS 9.1.x to 10.1.0 to 10.2.0 to 10.2.4-h4.
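To make the multi-step path concrete, here is a minimal Python sketch of how the remaining hops for a given firewall could be worked out. It is an illustration only, not the MSSP's BackBox automation: the function name `remaining_hops` is hypothetical, and the sketch assumes a device is either on a 9.1.x release or exactly on one of the versions in the path.

```python
# Upgrade path taken from the article: 9.1.x -> 10.1.0 -> 10.2.0 -> 10.2.4-h4
UPGRADE_PATH = ["10.1.0", "10.2.0", "10.2.4-h4"]


def remaining_hops(current_version, path=UPGRADE_PATH):
    """Return the versions still to be installed, in order (hypothetical helper)."""
    if current_version in path:
        # Already part-way along the path: only the later hops remain.
        return path[path.index(current_version) + 1:]
    if current_version.startswith("9.1."):
        # Starting point described in the article: every hop remains.
        return list(path)
    # Anything else is outside the assumptions of this sketch.
    raise ValueError(f"version {current_version!r} is not on the known upgrade path")
```

Calling `remaining_hops("10.2.0")` returns only the final hop, which is the stage at which the boot loop described below appeared.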

The work progressed as planned initially. However, when the upgrade to 10.2.0 was completed and the firewall rebooted, it did not come back up. A P1 incident was raised, and an investigation was started, which revealed that the device was stuck in a boot loop and could not get past POST.

Through manual intervention and trial and error, the MSSP team eventually discovered that some SFPs were causing the boot loop. Once these were removed, the system booted as usual. The “faulty” SFPs were non-OEM and had never been listed as supported. Although they had worked fine in earlier code versions, the system no longer recognized them once it was upgraded to 10.2.

When the SFPs were replaced with officially supported models, the firewalls booted successfully, completing the upgrade cycle.

Impact

To resolve the initial problem, the MSSP team had to bring in additional resources, escalating to senior engineers and arranging on-site physical intervention. The issue was resolved after 4.5 hours, which exceeded the planned maintenance window by 2.5 hours, and an incident was declared. Fortunately, the environment was highly available, so the customer experienced no outage. However, risk exposure increased during that time, and, ultimately, a combined 32 hours of effort went into troubleshooting and response.

Mitigation and Avoidance

In light of this issue, the next step was to understand and mitigate any further exposure. Before adopting BackBox, the options would have included some combination of:

1. Physical inspection – Have local hands and eyes check each device
2. Custom scripting – Find experts with the right programming skills to write a script that logs in to each device separately and interrogates it

While option 2 appears to be relatively quick and easy, that is not always true in a large, complex environment. Since BackBox was already utilized to manage these firewalls for backup, inventory, and task automation, the groundwork was set to access and interrogate the devices for relevant information. Using a no-/low-code approach to automation, BackBox offers a framework for users to create automations quickly and efficiently. There’s no need for the costly Python/Ansible team to be involved.

An automation task for logging into each device and checking the SFPs was created and tested in approximately 30 minutes.
Nearly 100 devices were added to a group within BackBox, and a job was quickly configured and run. Within minutes, each device was interrogated.
BackBox generated a consolidated report indicating which devices had supported SFPs and which did not.
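The logic behind that consolidated report can be sketched in a few lines of Python. This is only an illustration of the idea, not how BackBox implements it (the automation in this story was built with BackBox's no-/low-code framework). The part numbers, device names, and the `audit` and `summarize` helpers are hypothetical, and the sketch assumes the transceiver inventory has already been collected from each firewall.

```python
from dataclasses import dataclass

# Hypothetical part numbers, for illustration only.
SUPPORTED_PARTS = {"OEM-SFP-A", "OEM-SFP-B"}


@dataclass
class DeviceReport:
    device: str
    unsupported: list  # part numbers not found in the supported set

    @property
    def ready_for_upgrade(self):
        # A device is ready only if every installed transceiver is supported.
        return not self.unsupported


def audit(inventory, supported=SUPPORTED_PARTS):
    """Classify each device's transceivers; returns one report per device, sorted by name."""
    reports = []
    for device, parts in sorted(inventory.items()):
        bad = [p for p in parts if p not in supported]
        reports.append(DeviceReport(device, bad))
    return reports


def summarize(reports):
    """Render a one-line-per-device consolidated report."""
    lines = []
    for r in reports:
        status = "OK" if r.ready_for_upgrade else "REPLACE: " + ", ".join(r.unsupported)
        lines.append(f"{r.device}: {status}")
    return "\n".join(lines)
```

Given an inventory such as `{"fw-01": ["OEM-SFP-A"], "fw-02": ["OEM-SFP-A", "GENERIC-X"]}`, `summarize(audit(inventory))` lists `fw-01` as OK and flags `GENERIC-X` on `fw-02` for replacement before the upgrade.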

By the time the dust settled from the initial issue and people were considering the next steps, the MSSP team had the data to make an informed plan.

These checks, along with similar ones, can be incorporated into automated pre-upgrade checks or regularly conducted compliance checks to ensure there are no surprises in the future.
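One way to picture such a pre-upgrade gate is as a list of small checks that must all pass before an upgrade is allowed to start. The Python sketch below is an assumption-laden illustration, not a BackBox feature description: the device record fields, the check functions, and `pre_upgrade_gate` are all hypothetical, and the high-availability check is included only as an example of a "similar" check.

```python
def check_transceivers(device):
    """Return a problem message for each transceiver not in the supported set."""
    supported = device.get("supported_parts", set())
    return [
        f"unsupported transceiver: {part}"
        for part in device.get("transceivers", [])
        if part not in supported
    ]


def check_ha_peer(device):
    """Hypothetical example of a similar check: require a healthy HA peer."""
    return [] if device.get("ha_peer_healthy") else ["HA peer not healthy"]


def pre_upgrade_gate(device, checks):
    """Run every check; the upgrade may proceed only if no check reports a problem."""
    problems = [msg for check in checks for msg in check(device)]
    return (not problems, problems)
```

Because each check is an independent function, a newly discovered failure mode can be added to the list without touching the existing checks.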

Assurance

We all know that bad things happen and that we are measured by how we respond to them. Some also believe that out of adversity comes opportunity. In this case, the MSSP in question seized the opportunity to show a new customer not only how well they respond to unexpected issues but also how they take proactive measures to avoid further issues.

For a new customer experiencing the discomfort of moving critical business infrastructure from one provider to another, this issue undoubtedly caused some concern, but it must surely also have provided reassurance that they had chosen the right service provider.

Outcomes

An MSSP’s new customer achieved compliance without any downtime.
With no- or low-code automation, the MSSP acquired the necessary data in under an hour to develop a well-informed plan for upgrading the entire firewall estate and mitigating risk.
Automated, proactive pre-upgrade and compliance checks help prevent costly interventions.

Conclusion

Having the expertise to resolve issues quickly is a huge benefit for an MSSP or internal NetSecOps team. However, giving your teams the right tools to apply lessons learned and proactively avoid further issues is golden. It’s the kind of difference that makes or breaks reputations.