How an MSSP Turned an Upgrade Issue into a Customer Confidence Builder
A blog series that shines a spotlight on real-world moments where network engineers use BackBox security-centric automation to save their organizations from costly downtime and surprises.
Synopsis
During a planned maintenance window for firewall upgrades, a Managed Security Service Provider (MSSP) customer of BackBox encountered an issue that required considerable time to troubleshoot and resolve. This included escalation to senior engineers, on-site physical intervention, and hardware replacement.
The lessons learned were quickly used to write a BackBox automation to audit the rest of the firewall estate, identify devices that were exposed to this issue, and allow planned remediation ahead of any further upgrades.
These steps provided assurance for future work and helped to avoid further costly intervention.
Problem
The MSSP works with a large number of enterprise and government customers. During the onboarding of a new customer, their Palo Alto firewall estate needed to be upgraded to bring it to an appropriate software level. The MSSP team uses BackBox for onboarding activity and therefore built an automation that would take each firewall through the necessary multi-step upgrade from PAN-OS 9.1.x → 10.1.0 → 10.2.0 → 10.2.4-h4.
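To make the multi-step upgrade concrete, here is a minimal Python sketch of stepping a device through the version path quoted above and stopping at the first hop after which it fails a health check. It is not the MSSP's BackBox automation: the `install` and `is_healthy` callables, the starting version `9.1.14`, and the simulated failure are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hop sequence quoted in the text; each entry is the target of one upgrade step.
UPGRADE_PATH: List[str] = ["10.1.0", "10.2.0", "10.2.4-h4"]


def parse_version(version: str) -> Tuple[int, int, int, int]:
    """Turn '10.2.4-h4' into (10, 2, 4, 4) so versions compare numerically."""
    base, _, hotfix = version.partition("-h")
    major, minor, patch = (int(part) for part in base.split("."))
    return (major, minor, patch, int(hotfix) if hotfix else 0)


def remaining_hops(current: str) -> List[str]:
    """Return the hops still ahead of a device running `current`."""
    now = parse_version(current)
    return [hop for hop in UPGRADE_PATH if parse_version(hop) > now]


def run_upgrade(
    current: str,
    install: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> Tuple[List[str], Optional[str]]:
    """Apply each remaining hop in order.

    Returns the hops that completed and the hop after which the device
    failed its health check (None if every hop succeeded).
    """
    completed: List[str] = []
    for hop in remaining_hops(current):
        install(hop)  # hypothetical: push the image and reboot
        if not is_healthy():  # hypothetical: device reachable again?
            return completed, hop
        completed.append(hop)
    return completed, None


# Simulated device that does not come back after the 10.2.0 hop.
installed: List[str] = []
completed, failed_at = run_upgrade(
    "9.1.14",  # hypothetical starting version in the 9.1.x train
    install=installed.append,
    is_healthy=lambda: installed[-1] != "10.2.0",
)
print(completed, failed_at)  # ['10.1.0'] 10.2.0
```

Stopping at the first unhealthy hop, rather than continuing down the path, keeps the failure tied to a specific version when the incident is raised.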
Initially, the work progressed as planned. However, when the upgrade to 10.2.0 completed and the firewall rebooted, it did not come back up. A P1 incident was raised, and the investigation revealed that the device was stuck in a boot loop and could not get past the power-on self-test (POST).
Through manual intervention and trial and error, the MSSP team eventually discovered that some SFPs were causing the boot loop. Once these were removed, the system booted as usual. It transpired that the “faulty” SFPs were non-OEM and had never been listed as supported. Although they had worked fine on earlier software versions, the system no longer recognized them once it was upgraded to 10.2.
When the SFPs were replaced with officially supported models, the firewalls booted successfully and the upgrade cycle was completed.
Impact
To resolve the initial problem, the MSSP team had to bring in additional resources, including escalation to senior engineers and arranging physical intervention. The issue was eventually resolved after 4.5 hours, overrunning the planned maintenance window by 2.5 hours, and an incident was declared. Fortunately, the environment was highly available, so no customer outage was experienced. However, risk exposure increased during that time and, in total, 32 hours of effort were spent on troubleshooting and response.
Mitigation and Avoidance
Following this kind of issue, the next step is to understand and mitigate any additional exposure. Prior to adopting BackBox, the options would have been some combination of:
- Physical inspection – Have local hands and eyes check each device.
- Custom scripting – Find experts with the right programming skills to write a script that logs in to each device separately and interrogates it.
While the second option seems relatively quick and easy, that is not always the case in a large, complex environment. Enter BackBox…
Because BackBox was already used to manage these firewalls for backup, inventory, and task automation, the foundation was in place to access and interrogate the devices for relevant information. With a no-/low-code approach to automation, BackBox provides the framework for users to create automations quickly and efficiently. There’s no need for the expensive Python/Ansible team to get involved.
- An automation task to log in to each device and check the SFPs was created and tested in about 30 minutes.
- The nearly 100 devices were added to a group within BackBox, and a job was quickly configured and run. Within minutes, each of the devices had been interrogated.
- BackBox generated a consolidated report showing which devices had supported SFPs, and which did not.
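The logic behind such an audit can be sketched in a few lines of Python. This is not how BackBox implements it (the point of the story is that no scripting was needed); it only illustrates the classification step. The part numbers, device names, port names, and the `SUPPORTED_PARTS` allow-list are hypothetical, and in practice the inventory would come from querying each firewall.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Hypothetical allow-list; real supported part numbers would come from
# the vendor's compatibility documentation.
SUPPORTED_PARTS: Set[str] = {"OEM-SFP-1G-SX", "OEM-SFP-10G-SR"}


@dataclass
class Transceiver:
    port: str
    part_number: str


def audit_device(transceivers: List[Transceiver]) -> List[Transceiver]:
    """Return the transceivers whose part number is not on the allow-list."""
    return [t for t in transceivers if t.part_number.upper() not in SUPPORTED_PARTS]


def consolidated_report(estate: Dict[str, List[Transceiver]]) -> Dict[str, List[str]]:
    """Split the estate into devices that are clear and devices needing remediation."""
    report: Dict[str, List[str]] = {"supported": [], "unsupported": []}
    for name in sorted(estate):
        offenders = audit_device(estate[name])
        if offenders:
            detail = ", ".join(f"{t.port}={t.part_number}" for t in offenders)
            report["unsupported"].append(f"{name}: {detail}")
        else:
            report["supported"].append(name)
    return report


# Hypothetical inventory for two devices.
estate = {
    "fw-branch-01": [Transceiver("ethernet1/21", "OEM-SFP-10G-SR")],
    "fw-branch-02": [
        Transceiver("ethernet1/21", "GENERIC-SFP-10G"),
        Transceiver("ethernet1/22", "OEM-SFP-10G-SR"),
    ],
}
report = consolidated_report(estate)
for status, entries in report.items():
    print(status, entries)
```

Listing the offending port alongside each device means the report can double as a work order for whoever swaps the modules.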
By the time the dust settled from the initial issue and people were thinking about next steps, the MSSP team had the data in hand to make an informed plan.
These checks, and others like them, can be added to automated pre-upgrade checks or to regularly run compliance checks, so that there are no surprises in the future.
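As a rough illustration of a pre-upgrade gate, the Python sketch below runs a list of checks against a device record and allows the upgrade only if all of them pass. The record fields (`unsupported_sfps`, `ha_peer_healthy`), the device name, and the high-availability check are assumptions added for the example and are not taken from the MSSP's actual checks.

```python
from typing import Callable, Dict, List, Tuple

# A check takes a device record and returns (passed, message).
Check = Callable[[Dict], Tuple[bool, str]]


def check_transceivers(device: Dict) -> Tuple[bool, str]:
    """Fail if an earlier audit flagged any unsupported SFPs on this device."""
    unsupported = device.get("unsupported_sfps", [])
    if unsupported:
        return False, "unsupported SFPs on " + ", ".join(unsupported)
    return True, "all SFPs supported"


def check_ha_peer(device: Dict) -> Tuple[bool, str]:
    """Hypothetical extra check: only upgrade when the HA peer can take over."""
    if device.get("ha_peer_healthy", False):
        return True, "HA peer healthy"
    return False, "HA peer not healthy"


def pre_upgrade_gate(device: Dict, checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run every check; the upgrade may proceed only if none of them fail."""
    failures: List[str] = []
    for check in checks:
        passed, message = check(device)
        if not passed:
            failures.append(message)
    return (not failures, failures)


# Hypothetical device record produced by an earlier audit.
device = {
    "name": "fw-branch-02",
    "unsupported_sfps": ["ethernet1/21"],
    "ha_peer_healthy": True,
}
may_proceed, reasons = pre_upgrade_gate(device, [check_transceivers, check_ha_peer])
print(may_proceed, reasons)
```

Running every check instead of stopping at the first failure gives the team the full list of blockers for a device in one pass.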
Assurance
We all know that bad things happen, and that we are measured on how we respond to those things. Some also believe that out of adversity comes opportunity. In this case, the MSSP in question seized the opportunity to show a new customer not only how well they respond to unexpected issues, but also how they take proactive measures to avoid further issues.
For a new customer going through the discomfort of moving critical business infrastructure from one provider to another, this issue no doubt caused some concern, but it must also have provided reassurance that they had chosen the right service provider.
Outcomes
- An MSSP’s new customer was brought into compliance with no downtime.
- With no-/low-code automation, in less than an hour the MSSP had the data needed to make an informed plan to upgrade the entire firewall estate and mitigate risk.
- Automated, proactive pre-upgrade and compliance checks now help avoid costly intervention.
Conclusion
For an MSSP or internal NetSecOps team, having the expertise to quickly resolve issues is a huge benefit. However, giving your teams the right tools to turn lessons learned into proactive checks that prevent further issues is golden. It’s the kind of difference that makes or breaks reputations.