The effects of network downtime and ways to fix it

Even the most meticulously planned and operated networks can still be plagued with bouts of downtime.
Network downtime can be intentional: a planned window in which network engineers shut down the network to implement necessary configuration changes. Other times, though, downtime is an unintentional interruption that causes headaches for administrators. By the time the network is operational again, the damage is already done.

Mere minutes of downtime can cost a business thousands of dollars. When those minutes extend into hours, the consequences can be substantial. Even when network professionals know the source of the outage, it can still take time to get the network back up and running. This can lead not only to a significant loss of revenue for businesses but also to challenges for network teams, end users and the network itself.

Causes of network downtime
According to Frank Kyazze, founder and CEO of cybersecurity consultancy firm GRC Knight, the most common causes of network downtime are the following:

Human error and misconfigurations. Incorrect changes or miscommunication between IT teams, as well as misconfigured devices and improper rules.
Security threats. Cyberattacks and unintentional vulnerabilities in hardware or software.

Human error and misconfigurations
Kyazze said human error, in particular, is a frequent cause of network downtime. For example, if a network team administers a change that denies access to resources a group requires, that group will likely experience downtime.
Similar sentiments were shared by Chris Grundemann, CEO and co-founder of FullCtrl — a network automation and software development company — and co-founder of the Network Automation Forum. Grundemann said one of the most common causes of network downtime is when a network administrator makes a change without realizing how it will affect the network.
“There are outages that are caused by hardware, software failures and software bugs, but they’re less common than the human error side of things,” he said.
Wim Gerrits, founder and CEO of NetYCE, a network automation company, said incorrect configuration changes driven by complexity are among the other top culprits of network downtime. As networks grow more complex, network engineers sometimes don't fully understand how a change relates to the rest of the network, which makes changes more difficult to implement.
Although incorrect changes can lead to downtime, Kyazze said proper change management can help prevent it.
“Change management is everything when it comes to trying to battle against human error,” he said. “Any sort of critical change that could impact a network should be documented, reviewed, tested and approved before occurring, to mitigate the risk of human error leading to network downtime.”
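The four steps Kyazze lists can be treated as gates that a change must pass before deployment. The following is a minimal sketch of that idea, not a description of any particular change management product; the class, field and gate names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# The four gates mirror the steps named in the quote above.
# All identifiers here are hypothetical.
REQUIRED_GATES = ("documented", "reviewed", "tested", "approved")


@dataclass
class ChangeRequest:
    summary: str
    completed_gates: set = field(default_factory=set)

    def complete(self, gate: str) -> None:
        """Record that one gate has been passed."""
        if gate not in REQUIRED_GATES:
            raise ValueError(f"unknown gate: {gate}")
        self.completed_gates.add(gate)

    def missing_gates(self) -> list:
        """Return outstanding gates in the order they are normally completed."""
        return [g for g in REQUIRED_GATES if g not in self.completed_gates]

    def ready_to_deploy(self) -> bool:
        return not self.missing_gates()


change = ChangeRequest("Tighten ACL on the finance VLAN")
change.complete("documented")
change.complete("reviewed")
print(change.ready_to_deploy())  # False
print(change.missing_gates())    # ['tested', 'approved']
```

Blocking deployment until every gate is recorded makes a skipped review or test visible before the change reaches the network.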
Security threats
Security vulnerabilities present in network hardware, software or firewalls can also lead to downtime, Kyazze said. For example, a hacker who intentionally tampers with a device could shut down the network and create downtime. Inexperienced network administrators can also unintentionally create downtime if they misconfigure security policies in devices.

The effects of network downtime on business
When network professionals design a network, they typically implement failover measures so that if something goes down, traffic can reroute to backup links that ensure the network remains operational.
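As a rough illustration of that failover logic, the sketch below picks the highest-priority link that is reported healthy. The link names and the health dictionary are assumptions made for the example; real networks usually make this decision in routing protocols or dedicated hardware rather than in a script.

```python
from typing import Optional


def pick_link(links: list, health: dict) -> Optional[str]:
    """Return the first link, in priority order, that is reported healthy."""
    for link in links:
        if health.get(link, False):
            return link
    return None  # no usable path left: this is the downtime case


links = ["primary", "backup"]  # hypothetical names, highest priority first
print(pick_link(links, {"primary": True, "backup": True}))    # primary
print(pick_link(links, {"primary": False, "backup": True}))   # backup
print(pick_link(links, {"primary": False, "backup": False}))  # None
```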
When a backup link isn’t available, however, downtime is more likely to occur, and network engineers might struggle to identify and remediate the issue, Gerrits said. It’s even more challenging if the issue is something only the now inoperative network can fix.
“If corporate network resources are down and you need them to respond to a network downtime issue, the company might be at a standstill until, miraculously, the network comes back up,” Kyazze said.
When network downtime occurs, organizations aren’t able to conduct basic operations, Gerrits said. Challenges that can occur because of network downtime include the following:

Employees can’t connect to applications.
Employees can’t connect with each other.
Businesses can’t serve customers.

Furthermore, when the network goes down, it can affect other areas of business. Other computer systems and applications rely on the network; when an outage happens, it creates a ripple effect that disrupts the entire IT stack.
“The [network] is fundamental to the IT stack,” Gerrits said. “If the network doesn’t run, there’s no business.”

How to mitigate network downtime
Network teams can fix downtime with network management and monitoring tools that detect or remediate issues. These tools are typically part of a management domain network or a collection of network automation tools.
Network domain management
Network professionals typically use separate networks that operate as part of a management domain, Gerrits said. Management domain tools can help teams monitor network performance, protect the network against threats and fix issues that occur.
Network management domain tools can include the following:

Network monitoring tools. These provide insight into network performance and can detect potential problems before they occur.
Fault management tools. These actively monitor network performance to detect, diagnose, isolate and remediate issues.
Network compliance tools. These audit network devices to ensure they conform with security standards, identify suspicious activity and fix security inconsistencies.
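
To make the compliance item more concrete, here is a small sketch of an audit that compares each device's settings with a required baseline. The baseline keys, device names and settings are invented for illustration, and a real compliance tool would read them from the devices themselves.

```python
# Hypothetical baseline: setting name -> required value.
BASELINE = {"ssh_enabled": True, "telnet_enabled": False, "logging_enabled": True}


def audit(device_settings: dict) -> list:
    """Return the settings that differ from the baseline, sorted by name."""
    return sorted(
        key for key, required in BASELINE.items()
        if device_settings.get(key) != required  # a missing setting counts as a violation
    )


devices = {
    "edge-router-1": {"ssh_enabled": True, "telnet_enabled": False, "logging_enabled": True},
    "access-switch-7": {"ssh_enabled": True, "telnet_enabled": True},
}

for name, settings in devices.items():
    print(name, audit(settings))
# edge-router-1 []
# access-switch-7 ['logging_enabled', 'telnet_enabled']
```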

If an organization doesn’t have separate management networks with these tools, Gerrits said network professionals have to find alternative methods to detect the source of an outage and remediate it.
Network automation tools
Automated network tools can help professionals implement configuration changes, Grundemann said. However, he added that automation doesn’t necessarily remove the possibility of human error. Network administrators write scripts that automatically implement changes, so if a mistake is present in a script, the tool will continue to deploy the change with the error. Some advanced network automation tools have verification capabilities, but they don’t eliminate human error.
“It’s not that automation cures all human error, but it definitely is a big help and makes sure configs stay standardized and that you can have testing built in,” Grundemann said.
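One way to picture "testing built in" is a script that renders configuration from a single template and refuses input that doesn't match the device inventory. This is only a sketch under stated assumptions: the inventory, device names and template text are hypothetical, and the template is vendor-neutral rather than the syntax of any real platform.

```python
from string import Template

# Hypothetical inventory: device -> interfaces that actually exist on it.
INVENTORY = {
    "sw1": {"Gi0/1", "Gi0/2"},
    "sw2": {"Gi0/1"},
}

# One shared template keeps the generated configs standardized.
TEMPLATE = Template("interface $interface\n description $description\n")


def render(device: str, interface: str, description: str) -> str:
    """Render one config snippet, rejecting interfaces the device does not have."""
    if interface not in INVENTORY.get(device, set()):
        raise ValueError(f"{device} has no interface {interface!r}")
    return TEMPLATE.substitute(interface=interface, description=description)


print(render("sw1", "Gi0/2", "uplink to core"))

try:
    render("sw2", "Gi0/2", "uplink to core")  # wrong port for sw2
except ValueError as err:
    print("rejected:", err)
```

The check catches only the mistakes it was written to catch, which matches Grundemann's point that automation helps but does not eliminate human error.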
Rick Osteen, a network engineer with 25 years of experience, said network automation tools can be helpful, but revamped tools with AI capabilities could be more efficient. For example, these tools could have predictive analysis capabilities that detect changes before they occur and reduce how often network professionals identify changes manually.

Prevent downtime before it occurs
Kyazze said one of the most effective ways to fix network downtime is to prepare for it. Thorough preparation can help prevent downtime before it occurs, and it also helps teams understand what to do in the event of an outage. For example, Kyazze said he recommends organizations simulate test environments for scenarios like downtime and experiment with how they respond to an incident.
Osteen said another way teams can prevent downtime is by creating a method of procedure (MOP) to plan how to implement changes. A MOP is a set of instructions that details how to implement a process. Osteen said he recommends network professionals write down each step of the configuration change process and refer to the instructions as they implement the change.
When teams use a MOP, however, it should be a collaborative process, Grundemann said. At least one network professional should work on the MOP itself — which includes the plans and configurations, from the frontend to the backend — and another should approve it.
“A second set of eyes looks at it and makes sure you didn’t misspell an interface name or make a typo,” Grundemann said. “At least two people have signed off on it, and then [they can] go out and implement that.”
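The two-person rule Grundemann describes can be sketched as a MOP object that cannot run until someone other than its author has approved it. The class, the names and the steps below are hypothetical; in practice a MOP is often a written document rather than code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mop:
    author: str
    steps: list
    approver: Optional[str] = None

    def approve(self, reviewer: str) -> None:
        # The second set of eyes must be someone other than the author.
        if reviewer == self.author:
            raise ValueError("a MOP cannot be approved by its own author")
        self.approver = reviewer

    def run(self) -> list:
        if self.approver is None:
            raise RuntimeError("MOP has not been approved")
        # Each step is only numbered and returned here; nothing is executed.
        return [f"{n}. {step}" for n, step in enumerate(self.steps, start=1)]


mop = Mop(
    author="alice",
    steps=["Back up running config", "Apply change", "Verify reachability"],
)
mop.approve("bob")
for line in mop.run():
    print(line)
```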

Network downtime: Positive when planned
The term downtime is often interpreted as negative, but planned downtime can have a positive effect on the network. Administrators schedule downtime when they need to shut down the network infrastructure for upgrades or maintenance purposes.
Because planned downtime is essential to the configuration change process, it's important for network teams to use best practices to avoid an unintentional outage. For example, network teams typically wait to administer changes during inactive hours so that if something goes wrong, it won't affect end users.
In addition, Grundemann said most network professionals design networks with redundancies in place. A network could have additional switches, routers and multiple exit points. Redundancies help the network stay available if hardware fails and keep the network active during upgrades.
If network administrators want to upgrade devices, they can keep the additional devices active while the selected hardware updates. In this case, downtime is limited to the individual devices being upgraded, and the network as a whole remains operational, Grundemann said.
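A rolling upgrade of this kind can be sketched as a loop that takes one device out of service at a time and refuses to start if doing so would leave too few peers active. The function name, device names and `min_active` parameter are assumptions made for the example, and the upgrade step itself is only a placeholder.

```python
def rolling_upgrade(devices: list, min_active: int = 1) -> list:
    """Upgrade devices one at a time and return the order they were upgraded in.

    Raises ValueError if taking any single device offline would leave
    fewer than min_active devices in service.
    """
    if len(devices) - 1 < min_active:
        raise ValueError("not enough redundancy to upgrade without an outage")
    upgraded = []
    for device in devices:
        # The remaining peers carry traffic while this device is offline.
        upgraded.append(device)  # placeholder for the real upgrade step
    return upgraded


print(rolling_upgrade(["core-a", "core-b"]))  # ['core-a', 'core-b']
```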
Although configuration changes could lead to downtime, changes aren’t something to avoid for the sake of preventing downtime. According to Gerrits, most downtime is planned, but network professionals should ensure they understand the complexity of their networks before implementing a change.
“As long as you know that what you’re doing isn’t going to cause something else to fall over, [downtime] isn’t a problem,” Gerrits said.
Deanna Darah is associate site editor for TechTarget’s Networking site. She began editing and writing at TechTarget after graduating from the University of Massachusetts Lowell in 2021.