Networks are becoming increasingly complex, making effective and efficient network management a challenge. With new and emerging technologies and the increasing adoption of cloud, users expect faster network speeds and seamless network availability. In addition, security threats are more advanced and agile. As the network incorporates more devices, tools, applications, systems, and now work-from-home users, the complexity escalates. A larger network also has more potential points of failure.
Networks now need to be intelligent. Artificial intelligence (AI) and machine learning (ML) will be critical in automating network operations and optimizing end user experience. This paper looks at how AI, ML, and automation can change the existing network operations and simplify the life of network engineers. It explores how businesses can leverage AI/ML to make a true self-healing autonomous network, going beyond automation and work done by bots.
Effective network operations in a new age technology world
With a large number of applications being developed every day, the ‘networking’ stack plays a vital role in ensuring secure, anytime access, regardless of where the applications are hosted: on premises or in the cloud.
The pressure on the network to adapt quickly to new and emerging technologies while ensuring a seamless user experience only amplifies the network’s complexity and the effort it takes to manage it. Depending on the size of the infrastructure and the complexity of the design, myriad issues can come up on a daily basis. For example, assigning the wrong VLAN to a port or misconfiguring a VLAN can take down multiple ports and circuits, because spanning tree may block the affected ports. A small change, misconfiguration, or incident can have a significant impact on the business.
Robust change management and incident management systems help in avoiding such issues. However, network complexity gives rise to many such scenarios, and a solution does not come easily because one size does not fit all. For instance, checking a layer 2 issue when there are 10 switches to manage is radically different from doing so in a domain with 100 switches. There is no efficient way to simulate such issues, because replicating the existing configuration and device set adds complexity and requirements of its own.
Simplifying configuration with software-defined networking
Software-defined networking (SDN) has eased configuration by separating the network control plane from the forwarding plane. It has enabled a ‘single pane of glass’, wherein all the nodes can be managed and controlled from a single controller and management dashboard. The traditional configuration is abstracted behind a user-friendly graphical user interface. The time spent configuring routing policies by logging in to the command-line interface of each router or switch is dramatically reduced: the policy is defined once in the single user interface, and the controller pushes it to all the nodes where it needs to be configured.
However, even though configuration is simplified, alarms and events have increased in number and become more complex. Consider the connectivity from a user to an application. What looks like a single entity has multiple components: the LAN (wired or Wi-Fi), the WAN (traditional or software-defined), and the hosting location (the data center or the cloud). Each of these is monitored by its own controller, and each controller logs its alerts and incidents to the organization’s IT Service Management desk.
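The push model described above can be illustrated with a minimal sketch. The `Node` and `Controller` classes, the policy name, and the node names are all hypothetical and do not correspond to any vendor's SDN API; a real controller would deliver the policy over a southbound protocol rather than store it in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical managed router or switch that records applied policies."""
    name: str
    policies: dict = field(default_factory=dict)

    def apply(self, policy_name, policy_body):
        # A real node would receive this over a southbound API; here we only store it.
        self.policies[policy_name] = dict(policy_body)


@dataclass
class Controller:
    """A hypothetical central controller holding the inventory of nodes."""
    nodes: list = field(default_factory=list)

    def push_policy(self, policy_name, policy_body, targets=None):
        """Apply one policy to every node, or only to the named targets."""
        updated = []
        for node in self.nodes:
            if targets is None or node.name in targets:
                node.apply(policy_name, policy_body)
                updated.append(node.name)
        return updated


controller = Controller(nodes=[Node("edge-1"), Node("edge-2"), Node("core-1")])
updated = controller.push_policy(
    "prefer-mpls",
    {"match": "voice", "next_hop": "mpls"},
    targets={"edge-1", "edge-2"},
)
```

The policy is defined once and the loop replaces the per-device login; `updated` lists the nodes that received it.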
An application can slow down due to network latency or high response time from the application server itself. Earlier, these were marked as network issues. Now, leveraging new gen solutions, we are able to garner better insights into these issues and classify these in two different domains – network and applications – thereby, ensuring issues are assigned to the appropriate teams for resolution.
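As a rough illustration of splitting slowness into the two domains, the sketch below compares a measured network round-trip time and a server response time against thresholds. The function name and the threshold values are assumptions chosen for the example, not figures from any monitoring product.

```python
def classify_slowdown(network_rtt_ms, server_response_ms,
                      rtt_limit_ms=100.0, server_limit_ms=500.0):
    """Return which domain a slow transaction should be routed to.

    The limits are illustrative defaults; in practice they would come from
    baselines for the specific application and path.
    """
    network_slow = network_rtt_ms > rtt_limit_ms
    server_slow = server_response_ms > server_limit_ms
    if network_slow and server_slow:
        return "network+application"
    if network_slow:
        return "network"
    if server_slow:
        return "application"
    return "none"
```

For example, `classify_slowdown(20.0, 900.0)` routes the issue to the application team, because the network round trip is within its limit while the server response is not.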
Similarly, an overlay tunnel may go down because a port on the underlay device is flapping or has gone down. In such scenarios, each endpoint logs its own events: the SD-WAN controller logs an incident for the tunnel going down, and the underlay device logs an event for the port going down. With multiple incidents being logged from multiple controllers and underlay devices, ITOps teams find it very difficult to correlate them. When separate teams handle issues from different components, the correlation becomes even more difficult.
The focus of the operations teams is on configuration and troubleshooting. The configuration part is now simplified with next-gen solutions, but troubleshooting has become more complex.
Is there a solution that can ease the troubleshooting or issue resolution process as well? The answer is a resounding ‘yes’.
Intelligent network with AI and ML
Networks need to be intelligent to meet the dynamic needs of the digital age. AI and ML play a critical role in enabling this by automating and infusing intelligence in network operations.
Figure 1 highlights a solution that listens to all alerts and events generated by different endpoints (network devices, SDN controllers, SD-WAN edge devices, Wi-Fi controllers, etc.). The ML model is trained to learn from these events and then correlate them so that they are presented as a single issue for the operations team to look into. This helps all the teams find the root cause faster, enabling quicker resolution and significantly reducing the MTTR (mean time to resolve).
In addition, the AI/ML-based solution can help predict issues in the system. For example, the solution monitors interface utilization and learns the traffic pattern, that is, the times at which traffic surges and dips. Over a period of time, the solution can predict the utilization at a given time in the future. This helps with capacity planning, which previously relied on spreadsheets and ad hoc calculations and was prone to human error.
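The correlation idea can be sketched as follows. This is a simple rule-based stand-in, not the trained ML model described above: it groups events that share a root device within a time window. The `DEPENDS_ON` map, device names, event kinds, and the 60-second window are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str       # system that raised the event, e.g. "sdwan-controller"
    device: str       # device or tunnel the event is about
    kind: str         # e.g. "port_down", "tunnel_down"
    timestamp: float  # seconds


# Hypothetical topology: the underlay device each overlay tunnel depends on.
DEPENDS_ON = {"tunnel-7": "switch-3", "tunnel-9": "switch-5"}


def correlate(events, window_seconds=60.0):
    """Group events into incidents by root device within a time window."""
    incidents = []
    for event in sorted(events, key=lambda e: e.timestamp):
        # Map an overlay object to its underlay device; others map to themselves.
        root = DEPENDS_ON.get(event.device, event.device)
        for incident in incidents:
            recent = event.timestamp - incident["last_seen"] <= window_seconds
            if incident["root"] == root and recent:
                incident["events"].append(event)
                incident["last_seen"] = event.timestamp
                break
        else:
            # No open incident matched, so start a new one.
            incidents.append(
                {"root": root, "events": [event], "last_seen": event.timestamp}
            )
    return incidents


events = [
    Event("sdwan-controller", "tunnel-7", "tunnel_down", 105.0),
    Event("underlay-syslog", "switch-3", "port_down", 100.0),
    Event("wifi-controller", "ap-12", "ap_offline", 400.0),
]
incidents = correlate(events)
```

Here the tunnel and port events collapse into one incident rooted at `switch-3`, while the unrelated Wi-Fi event stays separate. A learned model would replace the static dependency map and fixed window with relationships inferred from historical events.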
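A minimal sketch of learning a traffic pattern is shown below. It uses a plain hour-of-day average as a stand-in for whatever model a production solution would use, and the sample history and function names are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def learn_hourly_profile(samples):
    """Build an average utilization per hour of day.

    samples: iterable of (hour_of_day, utilization_percent) pairs.
    """
    by_hour = defaultdict(list)
    for hour, utilization in samples:
        by_hour[hour % 24].append(utilization)
    return {hour: mean(values) for hour, values in by_hour.items()}


def predict(profile, hour):
    """Return the predicted utilization for an hour, or None if never observed."""
    return profile.get(hour % 24)


def hours_over_threshold(profile, threshold):
    """List the hours whose predicted utilization meets or exceeds the threshold."""
    return sorted(h for h, u in profile.items() if u >= threshold)


# Invented history: (hour, percent utilization) observations for one interface.
history = [(9, 70.0), (9, 80.0), (14, 40.0), (14, 50.0), (2, 5.0)]
profile = learn_hourly_profile(history)
```

With this history, `predict(profile, 9)` averages the two 09:00 observations, and `hours_over_threshold(profile, 60.0)` flags the hours a capacity planner would want to review. An averaging profile ignores trend and weekday effects, so it is only a starting point.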
Wipro is developing a bot that listens to incidents and follows a defined standard operating procedure for each particular incident. This approach reduces the time an L1 engineer spends on basic troubleshooting. Automating the L1 task enabled Wipro to save 110 person-hours in managing 60 sites for a customer.
With the power of AI and ML, solutions can be developed that learn the types of issues and their resolutions. Once the model is trained, it will suggest fixes for the issues raised. Eventually, the number of issues to be handled manually will fall, both because related events are correlated and because fixes are offered as autosuggestions. As more of the suggested fixes match the resolutions that are actually applied, the solution can move from suggesting fixes to providing them, based on its data model.
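The general shape of such a bot can be sketched as a lookup from incident type to an ordered list of steps. This is an illustrative assumption about the design, not Wipro's implementation; the step functions, incident types, and field names are hypothetical, and the steps only return notes instead of contacting devices.

```python
def check_interface(incident):
    # Placeholder: a real step would query the device for interface status.
    return f"checked interface status on {incident['device']}"


def check_reachability(incident):
    # Placeholder: a real step would probe the device.
    return f"checked reachability of {incident['device']}"


def collect_logs(incident):
    # Placeholder: a real step would pull recent logs for the ticket.
    return f"collected logs from {incident['device']}"


# Hypothetical standard operating procedures, keyed by incident type.
SOPS = {
    "port_down": [check_interface, collect_logs],
    "device_unreachable": [check_reachability, check_interface, collect_logs],
}


def run_sop(incident):
    """Run the SOP for the incident type, or escalate if none is defined."""
    steps = SOPS.get(incident["type"])
    if steps is None:
        return {"status": "escalated", "notes": []}
    notes = [step(incident) for step in steps]
    return {"status": "triaged", "notes": notes}


result = run_sop({"type": "port_down", "device": "switch-3"})
```

Incident types without a defined procedure are escalated rather than guessed at, which keeps the bot limited to the basic troubleshooting it was given.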
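One simple way to picture fix suggestion is nearest-match lookup against past incidents. The sketch below uses word-overlap (Jaccard) similarity in place of a trained model; the `HISTORY` entries, the fixes in them, and the `min_score` threshold are invented for illustration.

```python
def tokens(text):
    """Split a description into a set of lowercase words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Invented examples of past incidents and the fixes that resolved them.
HISTORY = [
    ("port flapping on access switch", "replace faulty cable or SFP"),
    ("tunnel down after underlay port down", "restore underlay port, then verify tunnel"),
    ("high interface utilization on wan link", "review QoS policy and capacity"),
]


def suggest_fix(description, history=HISTORY, min_score=0.3):
    """Return (fix, score) for the closest past incident, or (None, score)."""
    query = tokens(description)
    best_score, best_fix = 0.0, None
    for past_description, fix in history:
        score = jaccard(query, tokens(past_description))
        if score > best_score:
            best_score, best_fix = score, fix
    if best_score < min_score:
        return None, best_score
    return best_fix, best_score


fix, score = suggest_fix("port flapping on core switch")
```

The score gives a rough confidence: a suggestion is only offered when the new incident is close enough to a known one, and the same threshold could later gate whether a fix is applied automatically.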
Putting networks on autopilot mode
A network that can fix and optimize itself without human intervention will revolutionize network management. Using operational data, AI and ML can train software-defined networks to take on more of their own management. Together, automation, AI, and ML will move network management into autopilot mode.
Neelakantan R
Automation and Network Lead - CIS Network Practice
Wipro
Neelakantan has 11 years of experience in the field of Information Technology and Network Infrastructure. He has expertise in SDN, SDWAN, Network engineering, Automation and Project Management. His key area of focus is to benchmark various technology solutions and develop applications that further enhance user experience.
He holds multiple industry leading certifications like CCNA, CCNP, VMware Certified Professional - Network Virtualization (NSX-T), Red Hat Certified Ansible Automation Specialist, Red Hat Certified Delivery Specialist – Container Platform, Cisco Black Belt on DevNet, ACI, DNAC and SDWAN tracks, and Aviatrix Certified MCNA.