Sunday, 1 March 2020

How machine learning and automation can modernize the network edge

If you want to know the future of networking, follow the money — right to the edge.
Applications are expected to move from data centers to edge facilities in record numbers, opening up a huge new market opportunity. The edge computing market is expected to grow at a compound annual growth rate of 36.3 percent between now and 2022, fueled by rapid adoption of the “internet of things,” autonomous vehicles, high-speed trading, content streaming and multiplayer games.
What these applications have in common is a need for near zero-latency data transfer, usually defined as less than five milliseconds, although even that figure is far too high for many emerging technologies.  
The specific factors driving the need for low latency vary. In IoT applications, sensors and other devices capture enormous quantities of data, the value of which degrades by the millisecond. Autonomous vehicles require information in real time to navigate effectively and avoid collisions. Financial transactions now occur at sub-millisecond cycle times, leading one brokerage firm to invest more than $100 million to overhaul its stock trading platform in a quest for faster and faster trades. The best way to support such latency-sensitive applications is to move applications and data as close as possible to the data ingestion point, thereby reducing the overall round-trip time.

Operational challenges

As edge computing grows, so do the operational challenges for telecommunications service providers such as Verizon Communications Inc., AT&T Corp. and T-Mobile USA Inc. For one thing, moving to the edge essentially disaggregates the traditional data center. Instead of massive numbers of servers located in a few centralized data centers, the provider edge infrastructure consists of thousands of small sites, most with just a handful of servers. All of those sites require support to ensure peak performance, which strains the resources of the typical information technology group to the breaking point and sometimes beyond.
Another complicating factor is network functions moving toward cloud-native applications deployed on virtualized, shared and elastic infrastructure, a trend that has been accelerating in recent years. In a virtualized environment, each physical server hosts dozens of virtual machines and/or containers that are constantly being created and destroyed at rates far faster than humans can effectively manage. Orchestration tools automatically manage the dynamic virtual environment in normal operation, but when it comes to troubleshooting, humans are still in the driver’s seat. 
And it’s a hot seat to be in. Poor performance and service disruptions hurt the service provider’s business, so the organization puts enormous pressure on the IT staff to resolve problems quickly and effectively. The information needed to identify root causes is usually there. In fact, navigating the sheer volume of telemetry data from hardware and software components is one of the challenges facing network operators today. 

Machine learning and automation 

A data-rich, highly dynamic, dispersed infrastructure is the perfect environment for artificial intelligence, specifically machine learning. The great strength of machine learning is the ability to find meaningful patterns in massive amounts of data that far outstrip the capabilities of network operators. Machine learning-based tools can self-learn from experience, adapt to new information and perform humanlike analyses with superhuman speed and accuracy.  
To realize the full power of machine learning, insights must be translated into action — a significant challenge in the dynamic, disaggregated world of edge computing. That’s where automation comes in.
Using the information gained by machine learning and real-time monitoring, automated tools can provision, instantiate and configure physical and virtual network functions far faster and more accurately than a human operator. The combination of machine learning and automation saves considerable staff time, which can be redirected to more strategic initiatives that create additional operational efficiencies and speed release cycles, ultimately driving additional revenue. 

Scaling cloud-native applications

Until recently, the software development process for a typical telco consisted of a lengthy sequence of discrete stages that moved from department to department and took months or even years to complete. Cloud-native development has largely made this so-called “waterfall” methodology obsolete, replacing it with a high-velocity, integrated approach based on microservices, containers, agile development, continuous integration/continuous deployment and DevOps. As a result, telecom providers roll out services at unheard-of velocities, often multiple releases per week.
The move to the edge poses challenges for scaling cloud-native applications. When the environment consists of a few centralized data centers, human operators can manually determine the optimum configuration needed to ensure the proper performance of the virtual network functions, or VNFs, that make up the application.
However, as the environment disaggregates into thousands of small sites, each with slightly different operational characteristics, machine learning is required. Unsupervised learning algorithms can run all the individual components through a pre-production cycle to evaluate how they will behave in a production site. Operations staff can use this approach to develop a high level of confidence that the VNF being tested is going to come up in the desired operational state at the edge. 
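As a rough illustration of that kind of pre-production check, the sketch below learns a profile from unlabeled metrics gathered on known-good runs and scores a candidate VNF run against it. Everything specific here is an assumption made for the example: the three metrics, the sample values, the threshold and the function names are hypothetical, and the per-metric z-score is a deliberately simple stand-in for whatever unsupervised algorithm a real deployment would use.

```python
from math import sqrt
from statistics import mean, stdev


def fit_profile(samples):
    """Learn a per-metric (mean, standard deviation) profile from unlabeled runs."""
    columns = list(zip(*samples))
    return [(mean(col), stdev(col)) for col in columns]


def anomaly_score(profile, sample):
    """Root-mean-square z-score of one run against the learned profile."""
    z_scores = [(x - m) / s for x, (m, s) in zip(sample, profile)]
    return sqrt(sum(z * z for z in z_scores) / len(z_scores))


# Hypothetical metrics per run: (CPU %, memory MB, p99 latency ms).
baseline_runs = [
    (41.0, 512.0, 3.1),
    (43.5, 530.0, 3.4),
    (39.8, 505.0, 2.9),
    (42.2, 521.0, 3.2),
    (40.6, 517.0, 3.0),
]
profile = fit_profile(baseline_runs)

THRESHOLD = 3.0  # assumed cut-off; a real site would tune this

candidate_ok = (41.5, 519.0, 3.2)   # behaves like the baseline runs
candidate_bad = (78.0, 910.0, 9.5)  # far outside the learned profile

for name, run in [("candidate_ok", candidate_ok), ("candidate_bad", candidate_bad)]:
    score = anomaly_score(profile, run)
    verdict = "ready" if score < THRESHOLD else "needs review"
    print(f"{name}: score={score:.2f} -> {verdict}")
```

Because the profile is learned from the site's own measurements rather than set by hand, the same check could in principle be repeated per site, which is the point of applying it across thousands of slightly different edge locations.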

Troubleshooting at the speed of AI 

AI and automation can also add significant value in troubleshooting within cloud-native environments. Take the case of a service provider running 10 instances of a voice call processing application as a cloud-native application at an edge location. A remote operator notices that one VNF is performing significantly below the other nine.  
The first question is, “Do we really have a problem?” Some variation in performance between application instances is not unusual, so answering the question requires a determination of the normal range of VNF performance values in actual operation. A human operator could take readings of a large number of instances of the VNF over a specified time period and then calculate the acceptable key performance indicator values. That process is time-consuming and error-prone, and it must be repeated frequently to account for software upgrades, component replacements, traffic pattern variations and other parameters that affect performance.
In contrast, AI can determine KPIs in a fraction of the time and adjust the KPI values as needed when parameters change, all with no outside intervention. Once AI determines the KPI values, automation takes over. An automated tool can continuously monitor performance, compare the actual value to the AI-determined KPI and identify underperforming VNFs.
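The two steps described above, deriving an acceptable KPI range and then flagging instances that fall below it, can be sketched in a few lines. This is a minimal illustration rather than how any particular product works: the readings are invented, the KPI (calls processed per second) and the instance names are hypothetical, and a median-based band is used here as one simple, outlier-resistant way to define "normal."

```python
from statistics import median


def kpi_band(readings, k=3.0):
    """Return a (low, high) acceptable range as median +/- k * scaled MAD."""
    med = median(readings)
    mad = median(abs(r - med) for r in readings)
    spread = 1.4826 * mad  # scales the MAD to be comparable to a standard deviation
    return med - k * spread, med + k * spread


def underperformers(current, band):
    """Names of instances whose current reading falls below the band."""
    low, _high = band
    return sorted(name for name, value in current.items() if value < low)


# Hypothetical historical readings: calls processed per second by healthy instances.
history = [980, 1010, 995, 1005, 990, 1000, 1015, 985, 1002, 998]
band = kpi_band(history)

# Hypothetical current readings for the 10 instances at one edge site.
latest = [1001, 992, 1008, 640, 997, 1003, 989, 1011, 996, 1000]
current = {f"vnf-{i}": value for i, value in enumerate(latest)}

print(underperformers(current, band))  # ['vnf-3']
```

Recomputing the band whenever new readings arrive is what lets the KPI follow software upgrades and traffic shifts without an operator recalculating it by hand.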
That information can then be forwarded to the orchestrator for remedial action such as spinning up a new VNF or moving the VNF to a new physical server. The combination of AI and automation helps ensure compliance with service-level agreements and removes the need for human intervention — a welcome change for operators weary of late-night troubleshooting sessions. 

Harnessing the competitive edge

As service providers accelerate their adoption of edge-oriented architectures, IT groups must find new ways to optimize network operations, troubleshoot underperforming VNFs and ensure SLA compliance at scale. Artificial intelligence technologies such as machine learning, combined with automation, can help them do that.
In particular, there have been a number of advancements over the last few years to enable this AI-driven future. They include systems and devices to provide high-fidelity, high-frequency telemetry that can be analyzed, highly scalable message buses such as Kafka and Redis that can capture and process that telemetry, and compute capacity and AI frameworks such as TensorFlow and PyTorch to create models from the raw telemetry streams. Taken together, they can determine in real time if operations of production systems are in conformance with standards and find problems when there are disruptions in operations.
All that has the potential to streamline operations and give service providers a competitive edge — at the edge.