Applying Machine Learning to DevOps

Posted by ECS Digital on Jun 6, 2017 9:37:02 AM


 



This is a guest blog written by:

Andi Mann, Chief Technology Advocate, Splunk

With contributions by:

Jeff Spencer, Senior Engineer, Splunk

 

There is powerful synergy between DevOps and Machine Learning (ML) – and related capabilities, like Predictive Analytics, IT Operations Analytics (ITOA), Algorithmic IT Operations (AIOps), and Artificial Intelligence (AI).

Conceptually, ML represents the codification and acceleration of Gene Kim’s “Culture of Continuous Learning”. With ML, DevOps teams can mine massive, complex datasets, detect patterns and antipatterns, uncover new insights, iterate and refine queries, and repeat continuously – all at ‘computer speed’.

Similarly, ML is in many ways the next generation of Automation, building on John Willis’ and Damon Edwards’ prescription for ‘CAMS’. With automation, DevOps enables a much faster SDLC, but one that is too opaque, distributed, dynamic, and ephemeral for normal human comprehension. Like automation, however, ML is well suited to handling the velocity, volume, and variety of data generated by new delivery processes and the next generation of composable, atomized, and scaled-out applications.

 

In practice, some key examples of applying ML to DevOps include:

 

Tracking application delivery

Activity data from ‘DevOps tools’ (like Jira, Git, Jenkins, SonarQube, Puppet, Ansible, etc.) provides visibility into the delivery process. Applying ML can uncover anomalies in that data – large code volumes, long build times, slow release rates, late code check-ins – to identify many of the ‘wastes’ of software development, including gold-plating, partial work, inefficient resourcing, excessive task switching, or process slowdowns.
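
As a minimal sketch of the idea, the snippet below flags unusually long builds using a modified z-score (median and median absolute deviation), which is a simple statistical stand-in for a trained model. The function name, the threshold of 3.5, and the sample durations are all illustrative assumptions; in practice the numbers would come from a CI tool's build history.

```python
from statistics import median

def flag_slow_builds(durations, threshold=3.5):
    """Return indices of builds that are unusually slow.

    Uses the modified z-score, 0.6745 * (x - median) / MAD, so that a
    single extreme build does not distort the baseline it is judged against.
    The threshold is an assumed, commonly used default.
    """
    med = median(durations)
    mad = median([abs(d - med) for d in durations])
    if mad == 0:
        # All builds are (nearly) identical; nothing stands out.
        return []
    return [i for i, d in enumerate(durations)
            if 0.6745 * (d - med) / mad > threshold]

# Hypothetical build durations in minutes, oldest first.
build_minutes = [12, 11, 13, 12, 14, 12, 47, 13, 11, 12]
slow_builds = flag_slow_builds(build_minutes)
```

The same approach could be pointed at other delivery metrics mentioned above, such as commit sizes or time between releases.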

 

Ensuring application quality

By analyzing output from testing tools, ML can intelligently review QA results, detect novel errors, and effectively build a test pattern library based on discovery. This machine-driven understanding of a ‘known good release’ helps to ensure comprehensive testing on every release, even for novel defects, raising the quality of delivered applications. 
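
One hedged way to picture a ‘pattern library’ is as a set of normalized error signatures that grows as new failures are discovered. The sketch below is an assumption about how this might look, not a description of any particular testing tool: the normalization rules (masking hex addresses and numbers) and the sample messages are hypothetical.

```python
import re

def signature(message):
    """Reduce an error message to a pattern by masking variable parts."""
    sig = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)  # memory addresses
    sig = re.sub(r"\d+", "<n>", sig)                   # counts, ids, timings
    return sig.strip().lower()

def novel_errors(known_signatures, messages):
    """Return messages whose pattern has not been seen before.

    Each new pattern is added to known_signatures, so the library
    grows by discovery and repeats are reported only once.
    """
    novel = []
    for msg in messages:
        sig = signature(msg)
        if sig not in known_signatures:
            known_signatures.add(sig)
            novel.append(msg)
    return novel

# Hypothetical library built from earlier test runs.
library = {signature("Timeout after 30 ms on node 4")}
new_failures = novel_errors(library, [
    "Timeout after 45 ms on node 7",   # same pattern as a known error
    "NullPointer at 0x7ffabc",         # new pattern
    "NullPointer at 0x1234",           # repeat of the new pattern
])
```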

 

Securing application delivery

Patterns of user behavior can be as unique as fingerprints. Applying ML to Dev and Ops user behaviors can help to identify anomalies that may represent malicious activity. For example, anomalous patterns of access to sensitive repos, automation routines, deployment activity, test execution, system provisioning, and more can quickly highlight users exercising ‘known bad’ patterns – whether intentionally or accidentally – such as coding back doors, deploying unauthorized code, or stealing intellectual property.

 

Managing production

Analyzing an application in production is where machine learning really comes into its own, because of the greater data volumes, user counts, transactions, etc. that occur in prod compared to dev or test. DevOps teams can use ML to learn ‘normal’ patterns – user volumes, resource utilization, transaction throughput, etc. – and subsequently to detect ‘abnormal’ patterns (e.g. DDoS conditions, memory leaks, or race conditions).
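
The ‘learn normal, then detect abnormal’ loop can be sketched in a few lines. The example below assumes a very simple model (a mean and standard deviation per metric) and hypothetical metric names; a real system would likely account for seasonality and correlations between metrics.

```python
from statistics import mean, stdev

def learn_baseline(samples):
    """Learn (mean, stdev) per metric from samples taken during normal operation."""
    baseline = {}
    for metric in samples[0]:
        values = [s[metric] for s in samples]
        baseline[metric] = (mean(values), stdev(values))
    return baseline

def abnormal_metrics(baseline, observation, k=3.0):
    """Return a z-score for each metric more than k standard deviations from normal."""
    flagged = {}
    for metric, (mu, sigma) in baseline.items():
        if sigma == 0:
            continue  # no variation observed, so no scale to judge against
        z = (observation[metric] - mu) / sigma
        if abs(z) > k:
            flagged[metric] = z
    return flagged

# Hypothetical samples from a period regarded as normal.
normal = [
    {"requests_per_s": 100, "memory_mb": 510},
    {"requests_per_s": 102, "memory_mb": 495},
    {"requests_per_s": 98,  "memory_mb": 505},
    {"requests_per_s": 101, "memory_mb": 500},
    {"requests_per_s": 99,  "memory_mb": 490},
]
baseline = learn_baseline(normal)
# Traffic is ordinary but memory has ballooned, as a leak might look.
flagged = abnormal_metrics(baseline, {"requests_per_s": 101, "memory_mb": 900})
```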

 

Managing alert storms

A simple, practical, high-value use of ML is in managing the massive flood of alerts that occurs in production systems. This can be as simple as ML grouping related alerts (e.g. by a common transaction ID, a common set of servers, or a common subnet). Or it can be more complex, such as ‘training’ systems over time to recognize ‘known good’ and ‘known bad’ alerts. This enables filtering to reduce alert storms and alert fatigue.
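
The simple end of that spectrum can be illustrated without any trained model at all. The sketch below groups alerts by transaction ID where one exists and otherwise by the /24 subnet of the reporting host. The alert fields (`txn_id`, `host_ip`, `msg`) and the /24 choice are assumptions made for the example.

```python
from collections import defaultdict
import ipaddress

def group_key(alert):
    """Pick a grouping key: transaction ID if present, else the host's /24 subnet."""
    if alert.get("txn_id"):
        return ("txn", alert["txn_id"])
    subnet = ipaddress.ip_network(f"{alert['host_ip']}/24", strict=False)
    return ("subnet", str(subnet))

def group_alerts(alerts):
    """Collapse a flood of alerts into groups of related messages."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[group_key(alert)].append(alert["msg"])
    return dict(groups)

# Hypothetical alert storm: four alerts, two underlying incidents.
storm = [
    {"txn_id": "t-1", "host_ip": "10.0.2.4", "msg": "checkout latency high"},
    {"txn_id": None,  "host_ip": "10.0.1.5", "msg": "disk nearly full"},
    {"txn_id": "t-1", "host_ip": "10.0.3.8", "msg": "payment call timed out"},
    {"txn_id": None,  "host_ip": "10.0.1.9", "msg": "disk nearly full"},
]
grouped = group_alerts(storm)
```

Here four alerts reduce to two groups, one per likely incident, which is the kind of reduction that eases alert fatigue.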

 

Troubleshooting and triage analytics

This is another area where today’s ML technologies shine. ML can automatically detect and start to intelligently triage ‘known issues’, and even some unknown ones. For example, ML tools can detect anomalies in ‘normal’ processing, and then further analyze release logs to correlate the issue with a new configuration or deployment. Other automation tools can use ML to alert operations, open a ticket (or a chat window), and assign it to the right resource. Over time, ML may even be able to suggest the best fix!
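
The correlation step can be sketched as a lookup: given the time an anomaly was detected, find the most recent deployment that preceded it within some window. The record layout, release IDs, and one-hour window below are hypothetical, and a real triage tool would weigh more evidence than timing alone.

```python
from datetime import datetime, timedelta

def suspect_deployment(anomaly_time, deployments, window=timedelta(hours=1)):
    """Return the latest deployment within `window` before the anomaly, or None."""
    candidates = [
        d for d in deployments
        if timedelta(0) <= anomaly_time - d["time"] <= window
    ]
    return max(candidates, key=lambda d: d["time"], default=None)

# Hypothetical release log.
releases = [
    {"id": "rel-41", "time": datetime(2017, 6, 1, 9, 0)},
    {"id": "rel-42", "time": datetime(2017, 6, 1, 13, 40)},
    {"id": "rel-43", "time": datetime(2017, 6, 1, 15, 0)},
]
suspect = suspect_deployment(datetime(2017, 6, 1, 14, 5), releases)
```

The result could then be attached to the ticket that is opened, giving the assignee a first place to look.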

 

Preventing production failures

ML can go well beyond straight-line capacity planning in preventing failures. ML can map known good patterns of utilization to predict, for example, the best configuration for a desired level of performance; how many customers will use a new feature; infrastructure requirements for a new promotion; or how an outage will impact customer engagement. ML sees otherwise opaque ‘early indicators’ in systems and applications, allowing Ops to start remediation or avoid problems much faster than typical response times allow.

 

Analyzing business impact

Understanding the impact of a code release on business goals is critical to success in DevOps. By synthesizing and analyzing real user metrics, ML systems can detect good and bad patterns to provide an ‘early warning system’ to coders and business teams alike when applications are having problems (e.g. through early reporting of increased cart abandonment or foreshortened buyer journeys) or are being wildly successful (e.g. through early detection of high user registrations or click-through rates).
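
As one hedged illustration of an early warning on cart abandonment, the sketch below compares the abandonment rate before and after a release using a two-proportion z-test. This is a classical statistical check rather than a learned model, and the counts and the alerting threshold of 3 are made-up assumptions.

```python
from math import sqrt

def abandonment_z(before_abandoned, before_carts, after_abandoned, after_carts):
    """Z-score for the change in abandonment rate after a release.

    Positive values mean abandonment went up; large values suggest
    the change is unlikely to be noise.
    """
    p_before = before_abandoned / before_carts
    p_after = after_abandoned / after_carts
    pooled = (before_abandoned + after_abandoned) / (before_carts + after_carts)
    se = sqrt(pooled * (1 - pooled) * (1 / before_carts + 1 / after_carts))
    if se == 0:
        return 0.0  # every cart (or none) was abandoned in both periods
    return (p_after - p_before) / se

# Hypothetical counts: 600 of 1000 carts abandoned before, 700 of 1000 after.
z = abandonment_z(600, 1000, 700, 1000)
should_warn = z > 3  # assumed alerting threshold
```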

 

Of course, there is no easy button for ML, yet. There is no substitute for intelligence, experience, creativity, and hard work. But we are already seeing much of this applied today and, as we continue to push the boundaries, the sky is the limit.
