
Tech Talk | Getting the Most Out of Event Correlation and Alert Storm Detection in Splunk ITSI

LesediK
Splunk Employee

Getting the Most Out of Event Correlation and Alert Storm Detection in Splunk IT Service Intelligence

During a recent Observability Edition Tech Talk, Diving Deeper with AIOps, attendees joined Jeff Wiedemann, Principal Observability Strategist at Splunk, to hear how Splunk ITSI can reduce alert noise, provide business context, and help you be proactive instead of reactive. As many of you know, noisy alerts, a lack of business context, and reactive operations are common problems in organizations.

 


See the key moments from the live Diving Deeper with AIOps Tech Talk below!


 

Trends in Alerts and Event Correlation

Customers still face challenges such as noisy alerts, a lack of business context, and reactive operations. These lead to outages and downtime, and result in costly, disruptive incidents.

 


 

Splunk Observability Cloud and Splunk ITSI: Reducing Noise and Understanding the Environment

By grouping related alerts together, Splunk Observability Cloud and Splunk ITSI enabled the cloud operations team to reduce noise and ask broader questions about the environment: Is it healthy? What is the incoming alert volume? What are the MTTA and MTTR? Are we in the middle of an alert storm?


Alert Storm Detection - Understanding Incoming Alerts and Episode Analytics

The Alert Storm Detection KPI helps detect alert and episode storms by monitoring the volume of incoming alerts, the number of episodes created, and the aggregation policies used to create them.
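As a rough illustration of the idea behind a volume-based storm KPI, the Python sketch below counts alerts per time bucket and flags buckets whose count is well above the typical level. This is not how ITSI implements the KPI; the function name, the 5-minute bucket size, and the median-times-multiplier rule are all assumptions made for the example.

```python
from collections import Counter
from statistics import median

def find_storm_buckets(alert_times, bucket_seconds=300, multiplier=3.0):
    """Return start times (epoch seconds) of buckets whose alert count
    exceeds `multiplier` times the median count of non-empty buckets."""
    counts = Counter((int(t) // bucket_seconds) * bucket_seconds for t in alert_times)
    if not counts:
        return []
    baseline = median(counts.values())
    return sorted(start for start, n in counts.items() if n > multiplier * baseline)

# Hypothetical data: two alerts in each of six 5-minute buckets,
# followed by a burst of 20 alerts inside a single bucket.
quiet = [b * 300 + offset for b in range(6) for offset in (10, 200)]
burst = [1800 + i for i in range(20)]
storm_buckets = find_storm_buckets(quiet + burst)
print(storm_buckets)
```

Only the bucket that contains the burst is reported, because its count is far above the median of the quieter buckets.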

 


Want to learn more? Check out the entire Tech Talk


 

Splunk IT Service Intelligence (ITSI) can help reduce alert noise through deduplication and grouping, provide alert and episode monitoring and storm detection, and support continuous improvement analytics. Through the Splunk ITSI alert pipeline, external alerts can be brought into Splunk and surfaced in ITSI as notable events using correlation searches. The Splunk ITSI Content Pack for Monitoring and Alerting unlocks these capabilities.
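To make deduplication and grouping concrete, here is a minimal Python sketch of both steps on a handful of alerts. It is a simplification for illustration, not ITSI's aggregation policy logic: the field names (`host`, `check`, `service`), the 60-second window, and both function names are hypothetical.

```python
from collections import defaultdict

def deduplicate(alerts, window_seconds=60):
    """Drop an alert if one with the same (host, check) was kept
    less than `window_seconds` earlier."""
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        key = (alert["host"], alert["check"])
        if key in last_kept and alert["time"] - last_kept[key] < window_seconds:
            continue  # duplicate of a recently kept alert
        last_kept[key] = alert["time"]
        kept.append(alert)
    return kept

def group_into_episodes(alerts, group_field="service"):
    """Group alerts that share the same value of `group_field`."""
    episodes = defaultdict(list)
    for alert in alerts:
        episodes[alert[group_field]].append(alert)
    return dict(episodes)

# Hypothetical alerts; the second one repeats the first within the window.
alerts = [
    {"time": 0, "host": "web1", "check": "cpu", "service": "checkout"},
    {"time": 20, "host": "web1", "check": "cpu", "service": "checkout"},
    {"time": 30, "host": "web2", "check": "cpu", "service": "checkout"},
    {"time": 40, "host": "db1", "check": "disk", "service": "inventory"},
    {"time": 90, "host": "web1", "check": "cpu", "service": "checkout"},
]
unique_alerts = deduplicate(alerts)
episodes = group_into_episodes(unique_alerts)
episode_sizes = {service: len(items) for service, items in episodes.items()}
print(episode_sizes)
```

Five raw alerts become four after deduplication, and those four collapse into two groups, one per service.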

 

 

Detecting Alert and Episode Storm Activity with Splunk ITSI

After reducing alert noise, more questions arise: Is the environment healthy? Is incoming alert volume normal? Are there any high-volume alerts that need to be tuned? And are we in the middle of an alert storm? Splunk analytics can provide the insights to answer these questions.


How to Power the Episode Analytics Service Tree

Powering the Episode Analytics service tree takes three steps. First, run an entity discovery search to find all the aggregation policies. Next, turn those aggregation policies into entities. Finally, enable the Splunk ITSI Event Analytics, Episode Analytics, and Alert Analytics services.

 

 

Are you tracking the volume of incoming alerts in your environment? If you're not, you should be! It's an essential KPI for understanding what's going on in the environment. The Splunk ITSI Content Pack for Monitoring and Alerting provides a variety of KPIs to help you do this, including Incoming Alerts by Monitoring Tool, Incoming Alerts by Severity, and Incoming Alerts by Source. It also includes Episode Analytics, which tracks the volume of new episodes being created in the environment.

 

 

How to Set Up Proactive Alert Storm Detection in Splunk ITSI

This dashboard helps you apply adaptive thresholding to the Alert Storm Detection KPI and use service monitoring correlation searches to produce notable events when the KPI goes critical. The Splunk ITSI Alert and Episode Monitoring Aggregation Policy then lets you configure proactive actions to take when an alert or episode storm is detected.
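The general idea of a threshold that adapts to time of day can be sketched in a few lines of Python. This is only an illustration of the concept, not ITSI's adaptive thresholding algorithm: the per-hour baseline, the mean-plus-k-standard-deviations rule, and the function names are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

def hourly_thresholds(history, k=3.0):
    """history: iterable of (hour_of_day, alert_count) pairs.
    Returns {hour: mean + k * population standard deviation} per hour."""
    by_hour = defaultdict(list)
    for hour, count in history:
        by_hour[hour].append(count)
    return {hour: mean(c) + k * pstdev(c) for hour, c in by_hour.items()}

def is_critical(hour, count, thresholds):
    """True when the count exceeds the learned threshold for that hour.
    Hours with no history are never flagged."""
    return hour in thresholds and count > thresholds[hour]

# Hypothetical history: quiet at 02:00, busy at 14:00.
history = [(2, 4), (2, 6), (2, 5), (14, 40), (14, 50), (14, 45)]
thresholds = hourly_thresholds(history)

print(is_critical(2, 30, thresholds))   # 30 alerts at 02:00
print(is_critical(14, 30, thresholds))  # 30 alerts at 14:00
```

The same count of 30 alerts is flagged at 02:00, where the baseline is low, but not at 14:00, where that volume is normal. A static threshold cannot make that distinction.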


Understanding the Fields to Analyze Feature on a Dashboard Panel

The Fields to Analyze input lets users choose which fields the dashboard plots over time, which helps identify what might be causing an alert storm. The Dynamic Alert Clustering input provides on-the-fly grouping of alerts based on typical aggregation policies.
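Conceptually, analyzing a field during a storm comes down to asking which values of that field dominate the alert stream. The Python sketch below shows that calculation on made-up data; the field names and the `field_distribution` helper are hypothetical and are not part of the dashboard.

```python
from collections import Counter

def field_distribution(alerts, field):
    """Count how often each value of `field` appears, most common first."""
    return Counter(a.get(field, "unknown") for a in alerts).most_common()

# Hypothetical storm: most alerts come from one host via one monitoring tool.
storm_alerts = (
    [{"host": "web1", "source": "nagios"}] * 8
    + [{"host": "web2", "source": "nagios"}] * 2
    + [{"host": "db1", "source": "cloudwatch"}] * 1
)

host_counts = field_distribution(storm_alerts, "host")
source_counts = field_distribution(storm_alerts, "source")
print(host_counts)
print(source_counts)
```

Here the `host` field points at a single noisy host, while `source` mostly reflects which tool reports on it. Switching the analyzed field changes which explanation stands out.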


 

How to Use Historical Analysis for Continuous Improvement in Your Operations Center

Are you a leader at an operations center looking to make smart decisions about alert tuning, staffing, and grouping? The Splunk ITSI Event and Incident Operations Posture dashboard can help! It gives you a detailed view of how your teams have been operating over the last 30 days or more, so you can understand how your team is functioning and how alerts have behaved over time.

This dashboard shows the number of episodes created, the number left unacknowledged, and the rate of acknowledgement. You can also compare episodes of different severities and filter episodes by acknowledgement status or severity to understand how your teams handle critical versus non-critical episodes.
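For readers who want to see how such figures are derived, the Python sketch below computes episodes created, episodes unacknowledged, acknowledgement rate, and mean time to acknowledge (MTTA) per severity. The record layout (`severity`, `created_at`, `acked_at`) and the function name are assumptions for the example, not the dashboard's actual data model.

```python
from statistics import mean

def acknowledgement_stats(episodes):
    """Return per-severity counts, acknowledgement rate, and MTTA in seconds."""
    stats = {}
    for severity in sorted({e["severity"] for e in episodes}):
        group = [e for e in episodes if e["severity"] == severity]
        acked = [e for e in group if e["acked_at"] is not None]
        stats[severity] = {
            "created": len(group),
            "unacknowledged": len(group) - len(acked),
            "ack_rate": len(acked) / len(group),
            # MTTA is only defined when at least one episode was acknowledged.
            "mtta_seconds": mean(e["acked_at"] - e["created_at"] for e in acked)
            if acked else None,
        }
    return stats

# Hypothetical episodes; times are seconds since the start of the window.
episodes = [
    {"severity": "critical", "created_at": 0, "acked_at": 120},
    {"severity": "critical", "created_at": 100, "acked_at": 400},
    {"severity": "low", "created_at": 0, "acked_at": None},
    {"severity": "low", "created_at": 50, "acked_at": 1850},
]
stats = acknowledgement_stats(episodes)
for severity, values in stats.items():
    print(severity, values)
```

Comparing the two severities side by side shows the kind of contrast the dashboard is meant to surface: critical episodes are all acknowledged within minutes, while low-severity ones wait longer or go unacknowledged.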

Alert storm detection and triage capabilities are built in. Visualizations such as field value distributions and time series charts help you find the source of a storm, and you can customize the fields analyzed so they are relevant to your organization and alerts.

 

Reducing Alert Noise and Providing Business Context with Splunk IT Service Intelligence (ITSI)

Watch the entire Tech Talk, Diving Deeper with AIOps, for a full discussion of how IT Service Intelligence can help tame alert storms. We hope that this Tech Talk was useful, and we look forward to seeing you at .conf to learn more.

We highly recommend watching Part 1: Getting Started with AIOps: Event Correlation Basics and Alert Storm Detection in Splunk IT Service Intelligence before Part 2: Diving Deeper with AIOps, because Part 2 is a much deeper dive into the concepts and capabilities covered in Part 1.

Don't forget to check out Tech Talks right here on the Community site for additional resources and answers to your questions.

 

 

 
