
Real-Time Data Adding and Monitoring

snehasal
Explorer

Hi Everyone,

I am new to Splunk and am trying to implement the use case below. All of the data is generated in real time and needs to be processed in real time.
I have multiple Workflows running in Jenkins, with multiple Sessions running inside each Workflow. We have nearly 190 Workflows and 1,400 Sessions; about 80% of the Workflows run once a day and the rest run multiple times a day. We have a table in MySQL that records the WorkflowName, SessionName, StartTime, and EndTime. We are trying to achieve the following two things.

1. Data visualization: We want to set up a dashboard that displays the average runtime for each Workflow for each day. If a Workflow runs once a day, its average runtime is just that run's runtime; if it runs three times a day, its average runtime is the sum of all its runtimes for that day divided by the number of runs (see the sketch after the questions below).
Questions:
Which is the best way to add data to Splunk? Should I use the Jenkins App for Splunk, Splunk DB Connect, or a log file that Splunk monitors for real-time data?

2. Alerting system: Here, I want to trigger an email if a Session or Workflow runs beyond its expected finish time. The expected finish time is based on the average runtime for that Session/Workflow over the past 12 months (or some other duration).
Example: say we have a Session 'Temp1' that starts at 11:00 am. The expected runtime for 'Temp1' is 30 minutes, so it should finish by 11:30 am. If 'Temp1' has not finished by 11:30 am, I need to send an alert indicating that 'Temp1' is still running.
Questions:
Should I use a log file, DB Connect, or the Jenkins App for Splunk to achieve this?
Is it possible for Splunk to calculate the expected finish time on its own, based on previous history, and generate the alert accordingly?
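
For illustration, the kind of dashboard search I have in mind would look roughly like this, assuming the MySQL rows end up in Splunk (via DB Connect, the Jenkins app, or a monitored log file) with WorkflowName, StartTime, and EndTime extracted as fields. The index, sourcetype, and timestamp format below are placeholders, not real names from our setup:

    index=jenkins_workflows sourcetype=workflow_runs
    | eval runtime_sec = strptime(EndTime, "%Y-%m-%d %H:%M:%S") - strptime(StartTime, "%Y-%m-%d %H:%M:%S")
    | bin _time span=1d
    | stats avg(runtime_sec) AS avg_runtime_sec BY _time, WorkflowName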

Requesting help with this case.

Thanks,
Sneha Salvi

1 Solution

DalJeanis
SplunkTrust

Here's some advice to keep from stressing your system and your people:

1) You don't want to use the average duration for the expected finish, or about half of your runs will set off the alert. You need something more like (average + 2*stdev), which takes it down to about 2.3% of the jobs setting off the alert. Better yet, calculate an "expected finish" at +1 stdev (which means that about 84% of the jobs should complete by that point) and have a separate "late enough to alert" threshold at +2, +2.5 or +3 stdevs, depending on whether you want roughly 2.3%, 0.7% or 0.1% alerts, respectively. I'd suggest you start at 2 stdevs and see whether that's acceptable.
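
For illustration only, the two thresholds might be computed along these lines; the index, sourcetype, field names, and timestamp format are assumptions for the sketch, not anything from your actual environment:

    index=jenkins_workflows sourcetype=workflow_runs earliest=-12mon
    | eval runtime_sec = strptime(EndTime, "%Y-%m-%d %H:%M:%S") - strptime(StartTime, "%Y-%m-%d %H:%M:%S")
    | stats avg(runtime_sec) AS avg_sec, stdev(runtime_sec) AS stdev_sec BY WorkflowName, SessionName
    | eval expected_finish_sec = avg_sec + 1 * stdev_sec
    | eval alert_after_sec = avg_sec + 2 * stdev_sec

The +1 and +2 multipliers correspond to the roughly 84% and 97.7% completion points described above; swap in 2.5 or 3 if you want fewer alerts.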

2) You don't need real-time alerting for this; you can use a scheduled job at 5-minute increments. If there is some small subset of jobs that needs to run very fast and be alerted on instantly if it goes down, handle that subset separately from the vast majority of workflows that don't need such intensive treatment. Even then, those searches probably don't need to be real-time in the Splunk sense, just run very frequently.
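
For example, the alerting search can be saved as an ordinary scheduled alert on a five-minute cron rather than a real-time search. A savedsearches.conf stanza would look roughly like this; the stanza name, search text, and email address are placeholders:

    [Workflow Overdue Alert]
    search = <alerting search that compares in-flight runs against the expected-duration lookup>
    enableSched = 1
    cron_schedule = */5 * * * *
    dispatch.earliest_time = -15m
    dispatch.latest_time = now
    alert_type = number of events
    alert_comparator = greater than
    alert_threshold = 0
    action.email = 1
    action.email.to = ops-team@example.com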

3) Your past duration calculations should be gathered into a summary index. You can calculate each new day's durations once per day, about an hour or two after midnight, and write them to the summary index. With that done, you can calculate the expected duration and alert duration for each workflow and session and produce a lookup file with those values, all of it completed once per day. The alerting job then uses the lookup table, saving all of those calculations the rest of the time.
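
Roughly, the once-a-day lookup build could look like this, assuming the durations were summarized into an index called summary_runtimes and that a lookup definition named expected_durations exists to receive the results (both names are made up for the sketch):

    index=summary_runtimes earliest=-12mon@mon
    | stats avg(runtime_sec) AS avg_sec, stdev(runtime_sec) AS stdev_sec BY WorkflowName, SessionName
    | eval expected_finish_sec = avg_sec + 1 * stdev_sec
    | eval alert_after_sec = avg_sec + 2 * stdev_sec
    | outputlookup expected_durations

The five-minute alerting job then only has to do a lookup against expected_durations instead of recomputing twelve months of statistics on every run.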

The specifics and sample code for the duration calculation are in your other question at https://answers.splunk.com/answers/550476/search-query-for-nested-jobs.html#answer-549570. You just need to group by _time, binned at the month level (for example), and use sistats instead of stats to collect the results into the summary index.
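
Following that pattern, the daily summary-index feed might look roughly like this when scheduled once a day over the previous day's data (again, every name here is a placeholder):

    index=jenkins_workflows sourcetype=workflow_runs earliest=-1d@d latest=@d
    | eval runtime_sec = strptime(EndTime, "%Y-%m-%d %H:%M:%S") - strptime(StartTime, "%Y-%m-%d %H:%M:%S")
    | bin _time span=1mon
    | sistats avg(runtime_sec), stdev(runtime_sec) BY _time, WorkflowName, SessionName

Save that as a scheduled report with summary indexing enabled and pointed at summary_runtimes; the lookup-building search above then reads it back with an ordinary stats.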


snehasal
Explorer

Thank you for the response.
I agree that going with (average + 2*stdev) is the better option. In my case, creating a summary index and a lookup table for alerting is a good idea.
Summarizing the entire flow:
1. Create a summary index that stores the duration of each Session and Workflow at the end of each day.
2. Once the summary index is built, use it to create a lookup file that holds the expected duration and alert duration for each Workflow and Session.
3. When new data is added to the log file that Splunk is monitoring, Splunk will use the lookup table to see by what time the job should finish and when an alert should be sent (see the sketch below).
4. Once a job is done, its duration can be calculated to plot in the dashboard and, at the same time, appended to the summary index.

Since I have to create dashboards to see trends and also set up an alerting system, I feel that when we calculate the duration of each Session we can append it to the summary index at the same time.
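
As a rough sketch of step 3, with hypothetical names throughout (an index called jenkins_workflows, in-progress runs showing up with an empty EndTime, and the lookup from step 2 defined as expected_durations):

    index=jenkins_workflows sourcetype=workflow_runs
    | where isnull(EndTime) OR EndTime=""
    | eval running_sec = now() - strptime(StartTime, "%Y-%m-%d %H:%M:%S")
    | lookup expected_durations WorkflowName SessionName OUTPUT expected_finish_sec alert_after_sec
    | where running_sec > alert_after_sec

Anything this returns is a run that is past its alert threshold, so the scheduled alert only needs to fire when the result count is greater than zero.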


DalJeanis
SplunkTrust

@snehasal -

1) You don't want to be appending a bunch of detail data to the summary index when that detail is not really going to affect the averages coming out of it, and those averages are what you will be using for decisions. Don't do extra work that won't change the results; just do it once at the end of the day. That's my opinion; others might disagree.

2) You might consider having a periodic job check the log file every 5 minutes (for example) to see which new jobs have started (add them to the "watchlist" csv) and which existing jobs have completed (remove them from the "watchlist" csv and move them to the "completed" csv), and then check the watchlist csv against the expected durations and calculate any alerts required.

Your in-process job reporting is then based primarily on the watchlist csv, with completion-time expectations supplied by the historical lookup. You might decide to keep a job in the watchlist until one cycle (5 minutes) has passed since it completed, just to be able to show the completion on the dashboard for a bit. But having a completed csv allows another way of presenting that, so it might be unneeded.
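
A very rough sketch of that cycle, assuming a watchlist.csv lookup file with WorkflowName, SessionName, and start_sec columns plus the expected_durations lookup from before (all of these names are invented, and grouping by name alone glosses over a session that runs more than once in quick succession):

    index=jenkins_workflows sourcetype=workflow_runs earliest=-15m
    | eval start_sec = strptime(StartTime, "%Y-%m-%d %H:%M:%S")
    | eval end_sec = strptime(EndTime, "%Y-%m-%d %H:%M:%S")
    | append [| inputlookup watchlist.csv]
    | eval start_sec = tonumber(start_sec), end_sec = tonumber(end_sec)
    | stats max(start_sec) AS start_sec, max(end_sec) AS end_sec BY WorkflowName, SessionName
    | where isnull(end_sec)
    | outputlookup watchlist.csv

That keeps only jobs that have started but not finished; completed jobs fall off (a sibling search could write them to a completed.csv). The alert check then runs against the watchlist alone:

    | inputlookup watchlist.csv
    | lookup expected_durations WorkflowName SessionName OUTPUT alert_after_sec
    | eval running_sec = now() - start_sec
    | where running_sec > alert_after_sec

The watchlist.csv file has to exist (even if empty) before the first run, since the inputlookup in the subsearch will complain about a missing file otherwise.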


snehasal
Explorer

Thanks for the response. This sounds like a very good way to achieve what I am planning to do. I am working on it and will try the method you have suggested.
