Splunk Search

How to "chain" events together? And how to follow the execution of the whole chain?

3no
Communicator

Hi Guys,

I'm trying to follow the execution of a number of scripts. Here is my problem:

I have a lot of batch scripts running. I know when they start, when they finish, and when they are supposed to finish.

Some of them, but not all, are related to each other (like a chain). For example, if the 3rd script doesn't finish, the 4th one doesn't start.
And if one script in the chain is late (it took too much time to complete), the whole chain will be late.

For now, I can find pretty easily whether a script is late (expectedTime < executionTime), but I want to go a little bit further.
I want to generate an alarm if one of the jobs is late and will cause the whole chain it is part of to be late as well.

I don't know how to represent the relation between the scripts in Splunk (maybe with a lookup? I tried but failed to make it work).

The other problem I'm facing is that the number of scripts in a chain varies (it can be 3, 10, or 14) and the average execution time differs for each script.

Thank you for your time and help.

3no

1 Solution

3no
Communicator

Hi,

Actually, my case is a little harder, DMohn, because if one script is late it doesn't mean that the whole chain will be (we leave some slack time in case something happens). Anyway, thank you guys, your ideas really helped me a lot.
I used a combination of both your answers (I hadn't thought about using case; I was trying to do it with if statements). I'm posting the solution in case anyone is interested:

1) First, the lookup jobordername.csv:

chainName, jobName,  order, maxorder, average_time, finish_time
XXXXXXX,   YYYYYYY1, 1,     9,        30,           18:15
XXXXXXX,   YYYYYYY2, 2,     9,        60,           18:15
etc.

2) Second, the search (I'll try to explain the query as much as possible):

1 - What is the last job that finished?

index=index_name chainName="XXXXXXXXXX" (status="TERMINATED" OR status="FAILURE" OR status="SUCCESS")
| head 1
| lookup jobordername.csv jobName OUTPUTNEW order, maxorder, average_time, finish_time

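2 - Is this job the last one in the chain? (job_left is the number of jobs still to run; foo works out to the order of the job that just finished.)

| eval job_left=maxorder-order
| eval foo=maxorder-job_left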

3 - Classify the job:

| append [| inputlookup jobordername.csv]
| dedup jobName
| table application, boxName, date, hostname, jobName, jobType, status, order, maxorder, average_time, finish_time, foo
| eval job_done=case(order==maxorder AND (status=="SUCCESS" OR status=="TERMINATED"), "Job done", order!=maxorder AND (status=="SUCCESS" OR status=="TERMINATED"), "Not the last Job", status=="FAILURE", "Job failure", isnull(status), "NULL")
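For anyone who wants to check the classification logic outside Splunk, here is a minimal Python sketch of the same branching. It is only a model of the case() expression above, not part of the search; the function name classify_job is made up for illustration. Like case(), it returns the result of the first condition that matches and nothing (None) if no condition matches.

```python
def classify_job(order, maxorder, status):
    """Model of the case() expression: the first matching branch wins."""
    finished = status in ("SUCCESS", "TERMINATED")
    if order == maxorder and finished:
        return "Job done"
    if order != maxorder and finished:
        return "Not the last Job"
    if status == "FAILURE":
        return "Job failure"
    if status is None:
        # Rows appended from the lookup have no status yet.
        return "NULL"
    # case() yields null when no condition matches.
    return None


print(classify_job(9, 9, "SUCCESS"))
print(classify_job(2, 9, "TERMINATED"))
print(classify_job(2, 9, None))
```

Rows that come from the appended lookup carry no status, which is why they end up in the "NULL" branch: these are the jobs that have not run yet.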

4 - How much time will the remaining jobs in the chain take to finish?

| streamstats last(foo) as foo
| eventstats sum(eval(if(order > foo, average_time, null()))) AS time_left
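The remaining-time step boils down to summing average_time over every job whose order is greater than that of the last finished job. A minimal Python sketch of that arithmetic follows, under the assumption that average_time is a plain number in a single unit. The third row (YYYYYYY3, 45) is invented sample data, and remaining_time is a hypothetical helper name.

```python
def remaining_time(jobs, last_finished_order):
    """Sum average_time over the jobs that come after the last finished one."""
    return sum(
        job["average_time"]
        for job in jobs
        if job["order"] > last_finished_order
    )


chain = [
    {"jobName": "YYYYYYY1", "order": 1, "average_time": 30},
    {"jobName": "YYYYYYY2", "order": 2, "average_time": 60},
    {"jobName": "YYYYYYY3", "order": 3, "average_time": 45},  # invented row
]

# Job 1 has finished, so jobs 2 and 3 are still to run: 60 + 45.
print(remaining_time(chain, 1))  # 105
```

One difference from the search: when no job is left, this sketch returns 0, whereas the eventstats sum over only null values would leave time_left empty.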

5 - Will it take longer than the limit?

| eval n=strptime(finish_time, "%H:%M")
| eval y=date+time_left
| where y > n
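The deadline check can be sketched in Python as well. The post does not state the units, so this sketch makes several assumptions: average_time (and therefore time_left) is in minutes, date is the time the last job finished, and finish_time is an HH:MM limit on that same day. If date is an epoch in seconds and average_time is in minutes in your data, the search would need time_left*60 in the addition. The function name and the sample timestamp are made up.

```python
from datetime import datetime, timedelta


def chain_will_be_late(last_finish, minutes_left, finish_time):
    """Return (late, expected_end, limit) for one chain.

    Assumes minutes_left is in minutes and finish_time is an HH:MM
    limit on the same day as last_finish.
    """
    hour, minute = map(int, finish_time.split(":"))
    limit = last_finish.replace(hour=hour, minute=minute, second=0, microsecond=0)
    expected_end = last_finish + timedelta(minutes=minutes_left)
    return expected_end > limit, expected_end, limit


# Invented sample: last job ended at 16:40 with 105 minutes of work left.
late, expected_end, limit = chain_will_be_late(
    datetime(2024, 3, 1, 16, 40), 105, "18:15"
)
print(late, expected_end.strftime("%H:%M"), limit.strftime("%H:%M"))
# True 18:25 18:15
```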

6 - And then just customize the visualization.

| eval n=strftime(n, "%c")
| eval y=strftime(y, "%c")
| table application, boxName, jobName, job_done, n, y
| rename n AS time_limit y AS "Job expected to finish at:"

3no.

aaraneta_splunk
Splunk Employee

@3no - Glad to hear you were able to find a solution to your question. Please don't forget to click "Accept" below your answer to resolve this post and so others can easily find it. Don’t forget to upvote anything that was helpful too. Thanks!


woodcock
Esteemed Legend

You definitely need a lookup (or a case statement inside a macro) to set the thresholds and dependencies. Then you just build out from there.

DMohn
Motivator

You could try to implement your alarms with the help of a lookup.

I could think of a lookup structure like this:

uniqueScriptID, belongsToChain, predecessor, successor
script1, chain1, NULL, script2
script2, chain1, script1, script5
script3, chain2, NULL, script4
script4, chain2, script3, NULL
[...]

If you identify one late script you can identify all scripts within that chain, and trigger your alarms accordingly.
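To make the idea concrete, here is a minimal Python sketch that walks the successor column of the lookup above to list every script that runs after a late one. It is only an illustration of the traversal, not Splunk code; the function name downstream is hypothetical, and it assumes each script has at most one successor, as in the sample rows. A successor that is not listed in the table (script5 here, since the sample is truncated) simply ends the walk.

```python
# Rows copied from the sample lookup: (uniqueScriptID, belongsToChain, predecessor, successor)
ROWS = [
    ("script1", "chain1", None, "script2"),
    ("script2", "chain1", "script1", "script5"),
    ("script3", "chain2", None, "script4"),
    ("script4", "chain2", "script3", None),
]


def downstream(script_id, rows):
    """Return the scripts that run after script_id, in execution order."""
    successor = {sid: succ for sid, _chain, _pred, succ in rows}
    affected = []
    seen = {script_id}  # guards against an accidental loop in the lookup
    current = successor.get(script_id)
    while current is not None and current not in seen:
        affected.append(current)
        seen.add(current)
        current = successor.get(current)
    return affected


print(downstream("script1", ROWS))  # ['script2', 'script5']
print(downstream("script3", ROWS))  # ['script4']
```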

Hope this points you in the right direction.
