Hi Guys,
I'm trying to track the execution of a number of scripts. Here is my problem:
I have a lot of batch scripts running. I know when they start, when they finish, and when they are supposed to finish.
Some of them, but not all, are related to each other (like a chain). For example, if the 3rd script doesn't finish, the 4th one doesn't start.
And if one script in the chain is late (it took too long to complete), the whole chain will be late.
For now, I can find pretty easily whether a script is late (expectedTime < executionTime), but I want to go a little bit further.
I want to generate an alarm if one of the jobs is late and will cause the whole chain it is part of to be late as well.
I don't know how to represent the relation between the scripts in Splunk (maybe with a lookup? I tried but failed to make it work).
The other problem I'm facing is that the number of scripts in a chain varies (it can be 3, 10 or 14) and the average execution time is different for each script.
Thank you for your time and help.
3no
Hi,
Actually my case is a little harder, DMohn, because if one script is late it doesn't mean the whole chain will be (we leave some slack time in case something happens). Anyway, thank you guys, your ideas really helped me a lot.
I used a combination of both your answers (I hadn't thought about using case, I was trying to do it with if statements). I'm posting the solution in case anyone is interested:
1) First the lookup jobordername.csv :
chainName, jobName, order, maxorder, average_time, finish_time
XXXXXXX, YYYYYYY1, 1, 9, 30, 18:15
XXXXXXX, YYYYYYY2, 2, 9, 60, 18:15
etc, etc, etc, ...., .......
2) Second, the search (I'll try to explain the query as much as possible):
1 - What is the last job that finished ?
index=index_name chainName="XXXXXXXXXX" (status="TERMINATED" OR status="FAILURE" OR status="SUCCESS")
| head 1
| lookup jobordername.csv jobName OUTPUTNEW order, maxorder, average_time, finish_time
2 - Is this job the last of the chain ?
| eval job_left=maxorder-order
| eval foo=maxorder-job_left
3 - Classify the job :
| append [| inputlookup jobordername.csv]
| dedup jobName
| table application, boxName, date, hostname, jobName, jobType, status, order, maxorder,average_time, finish_time, foo
| eval job_done=case(order=maxorder AND (status=="SUCCESS" OR status=="TERMINATED"), "Job done", order!=maxorder AND (status=="SUCCESS" OR status=="TERMINATED"), "Not the last Job", status=="FAILURE", "Job failure", isnull(status), "NULL")
4 - How much time will the remaining jobs in the chain take to finish ?
| streamstats last(foo) as foo
| eventstats sum(eval(if(order > foo, average_time, null()))) AS time_left
5 - Will it take more than the limit ?
| eval n=strptime(finish_time, "%H:%M")
| eval y=date+time_left
| where y > n
6 - And then just customize the visualization.
| eval n=strftime(n, "%c")
| eval y=strftime(y, "%c")
| table application, boxName, jobName, job_done, n, y
| rename n AS time_limit y AS "Job expected to finish at:"
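For anyone who wants to check the logic of steps 4 and 5 outside Splunk, here is a minimal Python sketch of the same idea: sum the average times of the jobs that come after the last finished one, add that to the time the last job finished, and compare with the chain's limit. The job names, times and the `minutes_per_unit` parameter are made up for illustration, and it assumes average_time is in minutes (the lookup itself doesn't say which unit it uses).

```python
from datetime import datetime, timedelta

# Hypothetical rows mirroring the lookup columns (values invented for illustration).
CHAIN = [
    {"jobName": "job1", "order": 1, "average_time": 30},
    {"jobName": "job2", "order": 2, "average_time": 60},
    {"jobName": "job3", "order": 3, "average_time": 45},
]


def projected_finish(chain, last_done_order, last_done_at, minutes_per_unit=1):
    """Estimate when the chain ends, given the last job that finished.

    Assumption: average_time is expressed in minutes; change
    minutes_per_unit if the lookup uses another unit.
    """
    # Same idea as sum(eval(if(order > foo, average_time, null()))).
    remaining = sum(
        job["average_time"] for job in chain if job["order"] > last_done_order
    )
    return last_done_at + timedelta(minutes=remaining * minutes_per_unit)


def chain_will_be_late(chain, last_done_order, last_done_at, deadline):
    """True when the projected end of the chain is past the deadline."""
    return projected_finish(chain, last_done_order, last_done_at) > deadline


# Example: job1 finished at 16:45 and the chain must be done by 18:15.
last_done_at = datetime(2024, 1, 1, 16, 45)
deadline = datetime(2024, 1, 1, 18, 15)
finish = projected_finish(CHAIN, 1, last_done_at)
late = chain_will_be_late(CHAIN, 1, last_done_at, deadline)
```

With these made-up numbers, 105 minutes of work are left after job1, so the chain is flagged even though job1 on its own may not have been late.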
3no.
@3no - Glad to hear you were able to find a solution to your question. Please don't forget to click "Accept" below your answer to resolve this post and so others can easily find it. Don’t forget to upvote anything that was helpful too. Thanks!
You definitely need a lookup (or a case statement inside a macro) to set the thresholds and dependencies. Then you just build out from there.
You could try to implement your alarms with the help of a lookup.
I could think of a lookup structure like this:
uniqueScriptID, belongsToChain, predecessor, successor
script1, chain1, NULL, script2
script2, chain1, script1, script5
script3, chain2, NULL, script4
script4, chain2, script3, NULL
[...]
If you identify one late script you can identify all scripts within that chain, and trigger your alarms accordingly.
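To illustrate how such a structure can be used, here is a rough Python sketch (not Splunk syntax) that follows the successor column from a late script to collect everything downstream of it. The dictionary is a hypothetical in-memory copy of the rows above and `downstream` is a name invented for this example. script5 is referenced but not listed in the sample rows, so the walk simply stops there.

```python
# Hypothetical in-memory copy of the lookup:
# uniqueScriptID -> (belongsToChain, predecessor, successor)
LOOKUP = {
    "script1": ("chain1", None, "script2"),
    "script2": ("chain1", "script1", "script5"),
    "script3": ("chain2", None, "script4"),
    "script4": ("chain2", "script3", None),
}


def downstream(script_id, lookup):
    """Return the scripts that run after script_id, in chain order."""
    affected = []
    seen = {script_id}  # guards against an accidental loop in the lookup
    nxt = lookup[script_id][2]
    while nxt is not None and nxt not in seen:
        affected.append(nxt)
        seen.add(nxt)
        row = lookup.get(nxt)  # a successor may be missing from the lookup
        nxt = row[2] if row else None
    return affected


late_script = "script1"
to_alert = downstream(late_script, LOOKUP)
```

In Splunk itself the same walk would have to be expressed with lookups or joins; the sketch only shows which rows the alert needs to reach.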
Hope this points you in the right direction.