Solved: Is there a way, instead of going on the server, to find out if the log files have gone stale?

ddrillic · 08-13-2018

My customer uses the following to monitor their hundreds of forwarders -

| metadata type=hosts index=<customer index> index=os index=perfmon 
| eval host=lower(host) 
| eval _time=recentTime 
| sort host, _time 
| stats latest(_time) as recentTime by host 
| eval LAST=strftime(recentTime,"%a %m/%d/%Y-%T %Z(%z)"), DAYS_AGO=round((recentTime-now())/86400,0)

When a host's recentTime is a few days old (DAYS_AGO shows -5, for example), they come to me and ask me to bounce the server.
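A quick sketch of why DAYS_AGO comes out negative for a stale host. It fakes a recentTime of exactly five days ago with makeresults (the five-day value is made up for illustration) and applies the same expression as the search above:

```spl
| makeresults
| eval recentTime = now() - 5*86400
| eval DAYS_AGO = round((recentTime - now())/86400, 0)
| table recentTime DAYS_AGO
```

Because the expression subtracts now() from recentTime, a host last seen five days ago shows DAYS_AGO of -5. Swapping the operands to now()-recentTime would give a positive age if that reads better on a dashboard.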

When I look at _internal all looks fine -

| metadata type=hosts index=_internal
| eval host=lower(host) 
| eval _time=recentTime 
| sort host, _time 
| stats latest(_time) as recentTime by host 
| eval LAST=strftime(recentTime,"%a %m/%d/%Y-%T %Z(%z)"), DAYS_AGO=round((recentTime-now())/86400,0)

When we go on the server, we see that the monitored files are stale on the file system.

Besides going on the server to look at the file system, is there a simpler way for me or the client to find out that the files are stale?

adonio · 08-13-2018

try this one out:

| tstats max(_time) as data_time where index=* by host | appendcols [| tstats max(_time) as internal_time where index=_* by host ]
| eval now=now()
| eval data_secondes_ago = now-data_time
| eval internal_data_seconds_ago = now-internal_time
| eval data_internal_gap = internal_time-data_time
| eval data_internal_gap_abs = abs(data_internal_gap)
| eval data_latency_true = if(data_secondes_ago>600, "1", "0") 
| eval internal_data_latency_true = if(internal_data_seconds_ago>600, "1", "0") 
| eval host_status = case(data_latency_true == 0 AND internal_data_latency_true == 0, "D. All Good", data_latency_true == 1 AND internal_data_latency_true == 1, "A. Check Server Down", data_latency_true == 1 AND internal_data_latency_true == 0, "B. No Data - Check Applications and Inputs", data_latency_true == 0 AND internal_data_latency_true == 1, "C. No internal data - Check disk size on host")
| eval now_human = strftime(now, "%c")
| eval data_time_human = strftime(data_time, "%c")
| eval internal_time_human = strftime(internal_time, "%c")
| sort host_status

hope it finds a new home 🙂

DalJeanis · 08-13-2018

@adonio - You can't depend on all hosts being present in both lists. appendcols pairs rows by position, not by host value, so it will occasionally screw up the alignment. Better to use one of these two constructions for the aggregation:

| tstats max(_time) as data_time where index=* by host
| append [| tstats max(_time) as internal_time where index=_* by host ]
| stats max(*) as * by host

OR

| tstats max(_time) as internal_time where index=_* by host
| join type=left host [ | tstats max(_time) as data_time where index=* by host]

Notice I've flipped the order of the searches for the join. If a host has any regular record, it will presumably always have an _internal record too (unless you use an extremely fine time range), but the reverse is not always true. Either way, the stats version is probably the preferred option, since it avoids the question of directionality completely.
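For reference, here is a rough sketch of how the append + stats aggregation might slot into the original status search. It is one possible arrangement, not a tested drop-in. The field names data_seconds_ago, internal_seconds_ago, data_stale and internal_stale are made up for this sketch, the 600-second cutoff is just the example value from the thread, and the lower(host) step is borrowed from the question's search so that mixed-case host names land on the same row. The isnull() checks rest on the assumption that a host missing from one list should count as stale for that list; without them, a missing timestamp would not trip the greater-than test.

```spl
| tstats max(_time) as data_time where index=* by host
| append [| tstats max(_time) as internal_time where index=_* by host ]
| eval host = lower(host)
| stats max(*) as * by host
| eval data_seconds_ago = now() - data_time
| eval internal_seconds_ago = now() - internal_time
| eval data_stale = if(isnull(data_time) OR data_seconds_ago > 600, 1, 0)
| eval internal_stale = if(isnull(internal_time) OR internal_seconds_ago > 600, 1, 0)
| eval host_status = case(
    data_stale == 1 AND internal_stale == 1, "A. Check Server Down",
    data_stale == 1 AND internal_stale == 0, "B. No Data - Check Applications and Inputs",
    data_stale == 0 AND internal_stale == 1, "C. No internal data - Check disk size on host",
    true(), "D. All Good")
| eval data_time_human = strftime(data_time, "%c")
| eval internal_time_human = strftime(internal_time, "%c")
| sort host_status
```

Status B is the case the question is about: the forwarder is still reporting to _internal, but the monitored files have stopped producing events.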

adonio · 08-13-2018

thank you @DalJeanis for this important feedback, and for pointing out the possible misalignment!
the ... | stats <function>(*) as * by <field> construction is super useful
i think i prefer that approach, but join will work too
there are times when you will see "real data" but no "internal data". one case is low disk space on the machine where the forwarder is installed: Splunk will not generate its internal data but will keep monitoring and sending "real / live data"
@ddrillic, you are welcome to change the integer in the following eval statements to fit your specific needs, as 600 is only an example:

| eval data_latency_true = if(data_secondes_ago>600, "1", "0")
| eval internal_data_latency_true = if(internal_data_seconds_ago>600, "1", "0")
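One way to make that easier to tune, sketched here under the assumption that a one-hour cutoff suits your sources: keep the number in a single field and reference it from both evals. threshold_seconds is a made-up field name, and 3600 is only an illustration.

```spl
| eval threshold_seconds = 3600
| eval data_latency_true = if(data_secondes_ago > threshold_seconds, "1", "0")
| eval internal_data_latency_true = if(internal_data_seconds_ago > threshold_seconds, "1", "0")
```

These lines would stand in for the two evals above in the full search. Sources that only write every few hours will need a larger value, or they will show up as stale between writes.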

ddrillic · 08-14-2018

Thank you @adonio and @DalJeanis - much appreciated.
