Getting Data In

Discrepancy between spike shown in metrics data in internal index and what is in an index?

wrangler2x
Motivator

I have a search/alert that runs once an hour and alerts me when certain indexes have more than the usual amount of event data, based on the _internal metrics (a rough sketch of that alert follows the search below). Then I run this search over the previous hour, which shows me where the spike occurred:

index=_internal source=*metrics* group=per_index_thruput series="winevent_index"
| rename series as index
| eval MB=round(kb/1024,3)
| where MB > 1
| stats sum(MB) as MB by index date_hour date_minute
| sort date_hour, date_minute
| addtotals col=true row=false MB
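
A rough sketch of the hourly alert itself, for reference (the 500,000 KB threshold is only a placeholder for whatever counts as unusual for the index):

index=_internal source=*metrics* group=per_index_thruput series="winevent_index" earliest=-1h@h latest=@h
| stats sum(kb) as kb by series
| where kb > 500000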

Alternatively, I might run this search:

index=_internal source=*metrics* group="per_index_thruput" series="winevent_index"
| eval MB=round(kb/1024,3)
| bucket _time span=1m
| stats sum(MB) as MB by _time
| eval mtime=strftime(_time, "%Y-%m-%d %H:%M")
| table mtime MB

In any case, running one of these lets me see where the spike occurred and how many minutes it lasted. Then I run a search on the winevent_index for the time frame where the spike showed in the metrics, widened just a bit on either side. Here what I am looking for is a spike in events per minute, which I can slice and dice by host or whatever. This has worked well for me in identifying where the unusual log data is coming from and what sort of events were involved.
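
A minimal sketch of that drill-down, with the time picker set to the spike window widened a few minutes on either side (host is just one example of a field to split by):

index=winevent_index
| bucket _time span=1m
| stats count as events by _time host
| sort _time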

But recently I saw a spike in the metrics for winevent_index but I could not correlate that spike to a spike in events/minute in the actual index. This has me deeply puzzled. After some reflection, I began to wonder if the metrics include events that have been dropped into nullQueue via a transform.
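For clarity, this is the kind of nullQueue filter I have in mind; the sourcetype, stanza name, and regex here are hypothetical, just to show the general shape in props.conf and transforms.conf:

# props.conf (hypothetical sourcetype)
[WinEventLog:Security]
TRANSFORMS-drop_noise = drop_noise_events

# transforms.conf (hypothetical stanza and regex)
[drop_noise_events]
REGEX = EventCode=4662
DEST_KEY = queue
FORMAT = nullQueue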

MY QUESTION: Is the data for the metrics post-index? Does it reflect what was indexed, or what was received?

Thanks for any insights on this whole issue!

1 Solution

wrangler2x
Motivator
| But recently I saw a spike in the metrics for winevent_index but I could
| not correlate that spike to a spike in events/minute in the actual index.

Figured out what caused this spike and why it was not visible. One of my co-workers had set up a forwarder on another system via the deployment monitor, and that system had logs going back to 2011. This caused a large amount of log data to be ingested and indexed by Splunk. In the metrics data this showed up as a large lump within the one-hour window, but in the index where the data was placed it was spread out over a two-year period, so looking for data indexed during the hour the alert went off showed nothing unusual.
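
In hindsight, one way to catch this kind of thing is to search by index time rather than event time. A rough sketch (run it over All Time, or at least a wide enough event-time range, since the events' own timestamps can be years old; the _index_earliest/_index_latest terms restrict on when the events were indexed):

index=winevent_index _index_earliest=-1h@h _index_latest=@h
| eval indexed_minute=strftime(_indextime, "%Y-%m-%d %H:%M")
| stats count by indexed_minute host
| sort indexed_minute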

This search can be used to see the main contributing hosts by host name and sourcetype. I ran it as a historical search, found the system, and then the light dawned. So I would guess the answer to my question is that the data for the metrics is post-index. Here is the search (with the time window set to the day and hour in question):

index=_internal source=*license_usage.log* type=Usage
| stats sum(b) as bytes by st h
| eval MB = round(bytes/1024/1024,2)
| fields h st MB
| sort -MB
| head 10

