Return recent version of event

sdwilkerson · ‎08-03-2010

We have a csv input which is a daily dump from a ticket DB for the current month.

The DB output looks something like this:

TICKET, CREATE_TIME, MODIFIED_TIME, A, B, C, D, E

The daily input will continually re-introduce events that match existing TICKETs to Splunk every time it runs; however, other fields will be updated and filled in over time.

We need to only report on the most recent version of a given ticket event in Splunk.

What I have working now is this:

index="ticket" | top limit=1 MODIFIED_TIME by TICKET | stats list by TICKET

It is reasonably fast over small data groups, but I feel like there must be a better way to do this. Something like this would be better " | stats list by Case_ID max(Modified_Time) " if it were possible.

snip Log Sample

"Case ID","Create Date","Modified Time","Detection Type","Time of Detection","Severity","External Media","Policy Violation","Offline Scan","Wiped","chk_UserRunsWithAdminPriv","chk_SystemIsManagedDesktop","chk_Escalated","IP Address","Hostname","Faceplate Jack","Property Number","Local ID","Group"
"HD194292","7/26/2010 4:40:01 PM","7/27/2010 7:44:35 AM","IPS","","","None","None","None","None","","","","192.168.202.194","HOSTA","","","BOB6243","09-92"
"HD194292","7/26/2010 4:40:01 PM","7/27/2010 7:44:35 AM","IPS","","","None","None","None","None","","","","192.168.202.194","HOSTA","","","BOB6243","09-92"
"HD194292","7/26/2010 4:40:01 PM","7/28/2010 9:30:44 AM","IPS","","IN PROCESS","None","None","None","None","","","","192.168.202.194","HOSTA","","","BOB6243","09-92"
"HD194016","7/23/2010 12:01:56 AM","7/23/2010 1:46:12 PM","Anti-Virus","7/23/2010 1:19:30 AM","Brown","None","None","None","None","","","","192.168.162.209","HOSTB","","000771220","BOB4512","10-06"
"HD194016","7/23/2010 12:01:56 AM","7/23/2010 1:46:12 PM","Anti-Virus","7/27/2010 5:30:30 PM","Blue","None","None","None","None","","","","192.168.162.209","HOSTB","","000771220","BOB4512","10-06"

/snip

Sean

sdwilkerson · ‎08-16-2010

@Lowell:
Case_ID is what is used. Originally I tried to simplify things in the question by making terms generic.
The plan is for imports to be done in batch/bulk on a weekly basis. An analyst will run a remedy "macro" to dump the last 30 days to a local file then will place that file in a Splunk-monitored sinkhole.
I feel like there should be an easier way to set a splunk expression to dedup based on Modified_Time (in a macro) allowing the analysts to decide how to report on it from there.

Lowell · ‎08-06-2010

How are you getting this data into splunk? Are these events exported in bulk on a scheduled basis, or are events written ONLY when there is a some kind of state change? You mentioned that your existing search is fast over a small data set, so are you anticipating a significant growth? (And if so, is the grow due to higher volumes of data or due to duplicate records?)

Lowell · ‎08-06-2010

So you are using the term "TICKET" and "Case_ID" synonymously?

gkanapathy · ‎08-03-2010

I think:

index="ticket" | dedup TICKET

or

index="ticket" | stats first(_time) as _time by TICKET

or

index="ticket" | stats max(MODIFIED_TIME) as _time by TICKET

will also do it, but nothing is going to run any faster than that (or than what you have). The reason is that Splunk needs to look at every event in order to know what tickets actually exist. It's not a question of looking for the next ticket time for known tickets, it a question of not knowing when to stop looking for new ticket numbers.

Most searches run in time according to how many events must be fetched back, which is determined by the base index query (i.e., the part before the first | (pipe character)). There are some special-case processing exceptions (e.g., head) and some that are affected differently by their map-reduce behavior (e.g., transaction) but for most stats and eval commands, things that come after the first | have a very minor effect on the search run time.

You might want to run a periodic job in the background that writes the results into a lookup table to speed up reporting on this, but otherwise, this search needs to look at every event. If there's a way to narrow it down (e.g., if only some of the items in the index are changes or creates, use index=ticket item_type=ticket_change OR item_type=ticket_create)

gkanapathy · ‎08-04-2010

For that you would want to just run one of the above as a periodic job and write it to a lookup.

sdwilkerson · ‎08-04-2010

Gerald,
Lots of ideas here, but I won't know in advance how/what they will want to report on in Splunk. Do you have any thoughts how I could always just dedup TICKET by max(MODIFIED_TIME) so it could be passed in search?

gkanapathy · ‎08-03-2010

If you are looking for a specific TICKET or set of TICKETS though, these can be made fast.

Lowell · ‎08-03-2010

You may have issues doing date comparison (min or max) depending on the format of your timestamp.

Splunk has a very limited concept of data-types basically a field is either a number, or a text string (which is the default). Therefore you will end up with legigraphical sorting of your timestamps and not a true date order. There are a ways of doing this, but the simpilier approach would be to use the timestamp of your event: _time

Have you tried using max(_time) instead of max(MODIFIED_TIME)?

You haven't given any examples, so I'm making this up. If your MODIFIED_TIME field use a date format like "Tue Aug 3 18:00:34 EDT 2010", then simply sorting (or getting min/max, or first/last) for the MODIFIED_TIME may not work the way you want it. Instead, you have to convert the timestamp into a numeric value and use that instead. (If you were using MODIFIED_TIME to feed your event timestamp, this approach wouldn't be necessary.)

index="ticket" | eval mtime=strptime("%a %b %d %H:%M:%S %z %Y", MODIFIED_TIME) | stats min(mtime) as MODIFIED_TIME, first(A), first(B), first(C), first(D), first(E) by TICKET

This may or may not be right. Without some sample data it's hard to say for sure.

Lowell · ‎08-06-2010

Hmmm. I feel like I'm missing something here about what you doing and what you are trying to improve. I've added some comments to the question, maybe some additional info would be beneficial.

sdwilkerson · ‎08-06-2010

Lowell, All ticket-tracking is done through "CREATION_DATE". MODIFIED_TIME is only necessary so we can determine which similar event in Splunk to read.
I realize it isn't a DB, but figure there must be a good way to do this that is more general than what I have now.
I have a log sample I will add to the original question.
Thanks again for all your help.

Lowell · ‎08-03-2010

Perhaps you should use MODIFIED_TIME for your event timestamp instead of CREATE_TIME. Keep in mind that splunk doesn't work like a database. The first event to return is going to be based on your order of your stored events, which is based on _time (events are internally stored in reverse time order.) Again, doing a min/max on MODIFIED_TIME only works out of the box for certain time format strings, like YYYY-MM-DD hh:mm:SS, but not for MM-DD-YYYY, for example.

sdwilkerson · ‎08-03-2010

Lowell,
_time is pulled from "CREATE_DATE" (date TICKET was created) and would be the same for every subsequent pull that included that ticket.

I did find that this also works:
index="ticket" | stats first(MODIFIED_TIME) by TICKET
I would like to get the "stats list" output as well but that doesn't work in combo with "stats first".
Any thoughts on a better way to achieve this (non-stats ideas are welcome too)?

Return recent version of event

ICYMI - Check out the latest releases of Splunk Edge Processor

Introducing the 2024 SplunkTrust!

Introducing the 2024 Splunk MVPs!