I've found some logs in our Splunk environment that seem to be duplicates (they differ only by their srcip field, which means one is coming directly from a client while the other comes from a syslog server). So far the only way I've found to determine whether the entries are actually duplicates is to export the results into different files based on srcip, remove the srcip field, and diff the resulting files. I'd really like to find a way to pull this comparison off in Splunk, but I haven't been able to so far. Does anyone have any ideas about how to do this?
EDIT: Here's an example of what I'm dealing with (redacting some stuff, of course).
Aug 19 09:34:36 A.B.C.D srcip=A.B.C.D fac=authpriv pri=notice sudo: USER : TTY=pts/8 ; PWD=/var/log ; USER=root ; COMMAND=/bin/grep ssh messages
Aug 19 09:34:36 A.B.C.D srcip=W.X.Y.Z fac=authpriv pri=notice sudo: USER : TTY=pts/8 ; PWD=/var/log ; USER=root ; COMMAND=/bin/grep ssh messages
These are clearly the same event, but the log is coming to Splunk from both A.B.C.D (the client) and W.X.Y.Z (a syslog server).
I initially hypothesized that everything of facility authpriv was being duplicated, but that doesn't seem to be the case; at least, I haven't been able to verify it.
So, again, what I'm looking for is a way to find events like this. A plain diff won't work because the events differ slightly, but I need to find all of our duplicates so I can take steps to cut out the second instance of each log.
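For reference, the offline workflow described above (strip the srcip field, then compare what's left) can be sketched in Python. The sample lines are abbreviated versions of the example events, and the `srcip=<value>` field format is an assumption taken from them:

```python
import re
from collections import defaultdict

def find_duplicates(lines):
    """Group log lines that become identical once the srcip field is removed."""
    seen = defaultdict(list)
    for line in lines:
        # Drop the srcip=<value> field (and any trailing whitespace after it)
        normalized = re.sub(r"srcip=\S+\s*", "", line.rstrip("\n"))
        seen[normalized].append(line)
    # Keep only normalized lines that occurred more than once
    return {text: originals for text, originals in seen.items() if len(originals) > 1}

# Abbreviated sample events (assumed format, based on the examples above)
lines = [
    "Aug 19 09:34:36 A.B.C.D srcip=A.B.C.D fac=authpriv pri=notice sudo: ...",
    "Aug 19 09:34:36 A.B.C.D srcip=W.X.Y.Z fac=authpriv pri=notice sudo: ...",
]
duplicates = find_duplicates(lines)
```

Here both sample lines collapse to the same normalized text, so they are reported as one duplicate group.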
I see. Then this might do it:
... | rex "^(?<text1>.*?srcip=)(?<srcip>\S+)(?<text2>.*)" | eval text=text1.text2 | stats count(srcip) as c values(srcip) by text | where c>1
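Outside of Splunk, the logic of that search — extract the srcip value, rebuild the event text without it, then count distinct srcip values per text — can be sketched roughly like this in Python (the sample events are abbreviated assumptions; the regex mirrors the rex expression above):

```python
import re
from collections import defaultdict

# Hypothetical abbreviated events; the regex mirrors the rex expression,
# splitting each event into the text before srcip=, the srcip value
# itself, and the rest of the event.
events = [
    "Aug 19 09:34:36 A.B.C.D srcip=A.B.C.D fac=authpriv pri=notice sudo: ...",
    "Aug 19 09:34:36 A.B.C.D srcip=W.X.Y.Z fac=authpriv pri=notice sudo: ...",
]

pattern = re.compile(r"^(?P<text1>.*?srcip=)(?P<srcip>\S+)(?P<text2>.*)")

# Like `stats ... by text`: group by the event text with the srcip
# value removed, collecting the distinct srcip values seen for each.
srcips_by_text = defaultdict(set)
for event in events:
    m = pattern.match(event)
    if m:
        text = m.group("text1") + m.group("text2")  # eval text=text1.text2
        srcips_by_text[text].add(m.group("srcip"))

# Like `where c>1`: keep only texts reported under more than one srcip
dupes = {text: ips for text, ips in srcips_by_text.items() if len(ips) > 1}
```

Each surviving entry is an event text that arrived from more than one source IP — exactly the duplicates you're after.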
The transaction approach can work, but don't use maxpause=1s; use maxspan=1s instead. The difference is that maxpause limits the time between consecutive events, while maxspan=1s means that the total duration of the transaction cannot exceed 1 second.
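A rough sketch of the difference (this is an illustration of the two parameters, not Splunk's actual implementation), assuming events are represented as sorted epoch timestamps:

```python
def group_events(timestamps, maxpause=None, maxspan=None):
    """Group sorted epoch timestamps into transactions.

    maxpause limits the gap between consecutive events, so a long chain
    of closely spaced events can grow into one huge transaction;
    maxspan caps the total duration of the transaction itself.
    """
    groups = []
    for ts in sorted(timestamps):
        if groups:
            current = groups[-1]
            pause_ok = maxpause is None or ts - current[-1] <= maxpause
            span_ok = maxspan is None or ts - current[0] <= maxspan
            if pause_ok and span_ok:
                current.append(ts)
                continue
        groups.append([ts])  # start a new transaction
    return groups

# Events arriving one second apart: maxpause=1 chains them all into a
# single transaction, while maxspan=1 keeps each transaction to <= 1s.
ts = [0, 1, 2, 3, 4]
print(group_events(ts, maxpause=1))  # [[0, 1, 2, 3, 4]]
print(group_events(ts, maxspan=1))   # [[0, 1], [2, 3], [4]]
```

That runaway chaining is exactly why a maxpause-based transaction can balloon when duplicates keep trickling in.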
I've determined that there exist duplicate lines, and I'm trying to determine how many duplicates I have, or to find any information about them that could help reduce them. Also, I'm certain they are duplicates because the timestamps don't differ at all and they log the same activity on the same machine (for example, two logs of a user su'ing to root).
I suppose I also don't understand: do the individual events have timestamps that differ by a second? I should also note that log lines are inherently extremely similar, differing only by a field or two. So I ask: are there other fields in your data (a GUID or sessionid, e.g.) that indicate they are the same event? If so, it seems more productive to focus on the identifying field values than on the differing ones.
I don't understand your question. Are you trying to find duplicate lines (and it sounds to me like you've already determined that there are duplicate lines) or are you trying to group together sets of lines and then see if the entire set is the same as another set?
I tried piping my search to transaction with a maxpause of 1s, since the duplicates seem to come in at the same time, but that led to enormous transactions that didn't really alleviate the situation.