The following two logical events are supposed to be one as they share the same threadId
- but instead they are split into two by my transaction command below:
Jan 22 10:33:56 smtp1 postfix/smtpd[23895]: 3AB561005181: client=[REDACTED]
host = smtp1 source = /var/log/maillog sourcetype = postfix_syslog
...
Jan 22 10:33:56 smtp1 postfix/cleanup[24020]: 3AB561005181: message-id=<521185171.8189.**********.JavaMail.levels.[REDACTED]
Jan 22 10:33:56 smtp1 postfix/qmgr[1517]: 3AB561005181: from=[REDACTED], size=2600, nrcpt=1 (queue active)
Jan 22 10:33:56 smtp1 postfix/smtp[24023]: 3AB561005181: to=[REDACTED], relay=[REDACTED], delay=0.03, delays=0/0/0.01/0.02, dsn=2.0.0, status=sent (250 ok: Message 58706800 accepted)
Jan 22 10:33:56 smtp1 postfix/qmgr[1517]: 3AB561005181: removed
host = smtp1 source = /var/log/maillog sourcetype = postfix_syslog
(Most other events are correctly grouped but a number aren't.)
How would you modify the following command to correctly group events?
sourcetype="postfix_syslog" threadId=*
| transaction "threadId", "host", "client" maxspan=24h maxpause=4h keepevicted=true startswith="client\="
| search threadId=3AB561005181
Note: if I add message-id
to the list of fields in transaction
, the events are grouped properly - but some others break. Why?
Generally, please share the logic of how transactions should be crafted for correct behavior.
Thank you!
(Splunk Enterprise 7.1.2)
Its not totally clear from your example, but the second group of logs above do not appear to contain any fields for 'client'
transaction "threadId", "host", "client"
will group all events which share ALL 3 of those common values.
In your example, I can see that thread and host would match in all cases, but client may not.
If only some of your events contained a client id and some did not, you would naturually expect transaction to group them into two seperate transactions.
Nick, I'd be the first to admit that transaction
's behavior is not something I fully - or even remotely - understand, nor that I fully understand its documentation - i.e. how it's supposed to behave. That said, please take a look at "Specifying multiple fields" section in Transaction Search Reference to see that the logical event defined by transaction
's output is not supposed to comprise of events that each has all of the fields. It's not the case in my testing, and it's not how it's supposed to behave, per that article.
On a side note, when I run the same exact transaction
command on the same group of events I did when I posted my question, it now correctly groups those events.
And no, I have no clue what's happening but one of the possibilities is that transaction
is in "alpha" version - i.e. unpredictable behavior that is only partially documented. That in a way makes sense given how many people are suggesting not to use it, and use stats
instead.
And yes, you're right, client=
happens only once per group - that's the way Postfix logs events.
Yes, its not immediately obvious from the docs - specifying 3 fields does not mean all 3 fields must exist, but IF they do, then the transaction will group based on them: (emphasis added)
"IF there is a transitive relationship between the fields in the fields list AND if the related events appear in the CORRECT sequence..."
In your case, host and trace appear in each event, but client is sporadic. maxspan could therefore become critical in satisfying a match.
Do you have satisfactory results by ommiting client?
I can't even reproduce the unexpected behavior in my original search - so all bets are off... 🙂
Like I said simply re-running my original search produces expected (correct) results today - I'll have to find another anomaly where the results are unexpected and can be reliably reproduced. But even before that I'll likely have to upgrade our instance to the latest version (from 7.1.2 to 8.0.1) just in case my issue is the result of a bug that's been squashed in subsequent releases.
Oh and the kicker is that some of the Postfix threads can run for weeks... Non-deliverable emails are deferred, then retried, then deferred for longer and so on... So maxspan
and maxpause
are of limited use here despite relative simplicity of the events.
If you have lots of long running open transactions (ie your search window is large), you could be hitting memory limits causing evictions - your example above is all the same timeframe, but it could be contributing.
I hear you - in that case Splunk normally produces a warning that I am on the lookout for - but I have yet to see it running out of memory - and I did try to run it into the ground... 🙂
sourcetype="postfix_syslog" threadId=*
| stats values(_time) as _time values(host) as host values(size) as size values(client) as client values(from) as from values(to) as to values(status) as status range(_time) as duration count as eventcount values(relay) as relay values(message_id) as message_id by threadId
| table _time host threadId size client from to status duration eventcount relay message_id
hi, @mitag
This may be same results. try this.
Not the same results, with some errors and generally I'd like to learn to use transaction
correctly vs. using stats
. Is there a reason I consistently get advice to use stats
when I ask questions about transaction
(performance aside)?
yes. transaction
is slow?
if it is not, I'm sorry.
with some errors
two days values are appeared?
some others break. Why?
maybe,the order of events with same time stamp is problem, startswith="client\="
will not work.
deferred email threads can last several days and have something like this in the _time
field:
1579731986,1579736186,1579740387,1579744587,1579748788,1579752988,1579757188,1579761388,1579765588,1579769789,1579773989,1579778189,1579782389,1579786589,1579790789,1579794989,1579799189,1579799190,1579803390,1579807590,1579811790,1579815991
...
startswith="client\="
will not work.
Any idea why? Seems to be working in most cases and it seems all postfix smtp threads start that way...
Logs are processed in the order they come in, so the order may be mixed.
Except:
if I add message-id to the list of fields in transaction, the events are grouped properly
So it's still a mystery to me why it works the way it does.
Isn't that because it's an appropriate field to link away time?
By the way, why put it together?
If you just search, you don't have to.
Hi @mitag
transaction
is necessary?
threadId
is extracted. what's problem?
The goal is to extract a set of additional fields that are only available in transaction
output. E.g.:
| table _time host threadId size client from to status duration eventcount relay message_id
(One could probably do something similar - and faster - via stats
- yet my goal is to group events related to a single Postfix transaction together for subsequent analysis - which only transaction
can do.)