transaction: inconsistent results

mitag · ‎01-22-2020

The following two logical events are supposed to be one as they share the same threadId - but instead they are split into two by my transaction command below:

Jan 22 10:33:56 smtp1 postfix/smtpd[23895]: 3AB561005181: client=[REDACTED]
host = smtp1 source = /var/log/maillog sourcetype = postfix_syslog

...

Jan 22 10:33:56 smtp1 postfix/cleanup[24020]: 3AB561005181: message-id=<521185171.8189.**********.JavaMail.levels.[REDACTED]
Jan 22 10:33:56 smtp1 postfix/qmgr[1517]: 3AB561005181: from=[REDACTED], size=2600, nrcpt=1 (queue active)
Jan 22 10:33:56 smtp1 postfix/smtp[24023]: 3AB561005181: to=[REDACTED], relay=[REDACTED], delay=0.03, delays=0/0/0.01/0.02, dsn=2.0.0, status=sent (250 ok:  Message 58706800 accepted)
Jan 22 10:33:56 smtp1 postfix/qmgr[1517]: 3AB561005181: removed
host = smtp1 source = /var/log/maillog sourcetype = postfix_syslog

(Most other events are correctly grouped but a number aren't.)

How would you modify the following command to correctly group events?

sourcetype="postfix_syslog" threadId=*
| transaction "threadId", "host", "client" maxspan=24h maxpause=4h keepevicted=true startswith="client\="
| search threadId=3AB561005181

Note: if I add message-id to the list of fields in transaction, the events are grouped properly - but some others break. Why?

Generally, please share the logic of how transactions should be crafted for correct behavior.

Thank you!

(Splunk Enterprise 7.1.2)

nickhills · ‎01-26-2020

Its not totally clear from your example, but the second group of logs above do not appear to contain any fields for 'client'

transaction "threadId", "host", "client" will group all events which share ALL 3 of those common values.
In your example, I can see that thread and host would match in all cases, but client may not.

If only some of your events contained a client id and some did not, you would naturually expect transaction to group them into two seperate transactions.

If my comment helps, please give it a thumbs up!

mitag · ‎01-27-2020

Nick, I'd be the first to admit that transaction's behavior is not something I fully - or even remotely - understand, nor that I fully understand its documentation - i.e. how it's supposed to behave. That said, please take a look at "Specifying multiple fields" section in Transaction Search Reference to see that the logical event defined by transaction's output is not supposed to comprise of events that each has all of the fields. It's not the case in my testing, and it's not how it's supposed to behave, per that article.

On a side note, when I run the same exact transaction command on the same group of events I did when I posted my question, it now correctly groups those events.

And no, I have no clue what's happening but one of the possibilities is that transaction is in "alpha" version - i.e. unpredictable behavior that is only partially documented. That in a way makes sense given how many people are suggesting not to use it, and use stats instead.

And yes, you're right, client= happens only once per group - that's the way Postfix logs events.

nickhills · ‎01-27-2020

Yes, its not immediately obvious from the docs - specifying 3 fields does not mean all 3 fields must exist, but IF they do, then the transaction will group based on them: (emphasis added)

"IF there is a transitive relationship between the fields in the fields list AND if the related events appear in the CORRECT sequence..."

In your case, host and trace appear in each event, but client is sporadic. maxspan could therefore become critical in satisfying a match.

Do you have satisfactory results by ommiting client?

If my comment helps, please give it a thumbs up!

mitag · ‎01-27-2020

I can't even reproduce the unexpected behavior in my original search - so all bets are off... 🙂

Like I said simply re-running my original search produces expected (correct) results today - I'll have to find another anomaly where the results are unexpected and can be reliably reproduced. But even before that I'll likely have to upgrade our instance to the latest version (from 7.1.2 to 8.0.1) just in case my issue is the result of a bug that's been squashed in subsequent releases.

Oh and the kicker is that some of the Postfix threads can run for weeks... Non-deliverable emails are deferred, then retried, then deferred for longer and so on... So maxspan and maxpause are of limited use here despite relative simplicity of the events.

nickhills · ‎01-27-2020

If you have lots of long running open transactions (ie your search window is large), you could be hitting memory limits causing evictions - your example above is all the same timeframe, but it could be contributing.

If my comment helps, please give it a thumbs up!

mitag · ‎01-27-2020

I hear you - in that case Splunk normally produces a warning that I am on the lookout for - but I have yet to see it running out of memory - and I did try to run it into the ground... 🙂

to4kawa · ‎01-23-2020

sourcetype="postfix_syslog" threadId=*
| stats values(_time) as _time values(host) as host values(size) as size values(client) as client values(from) as from values(to) as to values(status) as status range(_time) as duration count as eventcount values(relay) as relay values(message_id) as message_id by threadId
| table _time host threadId size client from to status duration eventcount relay message_id

hi, @mitag
This may be same results. try this.

mitag · ‎01-23-2020

Not the same results, with some errors and generally I'd like to learn to use transaction correctly vs. using stats. Is there a reason I consistently get advice to use stats when I ask questions about transaction (performance aside)?

to4kawa · ‎01-23-2020

yes. transaction is slow?
if it is not, I'm sorry.
with some errors

two days values are appeared?

to4kawa · ‎01-23-2020

some others break. Why?
maybe,the order of events with same time stamp is problem, startswith="client\="will not work.

mitag · ‎01-23-2020

deferred email threads can last several days and have something like this in the _time field:

1579731986,1579736186,1579740387,1579744587,1579748788,1579752988,1579757188,1579761388,1579765588,1579769789,1579773989,1579778189,1579782389,1579786589,1579790789,1579794989,1579799189,1579799190,1579803390,1579807590,1579811790,1579815991
...

startswith="client\=" will not work.

Any idea why? Seems to be working in most cases and it seems all postfix smtp threads start that way...

to4kawa · ‎01-24-2020

Logs are processed in the order they come in, so the order may be mixed.

mitag · ‎01-24-2020

Except:

if I add message-id to the list of fields in transaction, the events are grouped properly

So it's still a mystery to me why it works the way it does.

to4kawa · ‎01-24-2020

Isn't that because it's an appropriate field to link away time?
By the way, why put it together?
If you just search, you don't have to.

to4kawa · ‎01-23-2020

Hi @mitag
transaction is necessary?
threadId is extracted. what's problem?

mitag · ‎01-23-2020

The goal is to extract a set of additional fields that are only available in transaction output. E.g.:

| table _time host threadId size client from to status duration eventcount relay message_id

(One could probably do something similar - and faster - via stats - yet my goal is to group events related to a single Postfix transaction together for subsequent analysis - which only transaction can do.)

transaction: inconsistent results

More Control Over Your Monitoring Costs with Archived Metrics!

New in Observability Cloud - Explicit Bucket Histograms

Updated Team Landing Page in Splunk Observability