Solved: Differences between two sourcetype according to a ...

lufermalgo · ‎09-20-2014

I need to know what events are on the sourcetype A that are not in the sourcetype B.

the query must evaluate more than 1 million events

consecutive   consecutive
000234456     000234456
000977832     000977832
000383748     000183846
000773732     000773732
000859484     000339274
000272849     000646473
000263664     000263664
000183846
000339274
000646473
sourcetypeA   sourcetypeB

martin_mueller · ‎09-20-2014

Replacing the count with a dc(sourcetype) would lift that restriction... however, the question implies that events in B but not in A aren't interesting, which would not be distinguishable with count or dc(sourcetype) alone. Here's how to solve that:

sourcetype=A OR sourcetype=B | stats values(sourcetype) as sourcetypes by consecutive | search sourcetypes=A NOT sourcetypes=B

Note, this will work for millions of events but might break, slow down, or at least consume a lot of memory for millions of distinct values of consecutive... stats basically has to keep millions of buckets around in that case.
Is there any relationship between the timestamps of A and B? If so, using transaction with short spans might use massively less memory:

sourcetype=A OR sourcetype=B | transaction consecutive maxspan=1m keepevicted=t | search sourcetype=A NOT sourcetypes=B

In theory there's also this:

sourcetype=A NOT [search sourcetype=B | dedup consecutive | fields consecutive]

But that again might (will) break for millions of distinct consecutive values.

View solution in original post

lufermalgo · ‎09-20-2014

Thanks @martin_mueller and @gpradeepkumarreddy for their contributions.

Take one of the recommendations of @martin_mueller and Realize the following query:

index=ilt sourcetype=boleta_generada OR sourcetype=boleta_indexado TIPO_DOCUMENTO="BE PEDIDO" CAMPANA>="201411" | dedup CONSECUTIVO WF_ESTADO | eventstats values(sourcetype) as sourcetypes by CONSECUTIVO | search sourcetypes=boleta_generada NOT sourcetypes=boleta_indexado | stats count

Effectively this give me the expected result but equally was very slow to deliver the answer.

Martin_mueller question about whether I could create a transaction with the transaction command, could not, as the events in the sourcetype=boleta_indexado can crearce after many days.

I want to share with you that I did another query graph and I'm getting basically what I would like to integrate other bar to show me the difference:

index=ilt sourcetype=boleta_generada OR sourcetype=boleta_indexado TIPO_DOCUMENTO="BE PEDIDO" (CAMPANA>="201408") | dedup CONSECUTIVO WF_ESTADO | chart count over CAMPANA by WF_ESTADO | sort -WF_ESTADO

chart:
https://drive.google.com/file/d/0B1HsOnBT01xZWDcyY2dTNUUwZFE/edit?usp=sharing

The WF_ESTADO field contains the values:

GENERATED
INDEXED

I would like the other bar that was called MISSING paint.

martin_mueller · ‎09-20-2014

Replacing the count with a dc(sourcetype) would lift that restriction... however, the question implies that events in B but not in A aren't interesting, which would not be distinguishable with count or dc(sourcetype) alone. Here's how to solve that:

sourcetype=A OR sourcetype=B | stats values(sourcetype) as sourcetypes by consecutive | search sourcetypes=A NOT sourcetypes=B

Note, this will work for millions of events but might break, slow down, or at least consume a lot of memory for millions of distinct values of consecutive... stats basically has to keep millions of buckets around in that case.
Is there any relationship between the timestamps of A and B? If so, using transaction with short spans might use massively less memory:

sourcetype=A OR sourcetype=B | transaction consecutive maxspan=1m keepevicted=t | search sourcetype=A NOT sourcetypes=B

In theory there's also this:

sourcetype=A NOT [search sourcetype=B | dedup consecutive | fields consecutive]

But that again might (will) break for millions of distinct consecutive values.

pradeepkumarg · ‎09-20-2014

if the value exists in both the sourcetypes do they exist more than once in each sourcetype? If not, below query might help you

sourcetype=sourcetypeA OR sourcetype=sourcetypeB | stats count by consecutive | search count < 2

Differences between two sourcetype according to a field

Introducing the 2024 SplunkTrust!

Introducing the 2024 Splunk MVPs!

Splunk Custom Visualizations App End of Life