Splunk Search

Issue when joining results of large tstats queries over time.

moystard
New Member

Hello everyone,

I am currently struggling with Splunk limitations when it comes to joining two queries handling very large datasets (the two sets are too large to fit within the limits.conf subsearch limits, so I only get partial results).

I am trying to determine, for each entity and each hour, whether a first property is true; for the hours where it is true, I then look up the value of a second property and finally sum it per entity. Each piece of information lives in a dedicated datamodel, which I need to join on the entity id plus the time (to get the hourly result). Unfortunately, the query gives me only partial results, since the first data model contains more than 2 million events while the second contains approximately 35 million. The computation is done over a week.

The results of the two queries are then combined to give, per entity, the count of valid hours that contain the second property.

Below is what I am trying to achieve:

| tstats values(main.is_valid) as is_valid from datamodel=validity where nodename=main AND main.is_valid=true groupby _time span=1h, main.entity
| rename main.entity as entity
| eval is_valid=if(is_valid LIKE "%true%", 1, 0)
| fields _time, entity, is_valid
| join type=left _time, entity [
    | tstats count as second_events from datamodel=secondary where nodename=main groupby _time span=1h, main.entity
    | rename main.entity as entity
    | eval is_true=if(second_events > 0, 1, 0)
    | fields _time, entity, is_true
]
| fillnull value=0 is_true
| fields _time, entity, is_valid, is_true
| stats sum(is_true) by entity
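For reference, the logic the query above is after is a per-(hour, entity) left join followed by a per-entity sum. A minimal Python sketch of that logic (the hour and entity values here are made-up sample data, not from the real datamodels):

```python
from collections import defaultdict

# Hourly results from the first datamodel: (hour, entity) pairs where is_valid is true
valid_hours = {("2024-01-01T00", "A"), ("2024-01-01T01", "A"), ("2024-01-01T00", "B")}

# Hourly event counts from the second datamodel, keyed by (hour, entity)
second_events = {("2024-01-01T00", "A"): 5, ("2024-01-01T02", "A"): 3}

# Left join on (hour, entity): keep every valid hour, mark is_true when the
# second datamodel has events for that hour (fillnull -> 0), then sum per entity
hours_with_events = defaultdict(int)
for key in valid_hours:
    hour, entity = key
    is_true = 1 if second_events.get(key, 0) > 0 else 0
    hours_with_events[entity] += is_true
```

With this sample data, entity A ends up with 1 valid hour containing events and entity B with 0.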

I might not be using the best possible approach or might be missing something. I would be extremely glad if you could provide me with suggestions to achieve what I am trying to do.

Thank you very much!

0 Karma

woodcock
Esteemed Legend

Have you tried tstats append=t?
If that won't work for you (I am not sure about the limits there), then there is only one way to losslessly join datasets; do it like this:

| tstats values(main.is_valid) as is_valid from datamodel=validity where nodename=main AND main.is_valid=true groupby _time span=1h, main.entity
| rename main.entity as entity
| eval is_valid=if(is_valid LIKE "%true%", 1, 0)
| eval DATASET="1"
| fields _time, entity, is_valid DATASET
| appendpipe [
    | tstats count as second_events from datamodel=secondary where nodename=main groupby _time span=1h, main.entity
    | rename main.entity as entity
    | eval is_true=if(second_events > 0, 1, 0)
    | eval DATASET="2"
    | fields _time, entity, is_true DATASET
]
| fillnull value=0 is_true
| stats values(*) AS * dc(DATASET) AS numDatasets BY _time entity
| rename COMMENT AS "put your join logic here: use '|search DATASET=1' for left join, '|search DATASET=2' for right join, '|search numDatasets=1' for disunion, '|search numDatasets=2' for intersection"
| fields _time, entity, is_valid, is_true
| stats sum(is_true) by entity
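The idea behind this pattern, independent of SPL: tag every row with its source, union all rows, group by the join key while merging fields and counting distinct sources, then filter on which sources contributed. A rough Python equivalent (the sample rows are hypothetical):

```python
from collections import defaultdict

# Rows from each source, tagged with a DATASET marker (as in the SPL above)
rows = [
    {"key": ("h0", "A"), "is_valid": 1, "DATASET": "1"},
    {"key": ("h0", "A"), "is_true": 1, "DATASET": "2"},
    {"key": ("h1", "B"), "is_valid": 1, "DATASET": "1"},
]

# stats values(*) BY key: merge all fields per key, tracking which sources contributed
merged = defaultdict(dict)
datasets = defaultdict(set)
for row in rows:
    key = row["key"]
    datasets[key].add(row["DATASET"])
    merged[key].update({k: v for k, v in row.items() if k not in ("key", "DATASET")})

# The "join logic" step: filter on the source markers
inner = {k: v for k, v in merged.items() if len(datasets[k]) == 2}  # intersection
left = {k: v for k, v in merged.items() if "1" in datasets[k]}      # left join
```

Because the union is a single result stream rather than a subsearch, it avoids the subsearch row limits that make `join` lossy on large datasets.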
0 Karma

prats84
Explorer

You might want to remove the prestats=t from the second tstats.

0 Karma

woodcock
Esteemed Legend

It is required when using append=t.

0 Karma

moystard
New Member

Thank you very much for your answer.

Unfortunately, your query does not work: when I search DATASET=2, it produces no results (it looks like only the first dataset is kept). I did try running the subsearch separately, and it outputs results properly; it is only when appendpipe is used that it does not work.

I have also tried to use prestats as below, but it does not output any properties other than _time and mount_id:

| tstats prestats=t values(main.valid_event) as is_valid_hour from datamodel=hour_validity where nodename=main AND main.is_valid_hour=true groupby _time span=1h, main.entity 
| tstats prestats=t append=t count as true_events from datamodel=true_events where nodename=main groupby _time span=1h, main.entity 
| table _time, entity, is_valid_hour, true_events
0 Karma

woodcock
Esteemed Legend

Your situation makes no sense to me. Are you sure that you are describing everything accurately?

0 Karma