Hello everyone,
I am currently struggling with Splunk limitations when it comes to joining two queries handling very large datasets (the two sets are two large to hold within the limits.conf limitations and I only get partial results).
I am trying to find out for each entity if a first property is true for each hour, and if this first property is true, find out in these valid hours the value of a second property to finally perform its sum per entity. Each query information is present in a dedicated datamodel, which I need to join using the entity id + the time (to get the hourly result). Unfortunately, the query gives me only partial results since the first data model contains more than 2 millions events, while the second data model contains approximately 35 millions. The computation is done over a week.
The result of these two queries is then summed up to provide the count of hours that contain the second property per entity.
Below is what I am trying to achieve:
| tstats values(main.is_valid) as is_valid from datamodel=validity where nodename=main AND main.is_valid=true groupby _time span=1h, main.entity
| rename main.entity as entity
| eval is_valid=if(is_valid LIKE "%true%", 1, 0)
| fields _time, entity, is_valid
| join _time, entity type=left [
| tstats count as second_events from datamodel=secondary where nodename=main groupby _time span=1h, main.entity
| rename main.entity as entity
| eval is_true=if(second_events > 0, 1, 0)
| fields _time, entity, is_true
]
| fillnull is_true
| fields _time, entity, is_valid, is_true
| stats sum(is_true) by entity
I might not be using the best possible approach or might be missing something. I would be extremely glad if you could provide me with suggestions to achieve what I am trying to do.
Thank you very much!
Have you tried tstats append=t
?
If that won't work for you (I am not sure about limits there), then there is only one way to losslessly join datasets, do it like this:
| tstats values(main.is_valid) as is_valid from datamodel=validity where nodename=main AND main.is_valid=true groupby _time span=1h, main.entity
| rename main.entity as entity
| eval is_valid=if(is_valid LIKE "%true%", 1, 0)
| eval DATASET="1"
| fields _time, entity, is_valid DATASET
| appendpipe [
| tstats count as second_events from datamodel=secondary where nodename=main groupby _time span=1h, main.entity
| rename main.entity as entity
| eval is_true=if(second_events > 0, 1, 0)
| eval DATASET="2"
| fields _time, entity, is_valid DATASET
]
| fillnull is_true
| stats values(*) AS * dc(DATASET) AS numDatasets BY _time entity
| rename COMMENT AS "put your join logic here: use '|search DATASET=1' for left join, '|search DATASET=2' for right join, '|search numDatasets=1' for disunion, '|search numDatasets=2 for intersection"
| fields _time, entity, is_valid, is_true
| stats sum(is_true) by entity
You might want to remove the prestats=t from the second tstats.
It is required when using append=t
.
Thank you very much for your answer.
Unfortunately, your query does not work, when I search DATASET=2, it produces no result (it looks like only the first dataset is kept). I obviously tried to run the query separately and it outputs results properly, only when appendpipe is used that it does not work.
I have also tried to use prestats like below but it does not output any other properties than _time and mount_id:
| tstats prestats=t values(main.valid_event) as is_valid_hour from datamodel=hour_validity where nodename=main AND main.is_valid_hour=true groupby _time span=1h, main.entity
| tstats prestats=t append=t count as true_events from datamodel=true_events where nodename=main groupby _time span=1h, main.entity
| table _time, entity, is_valid_hour, true_events
Your situation makes no sense to me. Are you sure that you are describing everything accurately?