Re: Issue when joining results of large tstats que...

moystard · ‎05-11-2017

Hello everyone,

I am currently struggling with Splunk limitations when it comes to joining two queries handling very large datasets (the two sets are two large to hold within the limits.conf limitations and I only get partial results).

I am trying to find out for each entity if a first property is true for each hour, and if this first property is true, find out in these valid hours the value of a second property to finally perform its sum per entity. Each query information is present in a dedicated datamodel, which I need to join using the entity id + the time (to get the hourly result). Unfortunately, the query gives me only partial results since the first data model contains more than 2 millions events, while the second data model contains approximately 35 millions. The computation is done over a week.

The result of these two queries is then summed up to provide the count of hours that contain the second property per entity.

Below is what I am trying to achieve:

| tstats values(main.is_valid) as is_valid from datamodel=validity where nodename=main AND main.is_valid=true groupby _time span=1h, main.entity
| rename main.entity as entity
| eval is_valid=if(is_valid LIKE "%true%", 1, 0)
| fields _time, entity, is_valid
| join _time, entity type=left [
    | tstats count as second_events from datamodel=secondary where nodename=main groupby _time span=1h, main.entity
    | rename main.entity as entity
    | eval is_true=if(second_events > 0, 1, 0)
    | fields _time, entity, is_true
]
| fillnull is_true
| fields _time, entity, is_valid, is_true
| stats sum(is_true) by entity

I might not be using the best possible approach or might be missing something. I would be extremely glad if you could provide me with suggestions to achieve what I am trying to do.

Thank you very much!

woodcock · ‎05-11-2017

Have you tried tstats append=t?
If that won't work for you (I am not sure about limits there), then there is only one way to losslessly join datasets, do it like this:

| tstats values(main.is_valid) as is_valid from datamodel=validity where nodename=main AND main.is_valid=true groupby _time span=1h, main.entity
| rename main.entity as entity
| eval is_valid=if(is_valid LIKE "%true%", 1, 0)
| eval DATASET="1"
| fields _time, entity, is_valid DATASET
| appendpipe [
    | tstats count as second_events from datamodel=secondary where nodename=main groupby _time span=1h, main.entity
    | rename main.entity as entity
    | eval is_true=if(second_events > 0, 1, 0)
    | eval DATASET="2"
    | fields _time, entity, is_valid DATASET
]
| fillnull is_true
| stats values(*) AS * dc(DATASET) AS numDatasets BY _time entity
| rename COMMENT AS "put your join logic here: use '|search DATASET=1' for left join, '|search DATASET=2' for right join, '|search numDatasets=1' for disunion, '|search numDatasets=2 for intersection"
| fields _time, entity, is_valid, is_true
| stats sum(is_true) by entity

prats84 · ‎07-17-2017

You might want to remove the prestats=t from the second tstats.

woodcock · ‎12-04-2017

It is required when using append=t.

moystard · ‎05-12-2017

Thank you very much for your answer.

Unfortunately, your query does not work, when I search DATASET=2, it produces no result (it looks like only the first dataset is kept). I obviously tried to run the query separately and it outputs results properly, only when appendpipe is used that it does not work.

I have also tried to use prestats like below but it does not output any other properties than _time and mount_id:

| tstats prestats=t values(main.valid_event) as is_valid_hour from datamodel=hour_validity where nodename=main AND main.is_valid_hour=true groupby _time span=1h, main.entity 
| tstats prestats=t append=t count as true_events from datamodel=true_events where nodename=main groupby _time span=1h, main.entity 
| table _time, entity, is_valid_hour, true_events

woodcock · ‎07-07-2017

Your situation makes no sense to me. Are you sure that you are describing everything accurately?

Issue when joining results of large tstats queries over time.

Detecting Remote Code Executions With the Splunk Threat Research Team

Observability | Use Synthetic Monitoring for Website Metadata Verification

More Ways To Control Your Costs With Archived Metrics | Register for Tech Talk