Splunk Search

How can we improve the performance of an Hunk's query?

ddrillic
Ultra Champion

We have the following Hunk query -

index=<claims_table> claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt>=2015* fst_srvc_dt<=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000 
| eval prov_group=if(prov_tin=<tin1> OR prov_tin=<tin2> OR prov_tin=<tin3> OR prov_tin=<tin4> OR prov_tin=<tin5> OR prov_tin=<tin6> OR prov_tin=<tin7> OR  prov_tin=<tin8> OR prov_tin=<tin9> OR prov_tin=<tin10> OR prov_tin=<tin11> OR prov_tin=<tin12>,"montefiore"," all others") 
| stats count as count ,mean(net_pd_amt) as avg by drg_code prov_group 
| eval avg=round(avg,0)
| eventstats sum(count) as total by drg_code 
| sort 20 -total 
| xyseries drg_code prov_group avg count 
| addtotals fieldname=grandtotals 
| sort -grandtotals 

The claims_table is of two billion claims and this query is running at a pace of a million claims per minute.

Any ideas how to improve it?

Tags (3)
0 Karma
1 Solution

Claw
Splunk Employee
Splunk Employee

You are assigning field results to reserved words.

As in

| stats count as count ,mean(net_pd_amt) as avg by drg_code prov_group
below I have changed it to this

| stats count as mycount ,mean(net_pd_amt) as myavg by drg_code prov_group

Try this instead.

index=<claims_table> claim_classification=INPATIENT 
OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt>=2015* fst_srvc_dt<=2016-11-29 drg_code!=NO_APR drg_code!="N/A"       drg_code!=" " drg_code!=00000 
 | eval prov_group=if(prov_tin=<tin1> OR prov_tin=<tin2> OR prov_tin=<tin3> OR prov_tin=<tin4> OR prov_tin=<tin5> OR prov_tin=<tin6> OR prov_tin=<tin7> OR  prov_tin=<tin8> OR prov_tin=<tin9> OR prov_tin=<tin10> OR prov_tin=<tin11> OR prov_tin=<tin12>,"montefiore"," all others") 
 | stats count as mycount ,mean(net_pd_amt) as myavg by drg_code prov_group 
 | eval myavg=round(myavg,0)
 | eventstats sum(mycount) as mytotal by drg_code 
 | sort 20 -mytotal 
 | xyseries drg_code prov_group myavg mycount 
 | addtotals fieldname=grandtotals 
 | sort -grandtotals

After that start from the initial search and test each step to see if you get results.

index= claim_classification=INPATIENT 
OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt>=2015* fst_srvc_dt<=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000

And then

index= claim_classification=INPATIENT 
OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt>=2015* fst_srvc_dt<=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000 
 | eval prov_group=if(prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR  prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR prov_tin=,"montefiore"," all others")

And then

index= claim_classification=INPATIENT 
OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt>=2015* fst_srvc_dt<=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000 
 | eval prov_group=if(prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR  prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR prov_tin=,"montefiore"," all others") 
 | stats count as mycount ,mean(net_pd_amt) as myavg by drg_code prov_group

until you have the entire search running.

View solution in original post

DalJeanis
Legend

Try a time trial on this, against the most recent 4 days of data -

   index=<claims_table> 
   (claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*") 
   AND (fst_srvc_dt>=2016-11-26 AND fst_srvc_dt<=2016-11-29) 
   AND (prov_tin=<tin1> OR prov_tin=<tin2> OR prov_tin=<tin3> OR prov_tin=<tin4> OR prov_tin=<tin5> 
   OR prov_tin=<tin6> OR prov_tin=<tin7> OR  prov_tin=<tin8> OR prov_tin=<tin9> OR prov_tin=<tin10> 
   OR prov_tin=<tin11> OR prov_tin=<tin12>)
   AND NOT (drg_code=NO_APR OR drg_code="N/A" OR drg_code=" " OR drg_code=00000 )
   | stats count as mycount , mean(net_pd_amt) as myavg by drg_code
   | sort 30 -mycount 
   | eval prov_group="montefiore"
   | append 
       [index=<claims_table> 
       (claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*") 
       AND (fst_srvc_dt>=2016-11-26 AND fst_srvc_dt<=2016-11-29) 
       AND NOT (prov_tin=<tin1> OR prov_tin=<tin2> OR prov_tin=<tin3> OR prov_tin=<tin4> OR prov_tin=<tin5> 
           OR prov_tin=<tin6> OR prov_tin=<tin7> OR  prov_tin=<tin8> OR prov_tin=<tin9> OR prov_tin=<tin10> 
           OR prov_tin=<tin11> OR prov_tin=<tin12>)
       AND NOT (drg_code=NO_APR OR drg_code="N/A" OR drg_code=" " OR drg_code=00000 )
       | stats count as mycount ,mean(net_pd_amt) as myavg by drg_code
       | sort 30 -mycount 
       | eval prov_group=" all others"
       ]
    | eval myavg=round(myavg,0)
    | eventstats sum(mycount) as mytotal by drg_code 
    | sort 20 -mytotal 

... against this...

  index=<claims_table> 
   (claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*") 
   AND (fst_srvc_dt>=2016-11-26 AND fst_srvc_dt<=2016-11-29) 
   AND NOT (drg_code=NO_APR OR drg_code="N/A" OR drg_code=" " OR drg_code=00000 )
   | eval prov_group=if(prov_tin=<tin1> OR prov_tin=<tin2> OR prov_tin=<tin3> OR prov_tin=<tin4> 
                     OR prov_tin=<tin5> OR prov_tin=<tin6> OR prov_tin=<tin7> OR prov_tin=<tin8>  
                     OR prov_tin=<tin9> OR prov_tin=<tin10> OR prov_tin=<tin11> OR prov_tin=<tin12>,
                    "montefiore"," all others") 
   | stats count as mycount, mean(net_pd_amt) as myavg by drg_code prov_group
   | eval myavg=round(myavg,0)
   | eventstats sum(mycount) as mytotal by drg_code 
   | sort 20 -mytotal 

.. and against this...

  index=<claims_table> 
   (claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*") 
   AND (fst_srvc_dt>=2016-11-26 AND fst_srvc_dt<=2016-11-29) 
   AND NOT (drg_code=NO_APR OR drg_code="N/A" OR drg_code=" " OR drg_code=00000 )
   | join type=left  prov_tin 
        [| makeresults | eval prov_tin="<tin1> <tin2> <tin3> <tin4> <tin5> <tin6> <tin7> <tin8> <tin9> <tin10> <tin11> <tin12>" | makemv prov_tin | mvexpand prov_tin 
        | eval prov_group="montefiore" | table prov_tin prov_group
        ]
   | eval prov_group=coalesce(prov_group," all others")
   | stats count as mycount,mean(net_pd_amt) as myavg by drg_code prov_group
   | eval myavg=round(myavg,0)
   | eventstats sum(mycount) as mytotal by drg_code 
   | sort 20 -mytotal 

Discussion -

Performance of each one of these is going to be data- and architecture-dependent.

The top one assumes that the top 20 overall have to be in the top 30 of each prov_group, and assumes that tins are an indexed field. It is hoping to save calculation overall by doing the stat work at the prov_group level. Efficiency gains will be dependent on the sparsity of the tins under inspection.

The second one is basically yours with the changes suggested by Claw.

The third one tests a lookup as a way of assigning the prov_group. Efficiency might be significantly better or worse as a join, depending on the cost of sorting compared to the cost of twelve individual compares for each record.


The real BEST solution is likely to consist of some of the following -

1) turn (drg_code=NO_APR OR drg_code="N/A" OR drg_code=" " OR drg_code=00000 ) into a single indexed field so you can use it for information at the summary level
2) create a summary index of the data you are likely to want to reuse.
3) since you only want the top 20 drg_code by overall count, collect only that information FIRST, then scan for only those. DANG, I should have coded that one. hang on...


If this works, the following should return the top 30 drug codes for a 3 month period extremely fast...

| tstats count 
where (index=<claims_table>)   
AND (claim_classification="INPATIENT" OR claim_classification="INPATIENT (PART A)*") 
AND NOT (drg_code=="NO_APR" OR drg_code=="N/A" OR drg_code==" " OR drg_code=="00000" )
AND (_time >= strptime("2016-09-01","%Y-%m-%d")) AND (_time <= strptime("2016-11-30","%Y-%m-%d"))
by drg_code 
| rename count as mycount
| sort 30 -count

...If so, then try this....

| tstats count 
where (index=<claims_table>)   
AND (claim_classification="INPATIENT" OR claim_classification="INPATIENT (PART A)*") 
AND NOT (drg_code=="NO_APR" OR drg_code=="N/A" OR drg_code==" " OR drg_code=="00000" )
AND (_time >= strptime("2015-01-01","%Y-%m-%d")) AND (_time <= strptime("2016-11-29","%Y-%m-%d"))
by drg_code 
| rename count as mycount
| sort 30 -count

...and then this...

 index=<claims_table> 
(claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*") 
AND (fst_srvc_dt>=2016-11-26 AND fst_srvc_dt<=2016-11-29) 
AND [| tstats count 
    where (index=<claims_table>)   
    AND (claim_classification="INPATIENT" OR claim_classification="INPATIENT (PART A)*") 
    AND NOT (drg_code=="NO_APR" OR drg_code=="N/A" OR drg_code==" " OR drg_code=="00000" )
    AND (_time >= strptime("2015-01-01","%Y-%m-%d")) AND (_time <= strptime("2016-11-29","%Y-%m-%d"))
      by drg_code 
    | rename count as mycount
    | sort 30 -count
    | table drg_code
    ]
| join type=left  prov_tin 
     [| makeresults | eval prov_tin="<tin1> <tin2> <tin3> <tin4> <tin5> <tin6> <tin7> <tin8> <tin9> <tin10> <tin11> <tin12>" | makemv prov_tin | mvexpand prov_tin 
     | eval prov_group="montefiore" | table prov_tin prov_group
     ]
| eval prov_group=coalesce(prov_group," all others")
| stats count as mycount,mean(net_pd_amt) as myavg by drg_code prov_group
| eval myavg=round(myavg,0)
| eventstats sum(mycount) as mytotal by drg_code 
| sort 20 -mytotal 
0 Karma

Claw
Splunk Employee
Splunk Employee

PS. This data is against Hadoop Data via Hunk and there are no date time indexes. So the beautiful append is you wrote in the first example is going to take in this environment probably 3 or 4 hours to run.

0 Karma

DalJeanis
Legend

ROFL. Well, then... I'd have to go back to solution 2 - create a summary index of whatever part of this stuff is likely to be reusable. I'd probably tend to aggregate it to the tin/drug-cd/day level at each unit (whatever term hadoop refers to an indexer or a chunk of distributed database)

0 Karma

Claw
Splunk Employee
Splunk Employee

You are assigning field results to reserved words.

As in

| stats count as count ,mean(net_pd_amt) as avg by drg_code prov_group
below I have changed it to this

| stats count as mycount ,mean(net_pd_amt) as myavg by drg_code prov_group

Try this instead.

index=<claims_table> claim_classification=INPATIENT 
OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt>=2015* fst_srvc_dt<=2016-11-29 drg_code!=NO_APR drg_code!="N/A"       drg_code!=" " drg_code!=00000 
 | eval prov_group=if(prov_tin=<tin1> OR prov_tin=<tin2> OR prov_tin=<tin3> OR prov_tin=<tin4> OR prov_tin=<tin5> OR prov_tin=<tin6> OR prov_tin=<tin7> OR  prov_tin=<tin8> OR prov_tin=<tin9> OR prov_tin=<tin10> OR prov_tin=<tin11> OR prov_tin=<tin12>,"montefiore"," all others") 
 | stats count as mycount ,mean(net_pd_amt) as myavg by drg_code prov_group 
 | eval myavg=round(myavg,0)
 | eventstats sum(mycount) as mytotal by drg_code 
 | sort 20 -mytotal 
 | xyseries drg_code prov_group myavg mycount 
 | addtotals fieldname=grandtotals 
 | sort -grandtotals

After that start from the initial search and test each step to see if you get results.

index= claim_classification=INPATIENT 
OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt>=2015* fst_srvc_dt<=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000

And then

index= claim_classification=INPATIENT 
OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt>=2015* fst_srvc_dt<=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000 
 | eval prov_group=if(prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR  prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR prov_tin=,"montefiore"," all others")

And then

index= claim_classification=INPATIENT 
OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt>=2015* fst_srvc_dt<=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000 
 | eval prov_group=if(prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR  prov_tin= OR prov_tin= OR prov_tin= OR prov_tin= OR prov_tin=,"montefiore"," all others") 
 | stats count as mycount ,mean(net_pd_amt) as myavg by drg_code prov_group

until you have the entire search running.

ddrillic
Ultra Champion

Thank you Claw!!!

0 Karma
Get Updates on the Splunk Community!

Introducing Splunk Enterprise 9.2

WATCH HERE! Watch this Tech Talk to learn about the latest features and enhancements shipped in the new Splunk ...

Adoption of RUM and APM at Splunk

    Unleash the power of Splunk Observability   Watch Now In this can't miss Tech Talk! The Splunk Growth ...

Routing logs with Splunk OTel Collector for Kubernetes

The Splunk Distribution of the OpenTelemetry (OTel) Collector is a product that provides a way to ingest ...