Solved: What in this search causes it to take so long?

hagjos43 · ‎10-20-2014

Is there any suggestions on how to improve search time on this particular search? This search literally takes 12-15 hours to complete when searching over the past year. Other searches do not take anywhere close to this long for the same amount of time (seconds to minutes).

The search is intended to return a search indicating the total page views per month (sorted by month):

sourcetype=IIS |dedup JSESSIONID| eval month_num=strftime(_time, "%m") | stats count(Page) by month_num, date_month | sort month_num | fields - month_num

hagjos43 · ‎10-24-2014

I got it!

DC worked, but I also did it in buckets of time which significantly improved the speed. Thank you all for your input.

The final query looks like this:

sourcetype=iis| bucket _time span=1mon | eval Month=strftime(_time,"%B") |eval MN=lower(Month) | eval sort_field=case(MN=="january",1, MN=="february",2, MN=="march",3, MN=="april",4, MN=="may",5, MN=="june",6, MN=="july",7, MN=="august",8, MN=="september",9, MN=="october",10, MN=="november",11, MN=="december",12) | stats dc(JSESSIONID) as Visits by Month, sort_field | sort +sort_field | fields - sort_field

View solution in original post

sowings · ‎04-07-2015

The position of dedup in your search is painful. It a reporting command, meaning that everything after that has to be done on the SH after receiving it all from the peers. Stats can do dedup for you, as you've figured out in the accepted answer. All of the eval stuff you do in your final search is streaming, meaning it can be parallelized just fine. It was more about the position of dedup that was bogging the search down.

hagjos43 · ‎10-24-2014

I got it!

DC worked, but I also did it in buckets of time which significantly improved the speed. Thank you all for your input.

The final query looks like this:

sourcetype=iis| bucket _time span=1mon | eval Month=strftime(_time,"%B") |eval MN=lower(Month) | eval sort_field=case(MN=="january",1, MN=="february",2, MN=="march",3, MN=="april",4, MN=="may",5, MN=="june",6, MN=="july",7, MN=="august",8, MN=="september",9, MN=="october",10, MN=="november",11, MN=="december",12) | stats dc(JSESSIONID) as Visits by Month, sort_field | sort +sort_field | fields - sort_field

skawasaki_splun · ‎10-24-2014

This is the search you probably want: sourcetype=iis | timechart span=1mon dc(JSESSIONID). Also even if you're satisfied by the performance of this search, I would still highly recommend that you lookin into summary indexing. Any large aggregation over a large time (months, years, etc.) should probably be using a summary index.

skawasaki_splun · ‎10-24-2014

You don't need to create an evaluated field for months. Use date_month instead (an internal, default field).

skawasaki_splun · ‎10-21-2014

Create a daily (or a time range of your choice) summary index:

sourcetype=IIS J SESSIONID=* Page=* | sistats dc(JSESSIONID)

Send this to another index like index=iis_summary. Then search the summary like this:

index=iis_summary | timechart span=1mon dc(JSESSIONID)

Pretend you have 1 million events per day, now instead of searching ~365 million events per year, you'll only be searching 365 events if you did a daily summary.

More info here: http://docs.splunk.com/Documentation/Splunk/6.1.4/Knowledge/Usesummaryindexing

the_wolverine · ‎10-20-2014

You could try filtering further before your dedup to reduce the size of the initial dataset:

sourcetype=IIS Page=* | dedup JSESSIONID | etc.

Also, a year's worth of data is a lot of data. You might want to consider summary indexing to get faster reporting if this is a search that is run regularly.

martin_mueller · ‎10-20-2014

I agree, dedup'ing a year's worth of data in one go is killing ALL the memory... and that's on top of actually loading, decompressing, and working through a year's worth of data in one go.

If you expect only few session IDs to survive over a long time you could approximate your search like this:

sourcetype=IIS JSESSIONID=* Page=* | eval month_num=strftime(_time, "%m") | stats dc(JSESSIONID) by month_num, date_month | sort month_num | fields - month_num

A couple of IDs will be counted twice if they are created before the month ends and destroyed after the new month starts, but the difference will be small if your IDs are short-lived.
A key benefit can be that you're not dedup'ing over the entire year and events, but rather counting by ID and month. Once the search is done with a month it can forget about it.

hagjos43 · ‎10-21-2014

Martin,
Thanks for the response. I'm giving it a shot now, it appears to be either the same speed as dedup or marginally faster.

Anyone else have any ideas on this one? or if anyone has a query developed that tracks visitors to a website over any period of time I'd greatly appreciate having it!

Thanks,
-hagjos43

hagjos43 · ‎10-20-2014

I looked into it, and it appears that this needs to be a scheduled search. Does summary indexing require a static time-frame? This is a Web Analytics report, and it is typically launched by marketing teams who can run this for any time-range, will summary indexes operate in such an environment?

What in this search causes it to take so long?

Introducing Splunk Enterprise 9.2

Adoption of RUM and APM at Splunk

Routing logs with Splunk OTel Collector for Kubernetes