Monitoring Splunk

What are possible reasons for performance issues (in the application layer)?

sdorich
Communicator

Hi Splunk experts! I desperately need some help here.

So I'm seeing very poor performance: it can take 5 minutes or longer to load the search results on certain dashboards. I want to demo Splunk to potential customers (we're a partner) and the performance I'm seeing is not impressive.

The first answer that'll probably come to mind is hardware limitations or resource bottlenecks, BUT I've made sure that this is not the case. So the performance problem I'm having must be in the application layer.

I tried changing some of the default values in limits.conf, but this did not help. For example, I noticed that max_combiner_memevents was quite low relative to the resources we have, so I increased it.
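
To be concrete, the change was along these lines in $SPLUNK_HOME/etc/system/local/limits.conf (the value shown here is only illustrative, not necessarily the exact number I set):

[search]
max_combiner_memevents = 200000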

Any ideas are much appreciated. If necessary, below is info about our Splunk environment:

Hardware:
-running alone on a physical Linux server (Red Hat Enterprise Linux Server release 6.5 (Santiago))
-4 CPUs (8 logical CPUs)

Splunk:
-Enterprise
-1 indexer
-largest index (this is the most important one for us, and performance issues appear mostly when the base search uses this index): 8,598 MB (50,333,181 events)

Last note: I suppose 50 million events is a lot, but my company is much smaller (fewer than 20 people work for us) than the companies we service, so seeing this kind of performance in our own environment is scary. We also want to avoid clustering for the time being because, again, this is NOT the result of any kind of hardware limitation. Please help!

Thanks

sideview
SplunkTrust

The search mode of smart/verbose/fast is not relevant in advanced XML because the framework handles the choice automatically. Basically, if one or more modules present in your config implicitly require a slower mode, then that search will be dispatched with the slower mode. If you aren't using a reasonably recent Sideview Utils, the framework may be dispatching some searches inefficiently, but it's basically automatic.
If you pastebin your XML, I can comment on what changes you could make.

sdorich
Communicator

I have not tried changing the search mode to "fast". How would I do that in advanced XML? And I'm extracting many fields (in props.conf) at search time, but that's about it.
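
To give a rough idea, the props.conf entries look something like this (the regexes below are placeholders rather than my real extractions; the sourcetype and field names are the ones from my searches):

[sitescope_daily_log]
EXTRACT-monitor_host = sis_monitor_host=(?<sis_monitor_host>\S+)
EXTRACT-monitor_status = sis_monitor_status=(?<sis_monitor_status>\w+)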

Rob
Splunk Employee

Do you have anything in your props.conf that you are doing at search time?

Have you tried changing the search from "smart mode"/"verbose mode" to "fast mode"?

sdorich
Communicator

We are using RAID 0. The output below is the disk I/O with 1-second poll intervals during a search:
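
For reference, this came from iostat; the invocation was along these lines (sysstat package - the exact flags aren't recorded here):

iostat -dk 1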

Linux 2.6.32-431.el6.x86_64 (slc-srv-splunk.glg.local) 03/28/2014 x86_64 (4 CPU)

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
cciss/c0d0 8.89 9.65 113.38 5699919 66941072
dm-0 28.62 9.63 113.38 5684777 66940924
dm-1 0.00 0.00 0.00 1288 0
dm-2 0.00 0.00 0.00 925 132

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
cciss/c0d0 34.00 0.00 492.00 0 492
dm-0 123.00 0.00 492.00 0 492
dm-1 0.00 0.00 0.00 0 0
dm-2 0.00 0.00 0.00 0 0

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
cciss/c0d0 3.00 0.00 12.00 0 12
dm-0 3.00 0.00 12.00 0 12
dm-1 0.00 0.00 0.00 0 0
dm-2 0.00 0.00 0.00 0 0

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
cciss/c0d0 17.00 0.00 640.00 0 640
dm-0 160.00 0.00 640.00 0 640
dm-1 0.00 0.00 0.00 0 0
dm-2 0.00 0.00 0.00 0 0

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
cciss/c0d0 15.00 0.00 600.00 0 600
dm-0 150.00 0.00 600.00 0 600
dm-1 0.00 0.00 0.00 0 0
dm-2 0.00 0.00 0.00 0 0

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
cciss/c0d0 15.00 0.00 232.00 0 232
dm-0 58.00 0.00 232.00 0 232
dm-1 0.00 0.00 0.00 0 0
dm-2 0.00 0.00 0.00 0 0

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
cciss/c0d0 15.00 0.00 544.00 0 544
dm-0 136.00 0.00 544.00 0 544
dm-1 0.00 0.00 0.00 0 0
dm-2 0.00 0.00 0.00 0 0

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
cciss/c0d0 11.00 0.00 92.00 0 92
dm-0 23.00 0.00 92.00 0 92
dm-1 0.00 0.00 0.00 0 0
dm-2 0.00 0.00 0.00 0 0

martin_mueller
SplunkTrust

There's nothing particularly telling either way in that iostat output.

Maybe you should file a proper support ticket, or talk to your channel enablement team about some hands-on support... or just use a different box for your demos.

sdorich
Communicator

OK, so how about the output I just posted? Doesn't it look fine, implying there's no issue with the server and it must therefore be something at the Splunk app level?

martin_mueller
SplunkTrust

While RAID 5 isn't as fast as RAID 1+0, it's not going to throw you back into floppy disk speeds.

sdorich
Communicator

Job inspector output for one of the slow searches (note that it only took 14 seconds, but that's only over the last 15 minutes):

This search has completed and has returned 23 results by scanning 309 events in 13.604 seconds.

The following messages were returned by the search subsystem:

DEBUG: base lispy: [ AND index::sitescope sourcetype::sitescope_daily_log ]
DEBUG: search context: user="admin", app="glg", bs-pathname="/opt/splunk/etc"
(SID: 1395958659.213)

Execution costs

Duration (seconds) Component Invocations Input count Output count
0.002 command.fields 4 38 38
0.001 command.presort 1 23 23
0.001 command.rename 1 23 23
13.403 command.search 4 - 38
0.005 command.search.index 4 - -
0.004 command.search.filter 3 - -
0.003 command.search.fieldalias 3 309 309
0.002 command.search.calcfields 3 309 309
0 command.search.index.usec_1_8 50 - -
0 command.search.index.usec_8_64 2 - -
13.382 command.search.kv 3 - -
0.013 command.search.rawdata 3 - -
0.006 command.search.typer 4 38 38
0.004 command.search.tags 4 38 38
0.003 command.search.lookups 3 309 309
0 command.search.summary 4 - -
0.001 command.sort 1 23 23
0.004 command.stats.execute_input 5 - -
0.001 command.stats.execute_output 1 - -
0.001 command.table 1 23 46
0.002 dispatch.check_disk_usage 2 - -
0.056 dispatch.createProviderQueue 1 - -
0.12 dispatch.evaluate 1 - -
0.119 dispatch.evaluate.search 1 - -
0.001 dispatch.evaluate.rename 1 - -
0.001 dispatch.evaluate.sort 1 - -
0.001 dispatch.evaluate.stats 1 - -
0.001 dispatch.evaluate.table 1 - -
13.405 dispatch.fetch 5 - -
0.001 dispatch.preview 1 - -
0.001 dispatch.preview.command.presort 1 23 23
0.001 dispatch.preview.command.rename 1 23 23
0.001 dispatch.preview.command.sort 1 23 23
0.001 dispatch.preview.command.table 1 23 46
0.001 dispatch.preview.stats.execute_output 1 - -
0.001 dispatch.preview.write_results_to_disk 1 - -
13.403 dispatch.stream.local 4 - -
0.016 dispatch.timeline 5 - -
0.013 dispatch.writeStatus 7 - -
0.062 startup.handoff 1 - -
Search job properties

canSummarize 1
createTime 2014-03-27T16:17:39.000-06:00
cursorTime 2014-03-27T16:02:00.000-06:00
custom {'dispatch.earliest_time': '-15m', 'display.general.type': 'events', 'search': 'index=sitescope sourcetype=sitescope_daily_log sis_monitor_host=* | stats count by sis_monitor_host sis_monitor_status | table sis_monitor_host sis_monitor_status count | sort sis_monitor_status,-count | rename sis_monitor_host as MonitoredHost, sis_monitor_status as "Status", count as "Count"', 'dispatch.latest_time': 'now'}
defaultSaveTTL 604800
defaultTTL 600
delegate None
diskUsage 237568
dispatchState DONE
doneProgress 1.0
dropCount 0
eai:acl {'sharing': 'global', 'app': 'glg', 'modifiable': '1', 'perms': {'read': ['admin'], 'write': ['admin']}, 'owner': 'admin', 'can_write': '1', 'ttl': '600'}
earliestTime 2014-03-27T16:02:39.000-06:00
eventAvailableCount 38
eventCount 38
eventFieldCount 30
eventIsStreaming True
eventIsTruncated False
eventSearch search index=sitescope sourcetype=sitescope_daily_log sis_monitor_host=*
eventSorting desc
indexEarliestTime 1395957779
indexLatestTime 1395958558
isBatchModeSearch False
isDone True
isFailed False
isFinalized False
isGoodSummarizationCandidate 1
isPaused False
isPreviewEnabled True
isRealTimeSearch False
isRemoteTimeline False
isSaved False
isSavedSearch False
isZombie False
keywords index::sitescope sis_monitor_host::* sourcetype::sitescope_daily_log
label None
latestTime 2014-03-27T16:17:39.000-06:00
meanPreviewPeriod 13.604000
messages {'debug': ['base lispy: [ AND index::sitescope sourcetype::sitescope_daily_log ]', 'search context: user="admin", app="glg", bs-pathname="/opt/splunk/etc"']}
modifiedTime 2014-03-27T16:18:35.742-06:00
normalizedSearch litsearch index=sitescope sourcetype=sitescope_daily_log sis_monitor_host=* | fields keepcolorder=t "" "bkt" "_cd" "_si" "host" "index" "linecount" "prestats_reserved" "psrsvd_" "sis_monitor_host" "sis_monitor_status" "source" "sourcetype" "splunk_server"
numPreviews 1
performance {'dispatch.evaluate.search': {'duration_secs': '0.119000', 'invocations': '1'}, 'command.search.index.usec_1_8': {'invocations': '50'}, 'command.search.typer': {'input_count': '38', 'invocations': '4', 'output_count': '38', 'duration_secs': '0.006000'}, 'command.search.fieldalias': {'input_count': '309', 'invocations': '3', 'output_count': '309', 'duration_secs': '0.003000'}, 'command.search.index': {'duration_secs': '0.005000', 'invocations': '4'}, 'dispatch.evaluate.sort': {'duration_secs': '0.001000', 'invocations': '1'}, 'dispatch.preview.command.sort': {'input_count': '23', 'invocations': '1', 'output_count': '23', 'duration_secs': '0.001000'}, 'command.search.filter': {'duration_secs': '0.004000', 'invocations': '3'}, 'command.search.index.usec_8_64': {'invocations': '2'}, 'dispatch.preview.command.table': {'input_count': '23', 'invocations': '1', 'output_count': '46', 'duration_secs': '0.001000'}, 'dispatch.createProviderQueue': {'duration_secs': '0.056000', 'invocations': '1'}, 'command.rename': {'input_count': '23', 'invocations': '1', 'output_count': '23', 'duration_secs': '0.001000'}, 'command.search.lookups': {'input_count': '309', 'invocations': '3', 'output_count': '309', 'duration_secs': '0.003000'}, 'dispatch.evaluate.rename': {'duration_secs': '0.001000', 'invocations': '1'}, 'command.search.calcfields': {'input_count': '309', 'invocations': '3', 'output_count': '309', 'duration_secs': '0.002000'}, 'command.fields': {'input_count': '38', 'invocations': '4', 'output_count': '38', 'duration_secs': '0.002000'}, 'dispatch.evaluate': {'duration_secs': '0.120000', 'invocations': '1'}, 'command.sort': {'input_count': '23', 'invocations': '1', 'output_count': '23', 'duration_secs': '0.001000'}, 'startup.handoff': {'duration_secs': '0.062000', 'invocations': '1'}, 'command.stats.execute_output': {'duration_secs': '0.001000', 'invocations': '1'}, 'dispatch.preview.command.rename': {'input_count': '23', 'invocations': '1', 'output_count': '23', 'duration_secs': '0.001000'}, 'dispatch.evaluate.stats': {'duration_secs': '0.001000', 'invocations': '1'}, 'command.table': {'input_count': '23', 'invocations': '1', 'output_count': '46', 'duration_secs': '0.001000'}, 'dispatch.check_disk_usage': {'duration_secs': '0.002000', 'invocations': '2'}, 'dispatch.evaluate.table': {'duration_secs': '0.001000', 'invocations': '1'}, 'dispatch.writeStatus': {'duration_secs': '0.013000', 'invocations': '7'}, 'command.search.kv': {'duration_secs': '13.382000', 'invocations': '3'}, 'dispatch.preview.stats.execute_output': {'duration_secs': '0.001000', 'invocations': '1'}, 'command.presort': {'input_count': '23', 'invocations': '1', 'output_count': '23', 'duration_secs': '0.001000'}, 'command.search.tags': {'input_count': '38', 'invocations': '4', 'output_count': '38', 'duration_secs': '0.004000'}, 'dispatch.stream.local': {'duration_secs': '13.403000', 'invocations': '4'}, 'dispatch.preview.write_results_to_disk': {'duration_secs': '0.001000', 'invocations': '1'}, 'command.search.summary': {'invocations': '4'}, 'dispatch.fetch': {'duration_secs': '13.405000', 'invocations': '5'}, 'command.search.rawdata': {'duration_secs': '0.013000', 'invocations': '3'}, 'dispatch.preview': {'duration_secs': '0.001000', 'invocations': '1'}, 'command.stats.execute_input': {'duration_secs': '0.004000', 'invocations': '5'}, 'dispatch.preview.command.presort': {'input_count': '23', 'invocations': '1', 'output_count': '23', 'duration_secs': '0.001000'}, 'command.search': {'input_count': '0', 'invocations': '4', 
'output_count': '38', 'duration_secs': '13.403000'}, 'dispatch.timeline': {'duration_secs': '0.016000', 'invocations': '5'}}
pid 8822
priority 5
reduceSearch sistats count by sis_monitor_host sis_monitor_status
remoteSearch litsearch index=sitescope sourcetype=sitescope_daily_log sis_monitor_host=*
| fields keepcolorder=t "" "bkt" "_cd" "_si" "host" "index" "linecount" "prestats_reserved" "psrsvd_" "sis_monitor_host" "sis_monitor_status" "source" "sourcetype" "splunk_server"
reportSearch stats count by sis_monitor_host sis_monitor_status | table sis_monitor_host sis_monitor_status count | sort sis_monitor_status,-count | rename sis_monitor_host as MonitoredHost, sis_monitor_status as "Status", count as "Count"
request {'auto_cancel': '30', 'adhoc_search_level': 'verbose', 'custom.dispatch.earliest_time': '-15m', 'status_buckets': '300', 'rf': '
', 'custom.dispatch.latest_time': 'now', 'indexedRealtime': None, 'search': 'search index=sitescope sourcetype=sitescope_daily_log sis_monitor_host=* | stats count by sis_monitor_host sis_monitor_status | table sis_monitor_host sis_monitor_status count | sort sis_monitor_status,-count | rename sis_monitor_host as MonitoredHost, sis_monitor_status as "Status", count as "Count"', 'earliest_time': '-15m', 'custom.search': 'index=sitescope sourcetype=sitescope_daily_log sis_monitor_host=* | stats count by sis_monitor_host sis_monitor_status | table sis_monitor_host sis_monitor_status count | sort sis_monitor_status,-count | rename sis_monitor_host as MonitoredHost, sis_monitor_status as "Status", count as "Count"', 'ui_dispatch_app': 'glg', 'preview': '1', 'latest_time': 'now'}
resultCount 23
resultIsStreaming False
resultPreviewCount 23
runDuration 13.604
runtime {'auto_cancel': '30', 'auto_pause': '0'}
scanCount 309
search search index=sitescope sourcetype=sitescope_daily_log sis_monitor_host=* | stats count by sis_monitor_host sis_monitor_status | table sis_monitor_host sis_monitor_status count | sort sis_monitor_status,-count | rename sis_monitor_host as MonitoredHost, sis_monitor_status as "Status", count as "Count"
searchEarliestTime 1395957759.000000000
searchLatestTime 1395958659.000000000
searchProviders ['slc-srv-splunk.glg.local']
sid 1395958659.213
statusBuckets 300
ttl 600

martin_mueller
SplunkTrust

Your dashboard searches would behave like a scheduled search only if someone actually scheduled them - check your list of saved searches (reports/alerts since v6) to see if any have a next scheduled execution time.

As for checking your disk, you can run IO benchmarks - see http://wiki.splunk.com/Community:HardwareTuningFactors for some info.
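
For example, something along these lines should list any scheduled searches (a sketch against the saved/searches REST endpoint; trim the fields to taste):

| rest /servicesNS/-/-/saved/searches
| search is_scheduled=1 disabled=0
| table title cron_schedule next_scheduled_time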

sdorich
Communicator

I have no scheduled searches. I just have multiple dashboards in an app I made. Is there any setting that would cause my dashboard searches to behave like a scheduled search?

Also, my "events" are more like one line a piece (they're not long at all).

Not sure if my Linux admin will be able to help me with disk storage; I asked him to check the health of my Splunk server yesterday and he said everything was "perfect"... is there any other direction I can go in or can anyone give me more specifics on how to check the health of my "disk storage"? Thanks.

sideview
SplunkTrust

If you have a lot of other things happening, like lots of scheduled searches going all the time, then you can be hammering disk and making bad disk IO even worse. Also if your "events" are 10,000 lines apiece. 😃 Just sort of top level ideas here.

martin_mueller
SplunkTrust

Well, you can see from the bottom of the performance part in the job inspector that basically all the time is spent doing dispatch.stream.local - that means loading events off disk from the standalone indexer. For this kind of search my laptop does thousands of events per second, not dozens.

Your Linux admins should be able to check if your disks are working well and fast or not.
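
If you want a very rough first check yourself (not a proper benchmark, and assuming your indexes live under /opt/splunk with enough free space for a 1 GB test file), you could time a large sequential write and read with dd and compare the throughput it reports against what the hardware should deliver:

dd if=/dev/zero of=/opt/splunk/ddtest bs=1M count=1024 oflag=direct
dd if=/opt/splunk/ddtest of=/dev/null bs=1M iflag=direct
rm /opt/splunk/ddtest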

sdorich
Communicator

OK, I'm assuming this is a Linux admin deal. Is that correct, or is what you're talking about ("disk storage working correctly") specific to Splunk? If so, would you be able to direct me to "best-practice" documentation for setting up disk storage? Thanks.

martin_mueller
SplunkTrust

I see. Of the total 13.6s, your search spent 13.4s fetching 309 events off disk - that's very slow indeed.

Make sure your disk storage is working as it should... from a Splunk POV I'm a bit lost as to where that time is spent.

martin_mueller
SplunkTrust

Take a screenshot of the job inspector for that search and append it to your question. The interesting bits are the text lines at the top and the bars below that; you can omit the long table at the bottom.

sdorich
Communicator

Oh and the data is just a typical log file (every new line is a new event). I can post a sample event if needed but I feel like the data itself is irrelevant to my problem since Splunk is meant for searching through log files.

sdorich
Communicator

I have a time range picker that controls the time range of all my dashboards (I've created everything using advanced XML). One search that takes forever is:

...| stats count by x y | table x y count | sort - count | rename ...

This search takes a long time (more than a minute) even for a time range of the last 15 minutes. It wouldn't make sense for it to be an inefficient search, because I've used similar searches many times and they're usually not slow (not to mention this is a pretty simple search).

martin_mueller
SplunkTrust

50 million events in an index usually isn't a lot on a dedicated Splunk box.

Usually performance issues are caused by inefficient searches, so do post the searches you're running, their time ranges, and maybe some sample data.
