Getting Data In

Why am I seeing indexing lag?

rjyetter
Path Finder

I have 11 indexing servers, all with 16 CPUs, RAID 10 storage, and 1Gb full-duplex networking, with no swap usage, and they all sit at about 60-80% idle. Something is not right: I am seeing indexing lag of up to 5 hours on most of the servers. Are there any tuning parameters I need to check within Splunk to get better throughput on the indexer side?

Thanks,

Rick

hexx
Splunk Employee
Splunk Employee

Some suggestions:

1 - Install the Splunk on Splunk app on your search-head. Take a look at the "Indexing Performance" view. Are all queues blocked down to the indexer queue or is there blockage upstream of that?

2 - In the SoS app "Indexing Performance" view, do you see latency across the board in the "Measured indexing latency" table at the top of the page, or is it only affecting a subset of hosts/sourcetypes/sources/splunk_server?

3 - In the SoS app "Errors" view, do you see any reports of indexing throttling because some buckets contain too many tsidx files?

4 - We might want to check the size of your metadata files, particularly your Sources.data. Run the following command against your $SPLUNK_DB and report the output:

find "$SPLUNK_DB" -maxdepth 3 -name "*.data" -size +25M | xargs ls -lh

You can get $SPLUNK_DB from $SPLUNK_HOME/etc/splunk-launch.conf (see the shell sketch below). By default, $SPLUNK_DB is set to $SPLUNK_HOME/var/lib/splunk.

If this command finds any metadata files larger than 25MB, that could be one of the reasons for your indexing performance degradation.
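
If $SPLUNK_DB isn't exported in your shell, here is a minimal sketch (assuming a default splunk-launch.conf layout) that resolves it before running the same metadata-size check:

# Resolve SPLUNK_DB from splunk-launch.conf, falling back to the default location
SPLUNK_DB=$(grep '^SPLUNK_DB' "$SPLUNK_HOME/etc/splunk-launch.conf" | cut -d= -f2)
SPLUNK_DB=${SPLUNK_DB:-$SPLUNK_HOME/var/lib/splunk}

# Same metadata-size check as above
find "$SPLUNK_DB" -maxdepth 3 -name "*.data" -size +25M | xargs ls -lh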

rjyetter
Path Finder

Stupid stupid stupid!!

One of my sources is from across the pond in a different timezone - hence the massive apparent indexing lag.
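
The usual fix is to declare the source's timezone in props.conf on the indexer or heavy forwarder that parses it - a minimal sketch, with a placeholder sourcetype name and example timezone:

# props.conf on the parsing tier; "my_remote_sourcetype" is a placeholder
[my_remote_sourcetype]
TZ = Europe/London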

JuhiSaxena
Explorer

Hi rjyetter, how was the timezone issue resolved? I'm also facing a similar issue.

gjanders
SplunkTrust
SplunkTrust

@JuhiSaxena, you will be better served by creating a new post; this one is from 2012.

hexx
Splunk Employee
Splunk Employee

Aha! Good catch 🙂 Are you all set, then?

wwhitener
Communicator

Might also check to make doubly sure that the times are synced up. I had one that was set for a different time and it made everything wonky.
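
A quick way to eyeball the clock offsets, assuming ntpd is what keeps the hosts in sync:

# Query the local NTP daemon's peers and offsets on each server
ntpq -pn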

rjyetter
Path Finder

Time sync is fine - all servers are within a second of each other, and write I/O wait is less than 1 second on each server.

Simeon
Splunk Employee
Splunk Employee

Rick, this could be due to a few things.

First, check the index time to confirm whether Splunk is seeing and indexing the data later than expected. The search below will find the delay in seconds:

source=<your delayed source> | eval delay=_indextime-_time | fields delay
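
To see where the delay is concentrated, a variant of the same search (a sketch; host and sourcetype are the default fields) summarizes it per host and sourcetype:

source=<your delayed source> | eval delay=_indextime-_time | stats avg(delay) AS avg_delay_sec max(delay) AS max_delay_sec by host, sourcetype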

Typically, delays of hours will mean that the indexer is backed up OR the data is being read in at a slower pace than expected. To see if the indexer is backed up (we call it blocked), search as follows:

index=_internal source=*metrics.log blocked

If this returns events, this means the system is being blocked. If there are a lot of them, that is not a good sign and you should contact support. Support can determine if it's disk speed or something else by identifying the particular part of the queue system that is backed up.

Another thing to check is the maximum thruput for the indexers and forwarders. There is a maximum thruput setting within limits.conf that will be set to 256 KB per second on lightweight forwarders:

[thruput]
maxKBps = 256
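
If the forwarder-side cap turns out to be the bottleneck, a minimal sketch of the change (placed in a local limits.conf on the forwarder; 0 removes the limit, so raise it with care):

# $SPLUNK_HOME/etc/system/local/limits.conf on the forwarder
[thruput]
# 0 = unlimited; any positive value is a KB/s cap
maxKBps = 0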

Simeon
Splunk Employee
Splunk Employee

The thruput limit applies to indexers and forwarders alike, meaning any Splunk instance.

The distribution of the blocked events and which queue they are from will tell us where to look next. If you are constantly adding new data sets and they are very large, then I suspect you need to tune some of the new inputs so they are parsed faster. It is also possible that your disks just can't keep up with your thruput (particularly if it is running well over 3 MB/s of indexing thruput per indexer).
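
To estimate per-indexer indexing thruput, here is a sketch against metrics.log (field names as they appear in the thruput group); values consistently above roughly 3,000 kbps per indexer are in the territory described above:

index=_internal source=*metrics.log group=thruput name=thruput | timechart span=10m avg(instantaneous_kbps) AS avg_kbps by host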

tskinnerivsec
Contributor

You could run a search like this:

index=_internal host=* source=*metrics.log group=queue blocked=true | rename host as Indexer | chart count(blocked) as "Queue Blocks" by Indexer, name

to create a chart of the count of blocks by indexer and queue.
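
A timechart variant of the same search shows when, not just where, the queues block:

index=_internal host=* source=*metrics.log group=queue blocked=true | timechart span=15m count by name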

rjyetter
Path Finder

So what I have now is 800 events for index=_internal source=*metrics.log blocked - is there a setting for index throughput? The problem is we're adding new sources daily, and we need to figure this out before the lag starts affecting more real-time searches. Calling support now!
