Getting Data In

Missing data - Splunk is showing random gaps in the 'indexed data' timeline and safeService warning in Splunkd.log

mctester
Communicator

My Splunk instance is constantly indexing data 24*7, but I've noticed some gaps in the indexed data timeline recently. I have also noticed that data I could search on yesterday is not being returned today. This doesn't happen consistently, but regularly enough to cause concern. I looked in splunkd.log and index=_internal to ensure that the buckets have not rotated out of the DB, and also confirmed that the buckets spanning the time period of the gap are present and in good shape. What else can I do to track down this missing data?

In splunkd.log I see the following:

05-07-2011 05:44:45.466 +0000 WARN MetaData - /opt/splunk/var/lib/splunk/apache/db/hot_v1_59/Hosts.data: attempting safeService to attempt to fix up metadata

My environment consists of 4 indexers running 4.2, 300 UF instances (also 4.2) and a standalone deployment server, also 4.2. We use the deployment server to manage the configs of all instances.

1 Solution

yannK
Splunk Employee

Hi, this may be caused by defect SPL-39127 in Splunk 4.2.0 and 4.2.1.
The corruption is triggered by a push from the deployment server, which restarts Splunkweb on indexers (including search heads that are performing summary indexing) that are deployment clients.

If Splunk is still at 4.2.0, first apply the latest 4.2.1 release, which fixes an associated defect, SPL-38464, where in rare cases concurrent hash table and string length collisions for metadata field values can cause index-level metadata files to grow to very large sizes, up to several gigabytes.

Reference: http://splunk.com/base/Documentation/4.2/ReleaseNotes/Knownissues

If you encounter this problem, please file a case with Splunk Support:
http://www.splunk.com/support

To confirm whether this is the case, search splunkd.log for something like:

05-07-2011 05:44:45.466 +0000 WARN MetaData - /opt/splunk/var/lib/splunk/apache/db/hot_v1_59/Hosts.data: attempting safeService to attempt to fix up metadata

To find those errors in the internal logs (and on the indexers, in the case of search peers), you can use this search:


index=_internal host="indexer hostname" source=splunkd.log safeService | rex " MetaData - (?P<bucket>.*)/" | stats count by bucket splunk_server

Here is the manual procedure to fix it.

Note: there are two options for the rebuild step: run multiple rebuilds in parallel, or run a single sequential rebuild, as detailed below.

1 - Disable the deployment client to prevent new corruption (until the fix for SPL-39127, targeted for the upcoming maintenance release 4.2.2):


mv $SPLUNK_HOME/etc/system/local/deploymentclient.conf $SPLUNK_HOME/etc/system/local/deploymentclient.disabled

2 - Collect the list of corrupted buckets:
cd $SPLUNK_HOME/bin 
./splunk cmd splunkd fsck --mode metadata --all > /tmp/trash

The buckets with errors will be displayed on the screen, for example:
NEEDS REPAIR: file='/opt/splunk/var/lib/splunk/java/db/db_1303835244_1303775919_106/Hosts.data' code=25 contains recover-padding
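As a sketch of turning that output into a work list: assuming the fsck output (including the on-screen "NEEDS REPAIR" lines) was captured to a file, e.g. with `./splunk cmd splunkd fsck --mode metadata --all > /tmp/fsck_output 2>&1`, the bucket directories can be extracted like this (the file names /tmp/fsck_output and /tmp/corrupted_buckets.txt are assumptions):

```shell
# Extract the bucket directories from captured fsck output.
# /tmp/fsck_output is an assumed capture file; adjust the paths to taste.
grep "NEEDS REPAIR" /tmp/fsck_output 2>/dev/null \
  | sed -e "s/.*file='//" -e "s|/[^/]*'.*||" \
  | sort -u > /tmp/corrupted_buckets.txt
```

The first sed expression strips everything up to the file path, the second drops the trailing file name (e.g. Hosts.data) and the rest of the line, leaving one bucket directory per line.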

3 - Stop Splunk to prevent bucket rotation.

4 - For each of them, rebuild the tsidx files.
The process is long; if you have several buckets, it is faster to run several rebuilds in parallel (use & on Linux):


./splunk cmd splunkd rebuild /pathtothebucketfolder/

For parallel processing

./splunk cmd splunkd rebuild /pathtothebucketfolder1/ &
./splunk cmd splunkd rebuild /pathtothebucketfolder2/ &
etc...

OR run a single command to rebuild all buckets sequentially (takes longer):

./splunk cmd splunkd fsck --mode metadata --all --repair
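As a sketch, the parallel rebuilds above can also be driven from a list of bucket paths with xargs, which caps how many run at once. The list file /tmp/corrupted_buckets.txt (one bucket directory per line) and the parallelism of 4 are assumptions; tune -P to the indexer's CPU count:

```shell
# Run one rebuild per listed bucket, at most 4 at a time.
# /tmp/corrupted_buckets.txt is an assumed file of bucket directories.
cd $SPLUNK_HOME/bin
xargs -n 1 -P 4 ./splunk cmd splunkd rebuild < /tmp/corrupted_buckets.txt
```

This avoids launching an unbounded number of background jobs when many buckets need repair.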

5 - Check the result with:


./splunk cmd splunkd fsck --mode metadata --all

6 - Restart Splunk (this will also apply the change to the deployment client config).

For further information on splunkd fsck, refer to the Community Wiki:

http://www.splunk.com/wiki/Check_and_Repair_Metadata

How to prevent this from happening until 4.2.2 comes out?

There are two workarounds to address this.

  • The workaround for the associated bug SPL-38464 (setting "inPlaceUpdates = false" as a global parameter in the [default] stanza of indexes.conf) is still valid:

    [default]
    inPlaceUpdates = false

    Since we would always atomically update the metadata files via rename, there is no chance of corruption here. There is a chance, perhaps, of ending up with somewhat invalid metadata info, but not with corruption.

  • Another workaround is to set both "restartSplunkWeb = false" AND "restartSplunkd = false" in the relevant serverclass.conf stanzas on the deployment server to disable restarts. The corruption happens in the splunkweb restart code path, but restarting splunkd also triggers a splunkweb restart.
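A minimal sketch of that serverclass.conf workaround, on the deployment server. The stanza name "serverClass:indexers" is hypothetical; put the settings in whichever stanza targets your indexers:

```ini
# Hypothetical server class name -- substitute your own.
[serverClass:indexers]
# Prevent deployment pushes from restarting splunkd/Splunkweb,
# which is the code path that triggers the metadata corruption.
restartSplunkWeb = false
restartSplunkd = false
```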

If applied, these workarounds should be retired once 4.2.2 is installed.


tpsplunk
Communicator

If I upgrade to 4.2.2, do I still need to run the rebuild/repair operations?


Simeon
Splunk Employee
Splunk Employee

If you have forwarders sending data, you can look for forwarder connectivity within the splunkd.log of both the indexers and forwarders. I would first check to make sure the forwarder indeed had connectivity during that time. Are these systems picking up network data or monitoring files? Some keys to debugging:

  • Figure out exactly what source, sourcetype, or host is missing data. Use searches to find them.
  • Compare the actual raw data to the internal indexing volume. Does indexing volume tail off for a specific forwarder or index?
  • Search for "index=_internal source=metrics.log blocked". If something is blocked, that might be the problem.

The above steps are typically enough to figure out if it is a problem getting the data, or indexing the data.
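As a hedged illustration of the volume-comparison step above, the per-host throughput metrics in metrics.log can show whether a specific forwarder's volume tails off during the gap (the one-hour span is an arbitrary choice):

index=_internal source=*metrics.log* group=per_host_thruput | timechart span=1h sum(kb) by series

Here "series" holds the forwarder host name, so a host whose line drops to zero during the gap likely lost connectivity or stopped sending.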
