Getting Data In

Missing data - Splunk is showing random gaps in the 'indexed data' timeline and safeService warning in Splunkd.log

mctester
Communicator

My Splunk instance is constantly indexing data 24*7, but I've noticed some gaps in the indexed data timeline recently. I have also noticed that data I could search on yesterday is not being returned today. This doesn't happen consistently, but regularly enough to cause concern. I looked in splunkd.log and index=_internal to ensure that the buckets have not rotated out of the DB, and also confirmed that the buckets spanning the time period of the gap are present and in good shape. What else can I do to track down this missing data?

In splunkd.log I see the following:

05-07-2011 05:44:45.466 +0000 WARN MetaData - /opt/splunk/var/lib/splunk/apache/db/hot_v1_59/Hosts.data: attempting safeService to attempt to fix up metadata

My environment consists of 4 indexers running 4.2, 300 UF instances (also 4.2) and a standalone deployment server, also 4.2. We use the deployment server to manage the configs of all instances.

1 Solution

yannK
Splunk Employee

Hi, this may be caused by defect SPL-39127 in Splunk 4.2.0 and 4.2.1.
The corruption is triggered by a push from the deployment server, which restarts Splunkweb on indexers (including search heads that are performing summary indexing) that are deployment clients.

If Splunk is still at 4.2.0, first apply the latest 4.2.1 release, which fixes an associated defect, SPL-38464, where in rare cases concurrent hash table and string length collisions for metadata field values can cause index-level metadata files to grow to very large sizes, up to several gigabytes.

Reference: http://splunk.com/base/Documentation/4.2/ReleaseNotes/Knownissues

If you encounter this problem, please file a case with Splunk Support:
http://www.splunk.com/support

To confirm whether this is the case, search splunkd.log for something like:

05-07-2011 05:44:45.466 +0000 WARN MetaData - /opt/splunk/var/lib/splunk/apache/db/hot_v1_59/Hosts.data: attempting safeService to attempt to fix up metadata

To find those errors in the internal logs (and on the indexers, in the case of search peers), you can use this search:


index=_internal host="indexer hostname" source=splunkd.log safeService | rex " MetaData - (?P<bucket>.*)/" | stats count by bucket splunk_server

Here is the manual procedure to fix it.

Note: there are two options for the rebuild step: run multiple rebuilds in parallel, or run a single sequential rebuild, as detailed below.

1 - Disable the deployment client to prevent new corruption (until the fix for SPL-39127, targeted for the upcoming maintenance release 4.2.2):


mv $SPLUNK_HOME/etc/system/local/deploymentclient.conf $SPLUNK_HOME/etc/system/local/deploymentclient.disabled

2 - Collect the list of corrupted buckets:
cd $SPLUNK_HOME/bin 
./splunk cmd splunkd fsck --mode metadata --all > /tmp/trash

The buckets with errors will be displayed on the screen, for example:
NEEDS REPAIR: file='/opt/splunk/var/lib/splunk/java/db/db_1303835244_1303775919_106/Hosts.data' code=25 contains recover-padding
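As a sketch of turning that output into a work list: assuming the fsck output (including the on-screen "NEEDS REPAIR" lines) was captured to a file, e.g. with `./splunk cmd splunkd fsck --mode metadata --all > /tmp/fsck_output 2>&1`, the bucket directories can be extracted like this (the file names /tmp/fsck_output and /tmp/corrupted_buckets.txt are assumptions):

```shell
# Extract the bucket directories from captured fsck output.
# /tmp/fsck_output is an assumed capture file; adjust the paths to taste.
grep "NEEDS REPAIR" /tmp/fsck_output 2>/dev/null \
  | sed -e "s/.*file='//" -e "s|/[^/]*'.*||" \
  | sort -u > /tmp/corrupted_buckets.txt
```

The first sed expression strips everything up to the file path, the second drops the trailing file name (e.g. Hosts.data) and the rest of the line, leaving one bucket directory per line.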

3 - Stop Splunk to prevent bucket rotation.

4 - For each of them, rebuild the tsidx files.
The process is long; if you have several buckets, it is faster to run several rebuilds in parallel (use & on Linux):


./splunk cmd splunkd rebuild /pathtothebucketfolder/

For parallel processing

./splunk cmd splunkd rebuild /pathtothebucketfolder1/ &
./splunk cmd splunkd rebuild /pathtothebucketfolder2/ &
etc...

OR run a single command to rebuild all buckets sequentially (takes longer):

./splunk cmd splunkd fsck --mode metadata --all --repair
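As a sketch, the parallel rebuilds above can also be driven from a list of bucket paths with xargs, which caps how many run at once. The list file /tmp/corrupted_buckets.txt (one bucket directory per line) and the parallelism of 4 are assumptions; tune -P to the indexer's CPU count:

```shell
# Run one rebuild per listed bucket, at most 4 at a time.
# /tmp/corrupted_buckets.txt is an assumed file of bucket directories.
cd $SPLUNK_HOME/bin
xargs -n 1 -P 4 ./splunk cmd splunkd rebuild < /tmp/corrupted_buckets.txt
```

This avoids launching an unbounded number of background jobs when many buckets need repair.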

5 - Check the result with:


./splunk cmd splunkd fsck --mode metadata --all

6 - Restart Splunk (this will also apply the change to the deployment client config).

For further information on splunkd fsck, refer to the Community Wiki:

http://www.splunk.com/wiki/Check_and_Repair_Metadata

How to prevent this from happening until 4.2.2 comes out?

There are two workarounds to address this.

  • The workaround for the associated bug SPL-38464 (setting "inPlaceUpdates = false" as a global parameter in the [default] stanza of indexes.conf) is still valid:

    [default]
    inPlaceUpdates = false

    Since we would always atomically update the metadata files via rename, there is no chance of corruption here. There is a chance, perhaps, of ending up with somewhat invalid metadata info, but not with corruption.

  • Another workaround is to set both "restartSplunkWeb = false" AND "restartSplunkd = false" in the relevant serverclass.conf stanzas on the deployment server to disable restarts. The corruption happens in the splunkweb restart code path, but restarting splunkd also triggers a splunkweb restart.
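A minimal sketch of that serverclass.conf workaround, on the deployment server. The stanza name "serverClass:indexers" is hypothetical; put the settings in whichever stanza targets your indexers:

```ini
# Hypothetical server class name -- substitute your own.
[serverClass:indexers]
# Prevent deployment pushes from restarting splunkd/Splunkweb,
# which is the code path that triggers the metadata corruption.
restartSplunkWeb = false
restartSplunkd = false
```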

If applied, these workarounds should be retired once 4.2.2 is installed.


tpsplunk
Communicator

If I upgrade to 4.2.2, do I still need to run the rebuild/repair operations?


Simeon
Splunk Employee
Splunk Employee

If you have forwarders sending data, you can look for forwarder connectivity within the splunkd.log of both the indexers and forwarders. I would first check to make sure the forwarder indeed had connectivity during that time. Are these systems picking up network data or monitoring files? Some keys to debugging:

  • Figure out exactly what source, sourcetype, or host is missing data. Use searches to find them.
  • Compare the actual raw data to the internal indexing volume. Does indexing volume tail off for a specific forwarder or index?
  • Search for "index=_internal source=metrics.log blocked". If something is blocked, that might be the problem.

The above steps are typically enough to figure out if it is a problem getting the data, or indexing the data.
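As a hedged illustration of the volume-comparison step above, the per-host throughput metrics in metrics.log can show whether a specific forwarder's volume tails off during the gap (the one-hour span is an arbitrary choice):

index=_internal source=*metrics.log* group=per_host_thruput | timechart span=1h sum(kb) by series

Here "series" holds the forwarder host name, so a host whose line drops to zero during the gap likely lost connectivity or stopped sending.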
