Solved: Splunk crashes repeatedly (Cannot open manifest fi...

iKate · ‎11-15-2013

Splunk started crashing, with crash log entries like this:

[build 182037] 2013-11-14 11:02:27
Received fatal signal 6 (Aborted).
 Cause:
   Signal sent by PID 4283 running under UID 0.
 Crashing thread: archivereader
 Registers:
    RIP:  [0x00007F4DC9918B25] gsignal + 53 (/lib/libc.so.6)
    RDI:  [0x00000000000010BB]
    RSI:  [0x00000000000010F1]
    RBP:  [0x00007F4DC9A2E74C]
    RSP:  [0x00007F4DB7BEE068]
    RAX:  [0x0000000000000000]
    RBX:  [0x00007FFF8B497801]
    RCX:  [0xFFFFFFFFFFFFFFFF]
    RDX:  [0x0000000000000006]
    R8:  [0x00007F4DB7BFF700]
    R9:  [0x00007F4DC9A306B4]
    R10:  [0x0000000000000008]
    R11:  [0x0000000000000206]
    R12:  [0x00000000012747F9]
    R13:  [0x00000000013871A0]
    R14:  [0x00007F4DC9A2E74C]
    R15:  [0x00000000000006DC]
    EFL:  [0x0000000000000206]
    TRAPNO:  [0x0000000000000000]
    ERR:  [0x0000000000000000]
    CSGSFS:  [0x0000000000000033]
    OLDMASK:  [0x0000000000000000]
 OS: Linux
 Arch: x86-64
 Backtrace:
  [0x00007F4DC9918B25] gsignal + 53 (/lib/libc.so.6)
  [0x00007F4DC991C670] abort + 384 (/lib/libc.so.6)
  [0x00007F4DC99119F1] __assert_fail + 241 (/lib/libc.so.6)
  [0x0000000000D025F3] _ZN14PolledReadPipeD2Ev + 147 (splunkd)
  [0x0000000000D4D261] _ZN12PipeToLoggerD2Ev + 97 (splunkd)
  [0x0000000000AA0880] _ZN14ArchiveContext7processERK8PathnameP13ISourceWriter + 1216 (splunkd)
  [0x0000000000AA0E95] _ZN14ArchiveContext9readFullyEP13ISourceWriterRb + 1221 (splunkd)
  [0x000000000083CFA2] _ZN16ArchiveProcessor20haveReadAsNonArchiveE14FileDescriptorlPK3Str + 578 (splunkd)
  [0x000000000083EE53] _ZN16ArchiveProcessor4mainEv + 2755 (splunkd)
  [0x0000000000D81A2D] _ZN6Thread8callMainEPv + 61 (splunkd)
  [0x00007F4DC9C719CA] ? (/lib/libpthread.so.0)
  [0x00007F4DC99CE21D] clone + 109 (/lib/libc.so.6)
 Linux / css-prod-back.scartel.dc / 2.6.32-45-server / #99-Ubuntu SMP Tue Oct 16 16:41:38 UTC 2012 / x86_64
 Last few lines of stderr (may contain info on assertion failure, but also could be old):
    2013-11-14 11:00:55.156 +0400 splunkd started (build 182037)
    Cannot open manifest file inside "/opt/splunk/var/lib/splunk/audit/db/db_1384412390_1384412390_623/rawdata": No such file or directory
    splunkd: /opt/splunk/p4/splunk/branches/6.0.0/src/util/EventLoop.cpp:1756: virtual PolledReadPipe::~PolledReadPipe(): Assertion `!isActive()' failed.
    2013-11-14 11:02:25.275 +0400 splunkd started (build 182037)
    Cannot open manifest file inside "/opt/splunk/var/lib/splunk/audit/db/db_1384412455_1384412455_624/rawdata": No such file or directory
    splunkd: /opt/splunk/p4/splunk/branches/6.0.0/src/util/EventLoop.cpp:1756: virtual PolledReadPipe::~PolledReadPipe(): Assertion `!isActive()' failed.
 /etc/debian_version: squeeze/sid
 glibc version: 2.11.1
 glibc release: stable
Last errno: 11
Threads running: 40
argv: [splunkd -p 8089 start]
Thread: "archivereader", did_join=0, ready_to_run=Y, main_thread=N
First 8 bytes of Thread token @0x7f4db7c2b330:
00000000  00 f7 bf b7 4d 7f 00 00                           |....M...|
00000008
x86 CPUID registers:
         0: 0000000A 756E6547 6C65746E 49656E69
         1: 00010676 04040800 000CE3BD BFEBFBFF
         2: 05B0B101 005657F0 00000000 2CB4304E
         3: 00000000 00000000 00000000 00000000
         4: 00000000 00000000 00000000 00000000
         5: 00000040 00000040 00000003 00002220
         6: 00000001 00000002 00000001 00000000
         7: 00000000 00000000 00000000 00000000
         8: 00000400 00000000 00000000 00000000
         9: 00000000 00000000 00000000 00000000
         A: 07280202 00000000 00000000 00000503
  80000000: 80000008 00000000 00000000 00000000
  80000001: 00000000 00000000 00000001 20000800
  80000002: 65746E49 2952286C 6F655820 2952286E
  80000003: 55504320 20202020 20202020 45202020
  80000004: 30353435 20402020 30302E33 007A4847
  80000005: 00000000 00000000 00000000 00000000
  80000006: 00000000 00000000 18008040 00000000
  80000007: 00000000 00000000 00000000 00000000
  80000008: 00003026 00000000 00000000 00000000
terminating...

We've already tried a repair with splunk-fsck, but it changed nothing; in fact, it found no corrupted indexes.
The directory named in the crash message exists, but I'm not sure about the manifest file.
Here's what is in that directory:

root@css-prod-back:/opt/splunk/bin# ls /opt/splunk/var/lib/splunk/audit/db/db_1384412390_1384412390_623/rawdata/ 
0  slicesv2.dat

And here's another directory that Splunk doesn't complain about:

root@css-prod-back:/opt/splunk/bin# ls /opt/splunk/var/lib/splunk/audit/db/db_1351635002_1351550304_461/rawdata/ 
journal.gz    slicemin.dat  slices.dat    slicesv2.dat

It seems to be the same problem as here: http://answers.splunk.com/answers/108806/splunkd-keeps-on-crashing-crashing-thread-archivereader?pag...
And unfortunately it's not the square bracket case described here: http://answers.splunk.com/answers/110135/splunk-crashing

So how do we rebuild this "missing" manifest file (if that really is the cause)?
What is its exact name?
Why could this happen?

Thanks in advance for help, it is vital for us.

lukejadamec · ‎11-15-2013

To rebuild the bucket manifest file, you can stop splunkd, delete the file (.bucketmanifest located in the db directory), and restart splunkd.

BUT, it looks like you may have a bigger problem. According to your post, the db_1384412390_1384412390_623/rawdata/ bucket does not contain a journal.gz. The journal file is the one that contains the raw data. Without that file, the bucket is basically empty.

From the journal.gz you can rebuild (not just repair) a bucket with the rebuild command: remove everything except the journal.gz and run rebuild on the bucket. Again, without the journal.gz file there is nothing to rebuild. Try to find the missing journal.gz file. In the meantime, you can move the bad buckets out of the db directory so that Splunk starts without errors.
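
To see which buckets are in that state before moving anything, something like the following could help. It is only a sketch: the function name find_buckets_missing_journal is made up for illustration, and it assumes the buckets of interest are directories named db_* directly under the index's db directory, as in the listings in the question. It only prints paths and changes nothing on disk.

```shell
# Hypothetical helper (not a Splunk command): print every db_* bucket
# directory under the given db directory that has no rawdata/journal.gz.
# The body runs in a subshell so no variables leak into the caller.
find_buckets_missing_journal() (
  db_dir=${1%/}
  for bucket in "$db_dir"/db_*; do
    # Skip non-directories (and the unexpanded pattern if nothing matches).
    [ -d "$bucket" ] || continue
    if [ ! -f "$bucket/rawdata/journal.gz" ]; then
      printf '%s\n' "$bucket"
    fi
  done
)

# Example call, using the audit index path from the crash log:
#   find_buckets_missing_journal /opt/splunk/var/lib/splunk/audit/db
```

Each path it prints is a candidate for moving out of the db directory while Splunk is stopped.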

UPDATE - my bucket rebuild procedure

If you have the journal.gz file and repair fails, what I have done in the past is rebuild the buckets one at a time:

In Windows explorer go to the db folder for the index that contains the bucket. For the main database (default) the db directory is located here:

…splunk/var/lib/splunk/defaultdb/db

Inside that folder you will see the buckets. The last number in each bucket's directory name is its sequence number. The first number is the epoch time of the newest event in the bucket, and the second number is the epoch time of the oldest event.
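
If it helps to read those names, here is a small sketch that splits a bucket directory name into its parts. The function name describe_bucket is invented for this example, and it assumes the db_<newest>_<oldest>_<id> layout, which matches the example bucket used below (its first number is larger than its second).

```shell
# Hypothetical helper: split a bucket directory name of the form
# db_<newest epoch>_<oldest epoch>_<sequence id> into its three fields.
describe_bucket() (
  name=${1%/}          # drop a trailing slash, if any
  name=${name##*/}     # keep only the last path component
  rest=${name#db_}     # strip the "db_" prefix
  newest=${rest%%_*}   # first field
  rest=${rest#*_}
  oldest=${rest%%_*}   # second field
  id=${rest#*_}        # third field
  printf 'id=%s newest=%s oldest=%s\n' "$id" "$newest" "$oldest"
)

describe_bucket db_1368133248_1366849789_398
# prints: id=398 newest=1368133248 oldest=1366849789

# With GNU date, an epoch value can be made readable, for example:
#   date -u -d @1368133248
```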

Create a new directory in the db directory called repair_db.

Stop splunk from the command line with …splunk/bin/splunk.exe stop

Move the bad bucket to the repair_db directory.

Open the bucket and delete all files except “journal.gz”, but be sure not to delete the rawdata folder that contains the journal.gz

From the command line enter the following to rebuild the bucket. In this example we are rebuilding bucket “db_1368133248_1366849789_398”:

…splunk/bin/splunk.exe rebuild /full-path-to-the-bucket/db_1368133248_1366849789_398

When the rebuild is complete you can move the bucket back to the original db directory, and restart splunk.

Note: once the bucket is moved to the repair_db folder splunk can be restarted. It will not attempt to access the repair_db folder, so all data in that folder’s buckets will not be available to splunk. Buckets may be returned to the db folder while splunk is running but it will take some time for splunk to see them, so it is best to restart splunk after they have been returned to the db folder.
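
For a Linux install like the one in the question, the steps above could be sketched roughly as follows. Treat it as an outline under assumptions, not a tested tool: the function names strip_bucket_to_journal and rebuild_one_bucket are made up, the path to the splunk binary and the db directory are passed in by the caller, and the only Splunk commands used are stop, rebuild and start as described above. Try it on a copy of a bucket first.

```shell
# Hypothetical helper: delete every regular file in a bucket that is not
# named journal.gz. Directories (including rawdata) are left in place.
# Refuses to touch the bucket if rawdata/journal.gz is missing.
strip_bucket_to_journal() (
  bucket=${1%/}
  if [ ! -f "$bucket/rawdata/journal.gz" ]; then
    echo "no rawdata/journal.gz in $bucket, nothing to rebuild from" >&2
    exit 1
  fi
  find "$bucket" -type f ! -name journal.gz -exec rm -f {} +
)

# Hypothetical wrapper for the whole procedure.
#   $1 = path to the splunk binary
#   $2 = the index's db directory
#   $3 = bucket directory name, e.g. db_1368133248_1366849789_398
# The && chain stops at the first failing step; note that Splunk stays
# stopped in that case and has to be started by hand.
rebuild_one_bucket() (
  splunk=$1
  db_dir=${2%/}
  name=$3
  repair="$db_dir/repair_db"
  "$splunk" stop &&
  mkdir -p "$repair" &&
  mv "$db_dir/$name" "$repair/" &&
  strip_bucket_to_journal "$repair/$name" &&
  "$splunk" rebuild "$repair/$name" &&
  mv "$repair/$name" "$db_dir/" &&
  "$splunk" start
)

# Example call (paths taken from the question; adjust to your install):
#   rebuild_one_bucket /opt/splunk/bin/splunk \
#       /opt/splunk/var/lib/splunk/audit/db db_1351635002_1351550304_461
```

Stopping and starting Splunk for every bucket is slow when many buckets are affected; in that case it may be simpler to stop once, run only the move, strip and rebuild steps per bucket, and start again at the end.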

iKate · ‎12-20-2013

It turned out that this was a known issue, fixed in 6.0.1:

"Upgrading causes crash in "Crashing Thread: archivereader" (SPL-74873)"

lukejadamec · ‎11-21-2013

I updated the answer with my rebuild procedure. Good luck.
If it fails, I'd be curious to know what errors you get.
Lastly, if the journal.gz is truly corrupt, then according to what I've heard from Splunk support there is no way to fix it. The data would need to be reindexed.

iKate · ‎11-21-2013

@lukejadamec thank you for the answer! We've tried your radical surgery of removing the ill buckets and it works. At least there have been no more crashes.
Nevertheless, are you aware of a way to restore journal.gz? We've found no missing journals, and all in all there were 42 corrupted buckets.
