Corrupted bucket journal?

clindseyssi
Engager

Hi Everyone! I hope this isn't a "frequently solved problem." I've searched and googled for answers but I ran into a wall.

First, I started getting this error in Splunk web:

[EventsViewer module] Error in 'databasePartitionPolicy': Failed to read 1 event(s) from rawdata in bucket 'main~35~073974E4-ED0F-432A-8DF5-3AB3DE83D4ED'. Rawdata may be corrupt, see search.log

Hmmmm. So I googled and found a link in the Answers forum that told me how to run fsck against the bucket. I did, and here is the result:

$ sudo /Applications/splunk/bin/splunk stop
$ sudo /Applications/splunk/bin/splunk fsck --all
bucket=/Applications/splunk/var/lib/splunk/audit/db/db_1360792166_1360340101_24 NEEDS REPAIR: count mismatch tsidx=0 slices.dat=6088
bucket=/Applications/splunk/var/lib/splunk/defaultdb/db/db_1360792158_1359732196_28 NEEDS REPAIR: count mismatch tsidx=36837 slices.dat=38544

SUMMARY: We have detected 2 buckets (877515 bytes of compressed rawdata) need rebuilding.
    Depending on the speed of your server, this may take from 0 to 1 minutes.  You can use the --repair option to fix

So I added the --repair switch. And this is that result:

$ sudo /Applications/splunk/bin/splunk fsck --all --repair
bucket=/Applications/splunk/var/lib/splunk/_internaldb/db/db_1364229909_1363960207_40 count mismatch tsidx=524223 source-metadata=524228, repairing...
    bucket=/Applications/splunk/var/lib/splunk/_internaldb/db/db_1364229909_1363960207_40 rebuild failed: caught exception while rebuilding: Error reading compressed journal while streaming: bad gzip header, provider=/Applications/splunk/var/lib/splunk/_internaldb/db/db_1364229909_1363960207_40/rawdata/journal.gz

I searched the forum and Google for next steps but didn't find anything useful. Has anyone else seen something like this? Were you able to resolve it?

Any help, as always, is appreciated.

Sevjer13
New Member

Disabling and re-enabling the index does not seem to work on clusters. The only way I found to do it is to "move" the index. I do not like this, as it means we are losing an unknown amount of data. Possible exploit here?

anwarmian
Communicator

In a cluster environment, you should already have a copy of the searchable bucket on another indexer, provided your search factor (SF) is at least 2. The procedure I use is (see the command sketch after this list):

  1. Enable indexer cluster maintenance mode.
  2. Stop the indexer in question.
  3. Either (a) move the broken journal file away to another place (while splunkd is stopped), or (b) delete the bucket.
  4. Start the indexer in question.
  5. Disable indexer cluster maintenance mode.
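As a rough sketch of that sequence on the CLI, assuming maintenance mode is toggled from the cluster master and using a placeholder bucket path (not the actual bucket from this thread):

$ $SPLUNK_HOME/bin/splunk enable maintenance-mode     # run on the cluster master
$ $SPLUNK_HOME/bin/splunk stop                        # run on the affected indexer
$ mv $SPLUNK_HOME/var/lib/splunk/yourindex/db/db_<newest>_<oldest>_<id>/rawdata/journal.gz /some/backup/location/
$ $SPLUNK_HOME/bin/splunk start                       # run on the affected indexer
$ $SPLUNK_HOME/bin/splunk disable maintenance-mode    # run on the cluster master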

Sevjer13
New Member

Note that this does not work on clusters; the only fix I found was to stop Splunk and move the file away. I do not like that, as it means you're losing data.

lmyrefelt
Builder

Hi, I would just like to confirm that MikaelSandquist's solution works 🙂

This is what you would like to do (see the command sketch after this list):
1. Download the search.log (via the job inspector) from the node that fails / that has the corrupted journal / rawdata.
2. Locate the bucket that is corrupt.
3. Stop Splunk on that node.
4. Run splunk cmd splunkd fsck --all --repair
5. Run splunk cmd splunkd rebuild /path/to/Your/failed/db/bucket (found in search.log)
6. Run splunk disable index "nameOfIndex"
7. Run splunk enable index "nameOfIndex"
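As a minimal sketch of those steps, assuming a Linux install under /opt/splunk and keeping the placeholder bucket path and index name from the list (substitute the values from your own search.log). Splunk typically needs to be running again before the disable/enable index steps:

$ /opt/splunk/bin/splunk stop
$ /opt/splunk/bin/splunk cmd splunkd fsck --all --repair
$ /opt/splunk/bin/splunk cmd splunkd rebuild /path/to/Your/failed/db/bucket
$ /opt/splunk/bin/splunk start
$ /opt/splunk/bin/splunk disable index "nameOfIndex"
$ /opt/splunk/bin/splunk enable index "nameOfIndex"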

In my case both the rebuild and the repair failed to correct the issue; however, disabling and enabling the index seems to have solved it.

It seems Splunk re-creates the journal file? Or does it just roll it?

Hope this will help 🙂

mikaelsandquist
Explorer

I solved it by disabling the index that had a damaged journal file from the CLI:

/opt/splunk/bin/splunk disable index name_of_your_index

I then started Splunk, enabled the index from the web GUI, and restarted Splunk to see if it started OK without errors. It looks like Splunk removed the broken journal file during that process.

Another suggestion that I got from Splunk Support was to just move the broken journal file away to another place (while splunkd is turned off) and then start Splunk.
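If you go that route, a minimal sketch (with a placeholder bucket path, not the actual bucket from this thread) would be something like:

$ /opt/splunk/bin/splunk stop
$ mv /opt/splunk/var/lib/splunk/yourindex/db/db_<newest>_<oldest>_<id>/rawdata/journal.gz /some/backup/location/
$ /opt/splunk/bin/splunk start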

lukejadamec
Super Champion

The --repair routine runs behind the scenes automatically. It does not repair everything at startup; it does so gradually over time. You can run the repair routine manually, but that has never seemed to work for me. I prefer to rebuild 'bad' buckets. Also, if the journal is truly corrupt, then it cannot be repaired; Splunk cannot manipulate the journal data. See the troubleshooting section at the bottom of this doc: http://docs.splunk.com/Documentation/Splunk/6.0.1/Indexer/HowSplunkstoresindexes
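For reference, rebuilding a single bucket can be done with the splunk rebuild command; a minimal sketch, using a placeholder bucket path and stopping Splunk first as in the earlier posts:

$ /opt/splunk/bin/splunk stop
$ /opt/splunk/bin/splunk rebuild /opt/splunk/var/lib/splunk/defaultdb/db/<bucket_directory>
$ /opt/splunk/bin/splunk start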

campbellj1977
Explorer

Is fsck supposed to run automatically when Splunk is restarted? I am guessing that the restart alone did not work for you?

I am having the same problem, but the service restart did not run fsck --repair.

mikaelsandquist
Explorer

I have encountered the same problem today.

clindseyssi
Engager

Hi, I thought I'd give this a bump and see if anyone had any thoughts on this.

Thanks!

Craig
