Backup Index 'rawdata' only (exclude 'index files')

carsonl
Explorer

Hi All,

If I wanted to back up only the rawdata and exclude the 'index files', is it as easy as excluding *.tsidx, or do I need to do more?

I'm assuming that when you restore it, Splunk will go "oh, I don't have those 'index files', let me rebuild them for you". (If this isn't automatic and I need to issue a command, that is fine, just tell me what to do! I figure it would be automatic, as index replication handles the creation of 'index files' by itself.)

Some context:
Our backup guy is telling me my Splunk systems are the largest users of capacity, so I'm seeing what I can do to reduce the backup size. If there is nothing, so be it, but I'd like to know my options.

I have a clustered environment running Splunk 5.0.4 (4 indexers with rep and search factor of 4), so the chance of a restore being required is very low, but we obviously still need backups.

I am happy to accept the delay of service restoration while Splunk rebuilds the 'index files'.

It sounds like it is possible, as hinted at under: http://docs.splunk.com/Documentation/Splunk/5.0.4/Indexer/Backupindexeddata

From the above link "Another thing to consider when designing a cluster backup script is whether you want to back up just the bucket's rawdata or both its rawdata and index files. If the latter, the script must also identify a searchable copy of each bucket."

Thanks,

Carson.

gkanapathy
Splunk Employee

The minimum you need to back up in order to restore and rebuild your data is the index/db*/rawdata/journal.gz files and the contents of the index/db*/rawdata/deletes/ directories. Other data, including the tsidx files, can be reconstructed from this, though it will take time and CPU to do so.

You should note that with a "rep factor" higher than the "search factor", the extra (non-searchable) copies will simply keep only these minimal files as well.
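
As a rough illustration of the file selection described above, here is a minimal Python sketch that lists only journal.gz and the contents of deletes/ for each bucket. The function name and the assumption that you pass it a directory whose children are the db* bucket directories are mine, not anything Splunk ships, so check the layout on your own indexers before relying on it.

```python
import glob
import os


def minimal_backup_files(index_db_dir):
    """Return the rawdata files to back up for every db* bucket.

    index_db_dir is assumed (hypothetically) to be a directory that
    directly contains the db* bucket directories.
    """
    selected = []
    for bucket in sorted(glob.glob(os.path.join(index_db_dir, "db*"))):
        rawdata = os.path.join(bucket, "rawdata")
        journal = os.path.join(rawdata, "journal.gz")
        if os.path.isfile(journal):
            selected.append(journal)
        # Everything under rawdata/deletes/ is part of the minimum set too.
        deletes = os.path.join(rawdata, "deletes")
        for root, _dirs, files in os.walk(deletes):
            for name in sorted(files):
                selected.append(os.path.join(root, name))
    return selected
```

The returned list can be fed to whatever backup tool you use as an include list; everything else in the bucket (tsidx files, *.data, bloomfilter and so on) is simply never selected.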

In addition however to the tsidx files, which can be rebuilt by issuing an index rebuild command, you could also

carsonl
Explorer

Yeah, aware of that. It is even, with 2 in each DC, so 3 could be okay for me, but for completeness' sake I've chosen 4.

gkanapathy
Splunk Employee

Hopefully you're aware that you can only be guaranteed 2 searchable copies at each of 2 sites if you have exactly 2 indexer nodes in the cluster at each site, since Splunk replication in the current version is not site-aware. If you have 3 or more nodes at one site, it is possible for 3 or more copies to be at the same site.
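
To make the counting argument concrete, here is a small Python sketch that enumerates every way copies could land on nodes and reports the worst case per site. It assumes each copy sits on a distinct node and that placement is otherwise arbitrary (which is how I am reading "not site-aware"); the function name and the DC1/DC2 labels are made up for illustration.

```python
from itertools import combinations


def min_copies_per_site(nodes_per_site, copies):
    """Worst-case number of copies at each site over all placements.

    nodes_per_site maps a site label to its indexer count, e.g.
    {"DC1": 2, "DC2": 2}. Assumes one copy per node at most.
    """
    nodes = [(site, i) for site, n in nodes_per_site.items() for i in range(n)]
    if copies > len(nodes):
        raise ValueError("more copies than nodes")
    worst = {site: copies for site in nodes_per_site}
    for placement in combinations(nodes, copies):
        for site in nodes_per_site:
            count = sum(1 for s, _ in placement if s == site)
            worst[site] = min(worst[site], count)
    return worst


print(min_copies_per_site({"DC1": 2, "DC2": 2}, 4))  # {'DC1': 2, 'DC2': 2}
print(min_copies_per_site({"DC1": 3, "DC2": 2}, 4))  # {'DC1': 2, 'DC2': 1}
```

With 2 nodes per site and 4 copies, every node must hold a copy, so 2 per site is forced. Add a third node at one site and a placement exists that leaves the other site with only 1 copy.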

carsonl
Explorer

Against Splunk's advice, I'm doing replication across the WAN (my WAN link is 600Mbps with ~25ms latency, hence going against their advice). I wanted to have 2 searchable copies in each DC so that everything is okay if there is a link outage and a server failure at the same time.

You're right, I could probably drop the search/rep factor to 3 and still be okay, but disk and processing are still cheap compared to downtime.

carsonl
Explorer

Perfect, thank you.

gkanapathy
Splunk Employee

If you restore back to a cluster that needs to recreate its search factor, then the index files should get rebuilt automatically. But if you restore to a standalone node, you need to execute a rebuild on each bucket. The extra files should not cause any problems.
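
For the standalone case, a sketch along these lines could go into restore documentation. It finds restored buckets that have a journal.gz but no *.tsidx files and builds one `splunk rebuild <bucket directory>` command per bucket. The helper names, the default binary path, and the assumption that the directory passed in directly contains the db* buckets are all hypothetical; check the rebuild syntax against the docs for your Splunk version. The code only builds the commands and does not run them.

```python
import glob
import os


def buckets_missing_tsidx(index_db_dir):
    """Return db* bucket directories that have rawdata but no tsidx files."""
    missing = []
    for bucket in sorted(glob.glob(os.path.join(index_db_dir, "db*"))):
        has_journal = os.path.isfile(os.path.join(bucket, "rawdata", "journal.gz"))
        has_tsidx = bool(glob.glob(os.path.join(bucket, "*.tsidx")))
        if has_journal and not has_tsidx:
            missing.append(bucket)
    return missing


def rebuild_commands(index_db_dir, splunk_bin="/opt/splunk/bin/splunk"):
    """Build (but do not execute) one rebuild command per bucket.

    splunk_bin is an assumed install path; adjust it for your system.
    """
    return [[splunk_bin, "rebuild", bucket]
            for bucket in buckets_missing_tsidx(index_db_dir)]
```

Each returned list can be printed for review or passed to subprocess.run once you are happy with it.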

carsonl
Explorer

To be clear... excluding *.tsidx will result in those files being recreated... Is that automatic, or only when the rebuild command is run? (So I can update my restore documentation)

Also, it would be much more reliable to exclude *.tsidx using the backup agent... leaving the other files won't cause any problems? (Other files being: bloomfilter bucket_info.csv Hosts.data merged_lexicon.lex optimize.result Sources.data SourceTypes.data splunk-autogen-params.dat Strings.data)

rturk
Builder

Is there any reason in particular you want/need an index replication AND a search factor of 4? That seems a bit on the excessive side, and there may be more efficient ways to give you the redundancy/resiliency you're after (while keeping storage volumes down).

Just thought I'd get some more info before I provide a (possible) answer 🙂
