Knowledge Management

summarizing _raw data in an index to reduce index size

sonicZ
Contributor

Our company has been gathering auditd logs since last summer, and our Splunk infrastructure is getting very fat on the indexed auditd data. I can't delete this data either, since we require it for audits.
The solution I was coming up with was to start summarizing the _raw data, based on some previous examples I've seen:

index=audit | dedup _raw | rename _raw as orig_raw

Then I would verify the summarized results against the indexed results and start expiring data off colddb sooner than we do now.
Is there a better solution out there? The main goal is to reduce index disk usage.
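
For illustration only, here's a rough sketch of the kind of summary I had in mind, rolling the raw events up with stats and writing them to a hypothetical summary index called audit_summary with collect (the index name and field names are just placeholders):

index=audit | stats count AS event_count BY host, type, auid | collect index=audit_summary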

1 Solution

lguinn2
Legend

Splunk is already compressing the raw data. If your main goal is to reduce disk usage, then my first question is: must the data always be searchable? Or is it simply a requirement that the data be retrievable if needed?

If you specify a cold-to-frozen directory and a shorter lifetime, Splunk will move "expired" buckets into the frozen directory. In the frozen directory, the buckets will be approximately 30% of their former size - because most of the index info is stripped away. Most folks then store the frozen buckets offline, but you don't have to.
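
For example, a minimal indexes.conf sketch of that approach might look like the following - the index name, retention period, and archive path are placeholders to adapt, not a recommendation:

[audit]
# expire buckets roughly 90 days after their newest event (value is in seconds)
frozenTimePeriodInSecs = 7776000
# archive expired buckets here instead of deleting them
coldToFrozenDir = /archive/splunk/frozen/audit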

However, frozen buckets are not searchable; you have to rebuild a bucket to use its contents. But if the data is very rarely searched and really just kept for compliance, this could be a good solution.
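
If you do need to bring a frozen bucket back, the usual pattern is to copy it into the index's thaweddb directory and rebuild it - roughly like this, with example paths and an example bucket name rather than anything prescriptive:

cp -r /archive/splunk/frozen/audit/db_1399000000_1390000000_10 $SPLUNK_HOME/var/lib/splunk/audit/thaweddb/
splunk rebuild $SPLUNK_HOME/var/lib/splunk/audit/thaweddb/db_1399000000_1390000000_10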

I don't think that dedup is going to help you unless you truly have exact duplicates of a lot of your data.


lguinn2
Legend

You shouldn't need the coldToFrozenScript. Just make sure that the "Frozen archive path" is set to a real directory. Splunk will automatically strip off everything it can when it puts the compressed data into that directory.

In indexes.conf, the frozen archive path is set like this:

coldToFrozenDir = <path to frozen archive>

Note that the path cannot contain a volume reference.
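
As a sketch of what that means in practice (index name and paths are placeholders): homePath and coldPath may use volume references, but coldToFrozenDir must be a literal path.

[audit]
homePath = volume:hot/audit/db
coldPath = volume:cold/audit/colddb
thawedPath = $SPLUNK_HOME/var/lib/splunk/audit/thaweddb
# must be a literal path, not something like volume:frozen/audit
coldToFrozenDir = /archive/splunk/frozen/audit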


sonicZ
Contributor

The data does not need to be searchable; retrievable upon request would work for us.
I've always used cold-to-frozen as our delete mechanism, so I suppose I'll have to use the coldToFrozenScript.

Is the default $SPLUNK_HOME/bin/coldToFrozenExample.py the script that will convert buckets to 30% of their normal size?
