Knowledge Management

summarizing _raw data in an index to reduce index size

sonicZ
Contributor

Our company has been collecting auditd logs since last summer, and our Splunk infrastructure is now consuming a lot of disk for the indexed auditd data. I can't delete this data, either, since we require it for audits.
The solution I came up with was to start summarizing the _raw data, based on some examples I've seen previously:

index=audit | dedup _raw | rename _raw as orig_raw

I would then verify the summarized results against the indexed results and expire data off colddb sooner than it expires now.
Is there a better solution out there? The main goal is to reduce index disk usage.
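
For reference, a fuller version of that kind of summary-indexing search would typically roll the raw events up into a smaller summary index with collect; the audit_summary index name and the field list below are only placeholders for illustration:

index=audit earliest=-1d@d latest=@d
| stats count by host, type, exe
| collect index=audit_summary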

1 Solution

lguinn2
Legend

Splunk is already compressing the raw data. If your main goal is to reduce disk usage, then my first question is: must the data be always searchable? Or is it simply a requirement that the data must be retrievable if needed?

If you specify a cold-to-frozen directory and a shorter lifetime, Splunk will move "expired" buckets into the frozen directory. In the frozen directory, the buckets will be approximately 30% of their former size - because most of the index info is stripped away. Most folks then store the frozen buckets offline, but you don't have to.

However, frozen buckets are not searchable; you have to rebuild a bucket before you can use its contents. But if the data is very rarely searched and really just kept for compliance, this could be a good solution.

I don't think that dedup is going to help you unless you truly have exact duplicates of a lot of your data.
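
For illustration, if frozen data ever does need to be searched again, the usual approach is to copy the archived bucket into the index's thaweddb directory and rebuild it; the archive path and bucket name below are only examples:

# copy the archived bucket back into the index's thaweddb directory
cp -r /opt/frozen_archive/audit/db_1388534400_1385856000_42 $SPLUNK_HOME/var/lib/splunk/audit/thaweddb/
# rebuild the bucket's index and metadata files so it becomes searchable again
splunk rebuild $SPLUNK_HOME/var/lib/splunk/audit/thaweddb/db_1388534400_1385856000_42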

lguinn2
Legend

You shouldn't need the coldToFrozenScript. Just make sure that the "Frozen archive path" is set to a real directory. Splunk will automatically strip off everything it can when it puts the compressed data into that directory.

In indexes.conf the frozen archive path is set like this:

coldToFrozenDir = <path to frozen archive>

Note that the path cannot contain a volume reference.
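
A minimal per-index stanza along those lines might look like the following; the index name, archive path, and retention period are only examples (frozenTimePeriodInSecs controls when buckets roll from cold to frozen):

[audit]
homePath   = $SPLUNK_DB/audit/db
coldPath   = $SPLUNK_DB/audit/colddb
thawedPath = $SPLUNK_DB/audit/thaweddb
# archive expired buckets into this directory instead of deleting them
coldToFrozenDir = /opt/frozen_archive/audit
# roll buckets to frozen after roughly 180 days (value is in seconds)
frozenTimePeriodInSecs = 15552000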


sonicZ
Contributor

The data does not need to be searchable; retrievable upon request would work for us.
I've always used cold-to-frozen as our delete mechanism, so I suppose I'll have to use the coldToFrozenScript.

Is the default $SPLUNK_HOME/bin/coldToFrozenExample.py the script that will convert buckets to 30% of their normal size?
