
Why is the Splunk compression rate at 500% for 6GB of CSV data?

pduflot
Path Finder

Hi,

I have indexed 6GB of CSV data in Splunk. When I look at the compression rate using this search:

| dbinspect index=sca_rs_index2
| fields state, id, rawSize, sizeOnDiskMB
| stats sum(rawSize) AS rawTotal, sum(sizeOnDiskMB) AS diskTotalinMB
| eval rawTotalinMB=(rawTotal / 1024 / 1024) | fields - rawTotal
| eval compression=tostring(round(diskTotalinMB / rawTotalinMB * 100, 2)) + "%"
| table rawTotalinMB, diskTotalinMB, compression

It gives me 500%.
Indeed, when I execute du -h /opt/splunk/var/lib/splunk/sca_rs_index2, I get 25GB.
The CSV files contain mainly float data, so I can understand that the compression ratio would not be very good, but I would not expect the index to be five times larger than the raw data. Here is a sample line from the indexed files:

156,Jun-25-2015 03:53:56:765 PM (CEST),4.46,4.24,-33.79,-36.85,2.9007191883900015E-9,2.9265995391399987E-9,7.07,12.803,2.184,2.128,1.7805797178279998E-9,1.8405463108899995E-9,-0.8554831,-1.4905629,-0.23367512,-0.04267813,-0.85057795,-1.4899217,-0.24520056,-0.03800104,2.9247010642799994E-9,2.91267178384E-9,1.3790839E-7,1.36082E-7,0.38341087,97,156,2.9231726E-9,0.02,2.8500933E-9,2.996252E-9,2.760363E-9,3.0509275E-9,2.6058726E-9,3.1849554E-9,1.3411523E-9,2.2089566E-9,-1.0E-8,1.9596868E-7,-1.0E-8,1.959884E-7,,,-0.009,0.55262375,-0.9074077,-0.8010149,-1.5971898,-1.379061,-0.5492179,0.022210669,-0.3409165,0.23051207,-0.90580165,-0.79919446,-1.5967877,-1.3784232,-0.5492179,0.022210669,-0.3409165,0.23051207,-35.85059,-31.78982,-39.128872,-34.6483,-2.9167533,7.0832467,-2.962398,7.037602,3.2690098,2.9473321,1.4245242,7.4673486,12.490728347227515

I have 5 accelerated reports on this data, and there is a summary subfolder in /opt/splunk/var/lib/splunk/sca_rs_index2. Does it contain the summary information?
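To see how much of the 25GB the summaries account for, I can compare the subfolder sizes with the same du approach (paths assume the default index layout):

du -sh /opt/splunk/var/lib/splunk/sca_rs_index2/*    # per-subfolder sizes: db, colddb, summary, ...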

What could be improved? How can I investigate this further?

Thanks,

1 Solution

pduflot
Path Finder

As explained in the question, the inbound data size is 6GB and the space occupied on disk is 25GB, so the ratio between the two is around 500% as well.

Replacing INDEXED_EXTRACTIONS = CSV with a search-time DELIMS extraction in transforms.conf significantly improved the situation:
- inbound data size: 6GB
- space occupied on disk: 8.5GB
- ratio of disk space to inbound data: 140%
- ratio between index size and compressed data, as returned by the query: 180%
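For reference, the change looks roughly like this (a sketch, not my exact configuration; the sourcetype name sca_csv and the field names are made up for illustration, a real FIELDS list has to name every column of the CSV in order, and timestamp/line-breaking settings are omitted):

# props.conf -- before: index-time CSV extraction (every field is written into the tsidx files)
[sca_csv]
INDEXED_EXTRACTIONS = CSV

# props.conf -- after: delegate to a search-time delimiter extraction
[sca_csv]
REPORT-csvfields = sca_csv_delims

# transforms.conf -- hypothetical field names, for illustration only
[sca_csv_delims]
DELIMS = ","
FIELDS = "id", "timestamp", "gain1", "gain2", "level1", "level2"

With INDEXED_EXTRACTIONS, every column becomes an indexed field stored in the index files; moving the extraction to search time keeps only the raw events and the keyword index on disk.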




lguinn2
Legend

First, for almost every typical input - Windows event logs, Linux logs, etc. - the majority of the bucket is the keyword index. Assume that you have the following:

- incoming data: 100 MB
- raw data in the bucket averages 15% of incoming data (15 MB)
- index files in the bucket average 35% of incoming data (35 MB)

This is the common example given for "syslog-like" data. If you compare rawdata size to index file size in this example, the result is 35 / 15 ≈ 233% - and that is the same comparison that your search does.

Your data is likely to generate much larger index files because of its contents; as you noted, there seems to be little commonality between "keywords." And using indexed fields makes the index files larger still. So 500% seems entirely realistic to me.

Remember that the documentation, course materials, etc. do not compare the compressed rawdata size of the bucket with the index file size of the bucket - we normally compare the inbound data size with the total bucket size.

So the real compression rate that you should care about is: given X amount of inbound data, how much disk space (rawdata and index files) will Splunk need?
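If you want to measure that directly, a search along these lines can compare licensed inbound volume with bucket size on disk (a sketch, assuming the index still contains everything that was ingested and that license_usage.log in _internal still covers the whole ingestion period - by default it is only retained for about 30 days):

| dbinspect index=sca_rs_index2
| stats sum(sizeOnDiskMB) AS diskTotalinMB
| appendcols
    [ search index=_internal source=*license_usage.log type="Usage" idx="sca_rs_index2"
      | stats sum(b) AS inboundBytes
      | eval inboundMB=round(inboundBytes / 1024 / 1024, 2) ]
| eval ratio=tostring(round(diskTotalinMB / inboundMB * 100, 2)) + "%"
| table inboundMB, diskTotalinMB, ratio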

pduflot
Path Finder

BTW, if I compress the index folder with tar, the size is 1.5GB.


pduflot
Path Finder

mmm... I discovered that INDEXED_EXTRACTIONS = CSV was set for this sourcetype. I need to investigate this.
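To confirm where the setting comes from, btool shows the effective configuration and which file contributes each line (a sketch; <your_sourcetype> is a placeholder for the real sourcetype name):

$SPLUNK_HOME/bin/splunk btool props list <your_sourcetype> --debug | grep INDEXED_EXTRACTIONS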


bmacias84
Champion

That setting can be very, very bad and will explode your index size.
