Indexing compressed vs raw logs

lloydknight
Builder

Hello,

Say, for example, five 50MB sample.log.gz files (250MB total) that decompress into five 600MB sample.log files (3GB total), and these are being indexed per day.

What is the best way to index logs in this case: compressed or just raw logs?
And what's the difference in terms of speed and impact with the default UF thruput?
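
For context, the current input is a plain monitor of the compressed files, something like this (the path, index, and sourcetype below are made-up stand-ins for ours):

[monitor:///var/log/samples/*.log.gz]
index = main
sourcetype = sample_log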

Much appreciated!

1 Solution

woodcock
Esteemed Legend

The archiver (AKA AQ or AEQ) process is single-threaded, so if you go with the compressed option (which I suggest you don't), you will either have horrifically slow forwarding (possibly so slow that you cannot process one file before more than one new one shows up, and you will never catch up) with a very under-utilized CPU, OR you will need to partition the compressed files so that you can stand up multiple Splunk forwarder instances on the UF, each handling its own segregated portion of the workload. A better option is to use the batch input (instead of monitor) with the move_policy = sinkhole setting to delete the uncompressed files as they are forwarded.
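
As a minimal sketch (the staging directory, index, and sourcetype below are placeholders; adjust to your environment):

[batch:///var/log/staging]
move_policy = sinkhole
index = main
sourcetype = sample_log

Keep in mind that a batch input with move_policy = sinkhole deletes each file after it has been read, so only point it at files (or copies) that you can afford to lose.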

lloydknight
Builder

Hello woodcock, so what's the difference between the compressed and non-compressed versions (of the same file) once indexed? Will the compressed one be faster, or vice versa?

Because this is the current setup, and you're right: the indexing cannot catch up anymore. All I can do here is offer recommendations on how to optimize the current setup.

woodcock
Esteemed Legend

There is no difference whatsoever once the data is indexed; there is no change for the Indexer (either on the way in from forwarding or on the way out for searching), nor for the Search Head. There is only a difference on the Forwarder, and it is a huge difference. You can work around it with dozens of Splunk instances running on the forwarder host, but it is a hassle. We have done it for clients before and can help you set it up if you need it, but non-compressed is the way to go.

lloydknight
Builder

So basically, you're saying that the UF is having a haaaaard time monitoring the uncompressed logs, given their size, and sending them to the Indexer? BTW, the setup here is that the UF sends these large files to one HF, which forwards them to the Indexers. So the bottleneck here is the Forwarder itself, right? Aside from replacing monitor with batch, will changing the thruput of the Forwarder help? Thanks
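
For reference, what I had in mind is something like this in limits.conf on the UF (the stanza below just removes the throughput cap entirely; I understand the UF ships with a 256KBps default):

[thruput]
maxKBps = 0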

woodcock
Esteemed Legend

The bottleneck is the process that uncompresses the files. It has one door in, one worker (CPU core), and one door out; it is terrible. It "works", but it is guaranteed to be "too slow" in any enterprise situation. The situation is even worse if you are sending compressed files to an aggregation point instead of forwarding them where they originate! Even if the files were small, you would STILL have this problem, because AQ is single-threaded and does not scale.

lloydknight
Builder

Hello woodcock, that doesn't seem to be the case here. The compressed files were properly indexed, based on the retrieved events, but when the logs were changed to uncompressed files, the indexing seemed to stop: little to no events were being indexed.

woodcock
Esteemed Legend

You have to whitelist the non-.gz filenames, obviously. Fix your inputs.conf so that it refers to the different/new filename endings.
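
Continuing the batch sketch from my earlier answer (the path is a placeholder; whitelist takes a regex matched against the full file path):

[batch:///var/log/staging]
move_policy = sinkhole
whitelist = \.log$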

lloydknight
Builder

Okay, that was trivial, lol. Thank you.
