Getting Data In

Sourcetype when input is mixture of text files and .gz files

swatishs
Explorer

I am pointing Splunk at a directory to index. The directory contains both plain-text log files and gzipped log files (.gz). The gzipped files are older logs that were compressed to save space.
While indexing, splunkd.log shows many warnings like "Breaking event because limit of 256 has been exceeded - data_source=<.gz file name>". This lowers the overall indexing rate, because the parsing stage takes longer than expected.
How can I mitigate this? Is any extra configuration needed in props.conf to support heterogeneous input types?


nickhills
Ultra Champion

In an ideal world you don't want to index rolled .gz logs.
Your archived gz files contain the historic logs that have rolled over, but in a working system those log files will already have been indexed when they were 'new' and being written to 'your_log.log'.
Indexing the gz files as well will likely give you duplicates.
Your monitor stanza should therefore be specific enough to index only the .log files and not the .gz versions (or you can blacklist them).
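
A minimal sketch of such a monitor stanza in inputs.conf follows. The directory /var/log/myapp, the sourcetype name myapp_log, and the file pattern are placeholders, not values from this thread; adjust them to your environment.

```ini
# inputs.conf (sketch; path and sourcetype name are assumptions)
[monitor:///var/log/myapp]
# Skip compressed, rolled-over files. The regex is matched against the full file path.
blacklist = \.gz$
# Set the sourcetype explicitly instead of relying on automatic detection.
sourcetype = myapp_log
```

Alternatively, whitelist = \.log$ would restrict the input to the live .log files only.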

There is however a caveat: when you first install Splunk!
It's quite conceivable that you want to import your old archived logs at that point. You could configure a sourcetype for the gz files, or simply extract the original logs and let Splunk ingest those.
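
If you go the sourcetype route, one way to make sure the archives get the same sourcetype as the live logs is a source-based stanza in props.conf. This is a sketch only; the path and the sourcetype name myapp_log are assumed for illustration.

```ini
# props.conf (sketch; path and sourcetype name are assumptions)
# Assign the archived files the same sourcetype as the live logs.
[source::/var/log/myapp/*.gz]
sourcetype = myapp_log
```

Setting sourcetype directly on the monitor stanza in inputs.conf should have the same effect for every file under that input.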

If my comment helps, please give it a thumbs up!

swatishs
Explorer

Okay. But my concern is: why isn't Splunk recognizing the GZ files among the TXT files and indexing them accordingly? The Splunk documentation says that GZ files are supported as well.


nickhills
Ultra Champion

As long as the files inside the gz archives are in the same format (and your event breaking is perfect on the text .log files), you shouldn't have any issues indexing archived logs with the same sourcetype.
Splunk will not do this multi-threaded, so archived files will take significantly longer to index than flat text (which would index in parallel).
If you are receiving breaking warnings on the gz files, you are probably getting them on the .log files too; you just may not have noticed.


swatishs
Explorer

I cross-checked that. The warning appears only for the GZ files. Do I need to specify a separate stanza in props.conf to handle GZ files among the TXT files? And is that possible?


nickhills
Ultra Champion

It's odd; that shouldn't be necessary if they are detected as the same sourcetype. I presume they are indexed with the same sourcetype, just a different source, and that your props are applied to the sourcetype and not the source?
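
For reference, the "limit of 256" in the warning corresponds to the MAX_EVENTS setting in props.conf (default 256), which caps how many lines Splunk merges into a single event. The warning usually means line merging is not finding event boundaries in that file. A hedged sketch of a sourcetype stanza that sidesteps line merging, assuming single-line events and a hypothetical sourcetype name myapp_log:

```ini
# props.conf (sketch; sourcetype name is an assumption)
[myapp_log]
# Treat every line as its own event, so no multi-line merging takes place.
SHOULD_LINEMERGE = false
# Break events on one or more newline characters.
LINE_BREAKER = ([\r\n]+)
```

If the events really are multi-line, keep line merging on and define the boundary explicitly instead, for example with BREAK_ONLY_BEFORE.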


lakshman239
SplunkTrust

Ideally, it is good practice to store gz/archived files in a different folder or disk mount. If that cannot be done and you do not want to index the gz files, add a blacklist for .gz files to your input stanza. That way only recent files are indexed, which improves the indexing rate and avoids possible duplicate events.


swatishs
Explorer

It's a directory I added for indexing, which already contains both recent and rolled-over logs. Since Splunk hadn't indexed any data yet, it won't cause duplicates.
But my question is: how do we index data that is a mixture of different file types? Is segregation necessary? And do we need to add a stanza to handle a directory containing gz files?
