Re: I have a question about raw data and index dat...

seksit · ‎01-17-2016

Hi friend,

I have a server with Splunk already installed. The server holds many log files (tar.gz) that were imported from another server.

I would like Splunk to monitor these logs by directory, such as /var/log/2016/01 and /var/log/2016/02.

If Splunk monitors the directory, will it store a second copy of the raw data from the log files (so that the raw data exists twice)?

Please help me to understand it.

Thank you

Sorry for my English.

esix_splunk · ‎01-17-2016

Seksit-
I think you should understand how Splunk processes and stores files. That should lead to a better understanding of what's going on and how it relates to your use case.

When you 'monitor' a file or directory, regardless of whether the file is manually copied or generated by an app, Splunk will read the files and index them. The indexing process reads in the 'raw' data and performs various operations such as assigning sourcetypes, placing it in a defined index, and extracting timestamps and hostnames. The indexed data is written to buckets (directories on disk) on the indexers, and associated metadata is created and stored with the buckets. When you search in Splunk, this is what is searched. The raw data is stored in compressed form, so the indexed copy is typically smaller than the original files.

So with that in mind, once you have indexed the monitored files, they can be deleted or rotated out. Of course, you need to check your retention and legal compliance policies before deleting the files.

On another note, compressed files are a sticking point for Splunk. Splunk's unarchiving tool is single threaded, so when Splunk encounters a tar/zip/gzip/tgz file, it has to extract it before it can read it. If you are dealing with a lot of files at once, this will slow down your system and use more memory.
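One way to sidestep the single-threaded unarchiver is to unpack the archives yourself and point the monitor at the extracted files. Below is a minimal Python sketch of that idea (Python 3.9+), assuming the archives are gzip-compressed tarballs. The function name `extract_archives` and the staging directory are hypothetical and not part of Splunk.

```python
import shutil
import tarfile
from pathlib import Path


def extract_archives(archive_dir: str, staging_dir: str) -> list[str]:
    """Unpack every .tar.gz/.tgz under archive_dir into staging_dir.

    Returns the extracted file paths, relative to staging_dir.
    """
    src_root = Path(archive_dir)
    dst = Path(staging_dir).resolve()
    dst.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(src_root.rglob("*")):
        if not archive.is_file() or not archive.name.endswith((".tar.gz", ".tgz")):
            continue
        # Mirror the source layout (e.g. 01/app.tar.gz -> 01/app/) so that
        # archives with the same name in different months do not collide.
        stem = archive.name.removesuffix(".tar.gz").removesuffix(".tgz")
        target = dst / archive.relative_to(src_root).parent / stem
        with tarfile.open(archive, "r:gz") as tar:
            for member in tar.getmembers():
                if not member.isfile():
                    continue  # skip directories, links and devices
                dest = (target / member.name).resolve()
                if target not in dest.parents:
                    continue  # refuse paths that escape the target directory
                dest.parent.mkdir(parents=True, exist_ok=True)
                with tar.extractfile(member) as src, open(dest, "wb") as out:
                    shutil.copyfileobj(src, out)
                extracted.append(str(dest.relative_to(dst)))
    return extracted


# Hypothetical usage: unpack once, then monitor the staging directory instead.
# extract_archives("/var/log/2016", "/var/log/2016_extracted")
```

The staging directory should sit outside the archive directory so the monitor only sees plain log files. Whether this is faster than letting Splunk unpack the files depends on your hardware and the number of archives.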

renjith_nair · ‎01-17-2016

That's my understanding and that's what I was trying to convey to seksit's question as well. The question was not asked by me but seksit 🙂

---
What goes around comes around. If it helps, hit it with Karma 🙂

esix_splunk · ‎01-17-2016

Updated, misread the first comment!

Murali2888 · ‎01-17-2016

Hi seksit,

In your case, Splunk will index the data from the log files (in directories such as /var/log/2016/01 and /var/log/2016/02) and store it, in compressed format, under the Splunk index directory $SPLUNK_HOME/var/lib/splunk/.

In simple terms, this is a copy of the source data, but its size and format are not the same. Splunk stores the data in a series of index files.
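To get a feel for why the stored copy differs in size from the source, here is a rough Python illustration of how well repetitive log text compresses. It uses plain gzip from the standard library as a stand-in and the sample log lines are made up. The actual on-disk size in Splunk also depends on the index files built alongside the compressed raw data, so treat the ratio as illustrative only.

```python
import gzip


def compression_ratio(raw: bytes) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    return len(gzip.compress(raw)) / len(raw)


# Made-up sample: 1000 similar-looking log lines.
sample = b"".join(
    b"2016-01-17 10:00:%02d host1 sshd[1234]: Accepted password for user%d\n"
    % (i % 60, i)
    for i in range(1000)
)

print(f"compressed size is {compression_ratio(sample):.0%} of the original")
```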

For more on how Splunk stores indexes, please refer to http://docs.splunk.com/Documentation/Splunk/6.3.1511/Indexer/HowSplunkstoresindexes

Hope this answers your question to some extent.

renjith_nair · ‎01-17-2016

If you have configured Splunk to monitor a directory, Splunk picks up the files regardless of whether they were copied manually or generated by an app. Splunk checks the first bytes of each file to determine whether it was indexed previously, and otherwise stores its events. If you want to exclude some files in the directory, that's also possible.
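The "check the first bytes" idea can be sketched in a few lines of Python. This is a simplified illustration and not Splunk's actual implementation: the helper names are hypothetical, and the 256-byte length is an assumption chosen to mirror what I understand to be the default `initCrcLength` in inputs.conf.

```python
import zlib

HEAD_BYTES = 256  # assumption: mirrors Splunk's default initCrcLength


def head_checksum(path: str, length: int = HEAD_BYTES) -> int:
    """CRC32 of the first `length` bytes of the file."""
    with open(path, "rb") as f:
        return zlib.crc32(f.read(length))


def new_files(paths, seen: set) -> list:
    """Return the paths whose head checksum is not yet in `seen`.

    Checksums of returned files are added to `seen`, so a second call
    with the same files returns nothing.
    """
    fresh = []
    for path in paths:
        crc = head_checksum(path)
        if crc not in seen:
            seen.add(crc)
            fresh.append(path)
    return fresh
```

One consequence of this kind of check is that two different files starting with the same header bytes look identical to it, so the second one would be skipped.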

renjith_nair · ‎01-17-2016

Sorry, but what do you mean by double raw data? Splunk picks up the files from the directory and indexes them. It won't pick up the same file twice; Splunk checks the first few bytes of a file to see whether it was already indexed.

seksit · ‎01-17-2016

Hi renjith.nair, thank you for your advice.

Those log files were imported manually (copied from an external HDD), not sent by a Splunk forwarder.

If Splunk monitors the directory, will it store the raw data in the Splunk directory?

I have a question about raw data and index data in indexer. Please help me to understand.
