
200GB of data to Splunk

indikaw
Explorer

Hi,

I have a set of 200 GB of compressed data to be added to Splunk from a different server.
I managed to copy the files onto the Splunk server, where they now reside.
The set consists of thousands of files and subfolders.

Could you please let me know how I can add these files to Splunk in one go?
I wouldn't mind doing it in a few sets if I can break it down into, let's say, 500 GB each.
If I use Add Data in Splunk, it only lets me select a single file at a time.

I tried to zip the full 200 GB but failed, and I figured that's not an intelligent way to do this anyway.

Please shed some light on this; any help is appreciated.


lguinn2
Legend

Compressing the files into a single zip will not help. Splunk must unzip the file in order to index it, so it doesn't save you anything.

If you ask Splunk to monitor the directory, it tracks which files have been indexed. While files are being indexed, Splunk tracks its current progress in each file, so that it will start where it left off in case of an interruption. So you don't need to worry about overwrites or duplication.
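
For example (just a sketch; the path below is a placeholder, not a real location), a basic monitor input in $SPLUNK_HOME/etc/system/local/inputs.conf looks like:

[monitor://YOURPATHHERE]
disabled = false

By default, Splunk will recurse into subdirectories under the monitored path.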

There is one other alternative to monitor - and that is a sinkhole. You can't do this via the Splunk GUI, but you can tell Splunk that there is an upload directory. This is also called a batch input. When you move a file into this directory, Splunk indexes the file and then deletes it.

In $SPLUNK_HOME/etc/system/local/inputs.conf

[batch://YOURPATHHERE]
move_policy = sinkhole
host = HHHH
followSymlink = false

where YOURPATHHERE is the absolute path of the "sinkhole" directory. On Linux, this means there will be three slashes in a row: two for batch:// and one for the beginning of the path.

And where HHHH is the host name that you want to give the data in Splunk.
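
As a concrete (hypothetical) example, if the files were sitting in /data/archive on a Linux server and you wanted the events tagged with host archiveserver, the stanza would be:

[batch:///data/archive]
move_policy = sinkhole
host = archiveserver
followSymlink = false

On Windows, the drive path follows the two slashes, for example [batch://C:\data\archive].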

Another good thing about this technique is that you can index a few files per day, in order to stay under your license limit. Each day, just move the files that you want to index into the directory, and Splunk will index and then delete them.

More info here: http://docs.splunk.com/Documentation/Splunk/latest/Admin/Inputsconf

indikaw
Explorer

Hi,

I implemented the sinkhole policy. It started well, but it never cleared the folder after indexing, and as far as I could see it kept indexing the same logs over and over again; my disk space ran out and indexing got paused. Why didn't it remove the files from the sinkhole directory after indexing? That is what caused this issue. It is also how I would get a good idea of how much more there is to index, or whether it has completed.


indikaw
Explorer

Hi,

About the sinkhole directory, you mentioned that on Linux there will be 3 slashes. I am using Windows, so how should it look? For example, if my log files are in C:\Windows\Log, can I use batch=c:\Windows\Log as the sinkhole directory path? Is that correct?
Also, can I append the sinkhole stanza you mentioned above to inputs.conf? I can see there are some other stanzas already in inputs.conf.
Also, after that do I have to do a full Splunk restart?


lguinn2
Legend

Splunk can normally index about 100 GB per day, depending on your configuration. But you can always search the data that has already been indexed - that will give you some indication of how far along Splunk has gotten. Just do a search; you may have to specify index=yourindexname in your search.
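
For example (assuming your index is named yourindexname), a quick progress check is something like:

index=yourindexname | stats count by source

which shows how many events have been indexed from each file so far.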

Also, there is a log file - splunkd.log - that will contain any error messages if Splunk has problems indexing.
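
You can also search that log from within Splunk itself, for example:

index=_internal source=*splunkd.log* log_level=ERROR

to surface any errors splunkd has logged while indexing.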


lguinn2
Legend

Go to Manager » Data inputs » Files & directories

Click the New button. Click Skip Preview and Continue

Select the first option: "Continuously index data from a file or directory this Splunk instance can access"

In the box beneath "Full path to your data", enter the path starting from the drive. For example: C:\temp\myDirectory

Click More Settings

Under Index, use the dropdown to select the specific index where this data should go
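
If you would rather do the same thing in configuration instead of the GUI, the equivalent stanza (the path and index name below are placeholders) in $SPLUNK_HOME/etc/system/local/inputs.conf would be roughly:

[monitor://C:\temp\myDirectory]
disabled = false
index = yourindexname

A Splunk restart is normally needed after editing inputs.conf by hand.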


indikaw
Explorer

I have been trying this all morning. Even when I go to Add Data and create a new input to monitor the folder, it does not let me select the folder; it only lets me select a single file inside the folder.
I'm a bit confused about how to get this folder monitored and the data indexed using the GUI.


indikaw
Explorer

Also, I am running Splunk on Windows, not on Linux.
Is there any way I can force this data into a specific index?


indikaw
Explorer

I had a look at inputs.conf and honestly it looks a bit complex. I would still like to go ahead with the monitor option. Since this is a large amount of data, I do not know where these files start and end. So if I start to monitor this, how do I know it has successfully completed indexing the full 200 GB?
Also, I have a separate index created to hold this archived data. Is it possible for me to send the monitored data to that index instead of the default "main"? If so, how?


lguinn2
Legend

First, you need to consider the size of your Splunk license. The free license only allows you to index 500 MB per day. However, you could still potentially load all 200 GB - if your server can load it all within a day or two. If you exceed your 500 MB limit three times in 30 days, Splunk will lock its search function.

Second, you can ask Splunk to monitor a directory - which will load in all the files and subfolders. This will work better than uploading a single file. After everything is loaded, you can delete or disable the input in Splunk and remove the directory.

For a production environment, you should install the Splunk Universal Forwarder on the "different server." The Universal Forwarder would monitor the directory and forward the data to the Splunk server. This would allow you to continue to collect the data over time.
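
As a rough sketch of that setup (the path, index name, host name, and port below are placeholders): on the other server, the Universal Forwarder's inputs.conf monitors the directory and its outputs.conf points at the Splunk server.

In $SPLUNK_HOME/etc/system/local/inputs.conf on the forwarder

[monitor://YOURPATHHERE]
index = yourindexname

In $SPLUNK_HOME/etc/system/local/outputs.conf on the forwarder

[tcpout]
defaultGroup = my_indexers

[tcpout:my_indexers]
server = splunkserver.example.com:9997

The Splunk server must also be configured to receive data on that port (9997 is the conventional default).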

indikaw
Explorer

Hi Lisa,

  1. If I ask Splunk to monitor the directory which has the 200 GB of data, how do I make sure it has loaded all of it, to avoid any overwrites? [Consider license not to be an issue.] Bear in mind that this is a one-off file load into Splunk, not a continuous action.
  2. If I want to add this as a single file, which I imagine would actually take a couple of days, how can I do that? Do I have to compress the full 200 GB into a single file, or is there any other way around this without asking Splunk to monitor it?