I am trying to do a simple monitor data input of a csv file with the following format:
Id,User,Action,_time,Comment
6783493,Laura,Purchase,1426503622.15467,Some Comment
I have tried several different configurations but each time the headers get indexed! The csv file changes when a saved search runs and outputs the csv. The headers never change. Can anyone tell me what I'm doing wrong? Surely just the standard csv sourcetype should do the trick? Thanks!
The search I was using to populate my csv extracted the header fields using the rex command. For some reason, when I then wrote over the monitored csv, this caused the headers to be indexed. I changed my search so that the field extractions were defined in props.conf rather than in the search string, and the headers stopped being indexed! Not sure exactly why, but there you go!
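For anyone hitting the same thing, a props.conf-based extraction along these lines is a sketch of what worked here (the sourcetype name is illustrative; the field names match the sample csv in the question, with _time renamed since it's reserved):

```ini
# props.conf on the search head -- search-time extraction instead of rex in the search
# [my_csv_sourcetype] is a hypothetical sourcetype name
[my_csv_sourcetype]
# extract each field positionally from a line like:
# 6783493,Laura,Purchase,1426503622.15467,Some Comment
EXTRACT-csvfields = ^(?<Id>[^,]+),(?<User>[^,]+),(?<Action>[^,]+),(?<time_epoch>[^,]+),(?<Comment>.*)$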
Sooo... I've been battling this same thing off and on for the last couple of years. I've learned a few things that might help. First, you have to decide whether you're indexing all extracted fields (not recommended) or doing search-time field extractions. What happens to me is that I always test my extractions on a standalone box, it works like a champ, and then everything breaks down in our distributed prod/UAT environment. Regardless, this might help:
For search-time extractions, most of the relevant props.conf entries will be on the search head. The indexer will only have settings associated with index-timey things (like timestamp, linemerge, line breaker, host, sourcetype, etc -- all the lightweight schema stuff). On the SH, though, you can use a combination of these settings to do the extractions from the header:
CHECK_FOR_HEADER = TRUE
HEADER_FIELD_LINE_NUMBER = <NUMBER> (this one is cranky and unreliable, but sometimes works)
KV_MODE = <CSV, JSON, XML, etc> (this one is also cranky and unreliable, especially with xml)
and if you want to be explicit (recommended in a lot of cases), you can use REPORT
REPORT-name-of-report = name_of_transforms.conf_stanza
Then transforms.conf on the search head will look something like this:
[name_of_transforms.conf_stanza]
DELIMS = ","
FIELDS = field1, field2, field3, etc... (these values match the values in the header IDENTICALLY)
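Putting those pieces together for the csv in the question, the search-head side might look something like this (sourcetype and stanza names are illustrative, and _time is renamed in FIELDS since it's a reserved field):

```ini
# props.conf (search head)
[my_csv_sourcetype]
REPORT-csvfields = my_csv_fields

# transforms.conf (search head)
[my_csv_fields]
DELIMS = ","
# these names must match the header line exactly
FIELDS = Id, User, Action, time_epoch, Comment
```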
Soooo... while this is the generally recommended approach for a large-scale distributed environment, it often doesn't work well because of the relationship between the line breakers and timestamp extractions on the indexers and the .conf files on the search head. Essentially, you set it all up, you think it should work, and then it doesn't (but it did on a standalone)... and then troubleshooting sucks.
For index-time extractions, you can use a combination of the following settings:
INDEXED_EXTRACTIONS = <CSV|TSV|PSV|W3C|JSON>
PREAMBLE_REGEX = <match some pattern in the preamble/header> (this ignores any matching line at index time; the related FIELD_HEADER_REGEX setting is what tells Splunk which line to use for the field names.)
So if you use PREAMBLE_REGEX, but want search time extractions, you can't (because that line is ignored by the time the search head sees it.).
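As a concrete index-time sketch for the csv in the question (again, the sourcetype name is hypothetical):

```ini
# props.conf on the indexer / heavy forwarder -- index-time extraction
[my_csv_sourcetype]
INDEXED_EXTRACTIONS = csv
# use line 1 of the file for the field names and don't index it as an event
HEADER_FIELD_LINE_NUMBER = 1
FIELD_DELIMITER = ,
```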
Another method of troubleshooting, even if you don't plan on indexed extractions, is to turn on
INDEXED_EXTRACTIONS = csv
to see if it's your extractions on the SH or something else that's causing the problem.
And then there's the fishbucket :)... but that's another story... the hits just keep on coming...
Did you try using
INDEXED_EXTRACTIONS = csv
HEADER_FIELD_LINE_NUMBER = 1
for this sourcetype?
Yep, I've tried that, and headers are still being indexed.
same here. bug?
The strange thing is that the headers are being extracted as field names but also as values, so I get User as a value of the User field!
I've just created a brand new csv and indexed it with the following:
props.conf
[indexed_extractions_test]
HEADER_FIELD_LINE_NUMBER=1
FIELD_DELIMITER=,
INDEXED_EXTRACTIONS = csv
csv:
os,range
AIX:Version,aix
FreeBSD:Version,freebsd
HPUX:Version,hpux
Linux:Version,linux
OSX:Version,osx
Solaris:Version,solaris
Unix:Version,unix
$SPLUNK_HOME/bin/splunk add oneshot <path_to_csv> -index main -sourcetype indexed_extractions_test
The results are accurate. CSV is indexed without the header, and I have KV pairs for os=*:Version and range=solaris etc.
Make sure you are deleting the old indexed data before rerunning it.
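For a test index, one way to clear out the previously indexed data between runs (destructive, so only on a test box — this wipes all events in the index) is:

```
$SPLUNK_HOME/bin/splunk stop
$SPLUNK_HOME/bin/splunk clean eventdata -index main
$SPLUNK_HOME/bin/splunk start
```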
I've worked out that I don't think it's the configuration at all. I can also index a csv and it works fine, but when I overwrite it with my search output, that's when it starts indexing the headers.
What are you doing in your search? Also note that _time is a reserved field, so using it as a field name could create problems.
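One hedged workaround along those lines: rename _time before writing the file, so the generated header never collides with the reserved field (sourcetype, field names, and output filename here are illustrative):

```
index=main sourcetype=my_csv_sourcetype
| rename _time AS event_time
| table Id User Action event_time Comment
| outputcsv monitored_file.csv
```

outputcsv writes under $SPLUNK_HOME/var/run/splunk/csv by default, which is where the monitor input would then pick the file up.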
I've sorted it, thanks very much for your help! Sometimes you just need a bit of inspiration!
Please let us know what you encountered, might help others down the road!
That's what I'm thinking too. I'm going to try a few things and I'll let you know!