I am trying to do a simple monitor data input of a csv file with the following format:
Id,User,Action,_time,Comment
6783493,Laura,Purchase,1426503622.15467,Some Comment
I have tried several different configurations but each time the headers get indexed! The csv file changes when a saved search runs and outputs the csv. The headers never change. Can anyone tell me what I'm doing wrong? Surely just the standard csv sourcetype should do the trick? Thanks!
The search I was using to populate my csv extracted the header fields using the rex command. For some reason, when I then wrote over the monitored csv, this caused the headers to be indexed. I changed my search so that the field extractions were defined in props.conf rather than in the search string, and the headers stopped being indexed! Not sure exactly why, but there you go!
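For anyone hitting the same thing, a props.conf-based extraction along these lines is a sketch of what worked here (the sourcetype name is illustrative; the field names match the sample csv in the question, with _time renamed since it's reserved):

```ini
# props.conf on the search head -- search-time extraction instead of rex in the search
# [my_csv_sourcetype] is a hypothetical sourcetype name
[my_csv_sourcetype]
# extract each field positionally from a line like:
# 6783493,Laura,Purchase,1426503622.15467,Some Comment
EXTRACT-csvfields = ^(?<Id>[^,]+),(?<User>[^,]+),(?<Action>[^,]+),(?<time_epoch>[^,]+),(?<Comment>.*)$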
Sooo... I've been battling this same thing off and on for the last couple of years. I've learned a few things that might help. First, you have to decide whether you're indexing all extracted fields (not recommended) or doing search-time field extractions. What happens to me is that I always test my extractions on a standalone box, it works like a champ, and then everything breaks down in our distributed prod/UAT environment. Regardless, this might help:
For search-time extractions, most of the relevant props.conf entries will be on the search head. The indexer will only have settings associated with index-timey things (like timestamp, linemerge, line breaker, host, sourcetype, etc -- all the lightweight schema stuff). On the SH, though, you can use a combination of these settings to do the extractions from the header:
CHECK_FOR_HEADER = TRUE
HEADER_FIELD_LINE_NUMBER = <NUMBER> (this one is cranky and unreliable, but sometimes works)
KV_MODE = <CSV, JSON, XML, etc> (this one is also cranky and unreliable, especially with xml)
and if you want to be explicit (recommended in a lot of cases), you can use REPORT
REPORT-name-of-report = name_of_transforms.conf_stanza
Then transforms.conf on the search head will look something like this:
[name_of_transforms.conf_stanza]
DELIMS = ","
FIELDS = field1, field2, field3, etc... (these values match the values in the header IDENTICALLY)
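Putting those pieces together for the csv in the question, the search-head side might look something like this (sourcetype and stanza names are illustrative, and _time is renamed in FIELDS since it's a reserved field):

```ini
# props.conf (search head)
[my_csv_sourcetype]
REPORT-csvfields = my_csv_fields

# transforms.conf (search head)
[my_csv_fields]
DELIMS = ","
# these names must match the header line exactly
FIELDS = Id, User, Action, time_epoch, Comment
```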
Soooo... while this is the generally recommended approach for a large-scale distributed environment, it often doesn't work well because of the relationship between the line breakers and timestamp extractions on the indexers and the .conf files on the search head. Essentially, you set it all up, you think it should work, and then it doesn't (but it did on a standalone)... and then troubleshooting sucks.
For index-time extractions, you can use a combination of the following settings:
INDEXED_EXTRACTIONS = <CSV|TSV|PSV|W3C|JSON>
PREAMBLE_REGEX = <match some pattern in the preamble/header> (this ignores any matching line at index time; the related FIELD_HEADER_REGEX setting is what tells Splunk which line to use for the field names.)
So if you use PREAMBLE_REGEX, but want search time extractions, you can't (because that line is ignored by the time the search head sees it.).
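As a concrete index-time sketch for the csv in the question (again, the sourcetype name is hypothetical):

```ini
# props.conf on the indexer / heavy forwarder -- index-time extraction
[my_csv_sourcetype]
INDEXED_EXTRACTIONS = csv
# use line 1 of the file for the field names and don't index it as an event
HEADER_FIELD_LINE_NUMBER = 1
FIELD_DELIMITER = ,
```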
Another method of troubleshooting, even if you don't plan on indexed extractions, is to turn on
INDEXED_EXTRACTIONS = csv
to see if it's your extractions on the SH or something else that's causing the problem.
And then there's the fishbucket :)... but that's another story... the hits just keep on coming...
Did you try using
INDEXED_EXTRACTIONS = csv
HEADER_FIELD_LINE_NUMBER = 1
for this sourcetype?
Yep, I've tried that, and headers are still being indexed.
same here. bug?
The strange thing is that the headers are being extracted as field names but also as values, so I get User as a value of the User field!
I've just created a brand new csv and indexed it with the following:
props.conf
[indexed_extractions_test]
HEADER_FIELD_LINE_NUMBER=1
FIELD_DELIMITER=,
INDEXED_EXTRACTIONS = csv
csv:
os,range
AIX:Version,aix
FreeBSD:Version,freebsd
HPUX:Version,hpux
Linux:Version,linux
OSX:Version,osx
Solaris:Version,solaris
Unix:Version,unix
$SPLUNK_HOME/bin/splunk add oneshot <path_to_csv> -index main -sourcetype indexed_extractions_test
The results are accurate. CSV is indexed without the header, and I have KV pairs for os=*:Version and range=solaris etc.
Make sure you are deleting the old indexed data before rerunning it.
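For a test index, one way to clear out the previously indexed data between runs (destructive, so only on a test box — this wipes all events in the index) is:

```
$SPLUNK_HOME/bin/splunk stop
$SPLUNK_HOME/bin/splunk clean eventdata -index main
$SPLUNK_HOME/bin/splunk start
```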
I've worked out that I don't think it's the configuration at all. I can also index a csv and it works fine, but when I overwrite it with my search output, that's when it starts indexing the headers.
What are you doing in your search? Also note that _time is a reserved field, so using it as a field name could create problems.
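One hedged workaround along those lines: rename _time before writing the file, so the generated header never collides with the reserved field (sourcetype, field names, and output filename here are illustrative):

```
index=main sourcetype=my_csv_sourcetype
| rename _time AS event_time
| table Id User Action event_time Comment
| outputcsv monitored_file.csv
```

outputcsv writes under $SPLUNK_HOME/var/run/splunk/csv by default, which is where the monitor input would then pick the file up.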
I've sorted it, thanks very much for your help! Sometimes you just need a bit of inspiration!
Please let us know what you encountered, might help others down the road!
That's what I'm thinking too. I'm going to try a few things and I'll let you know!