I've been racking my brain over this very simple field extraction.
My extraction (see the example below) has problems because my time format contains "-" and so do my other fields. I cannot specify the position of the timestamp since I have 2-3 timestamps in an event. What is the best way to extract these fields?
props.conf:
[source::C:\Documents and Settings\Sample]
TIME_FORMAT= %Y-%M-%D %H:%M:%S
CHECK_FOR_HEADER = false
REPORT-test = test
transforms.conf:
[test]
DELIMS = ","
FIELDS = "severity", "alm_no", "site_id", "alm_type","rsv1", "start_time", "end_time","duration", "rsv2"
Sample in input file:
MINOR,56789,/aaa-bbb-bbb/tv-daop/Rkhkjkj #2/Shelf #2/jjj #1, FAIL, , 2010-06-24 21:57:46,2010-06-24 21:59:23,0 00:01:37,N/A
Splunk search result
Severity=MINOR | alm_no=56789 | site_id=/aaa/ | start_time=-bbb-bbb/tv-d | end_time=o | duration=/Rkhkjkj #2/Shelf #2/jjj #1 | rsv2_par=FAIL
(Adding as answer since I cannot post this as comment)
Sorry for the confusion with the two postings; I was monitoring the other one. You may be right about the leakage!!! I don't know how this is happening. If I have my transforms exactly like above, I get a blank when I run the command. BUT - when the first line in transforms.conf is a blank line and I run the command, I get this: C:\Program Files\Splunk\bin>btool transforms list test
[test]
CAN_OPTIMIZE = True
CLEAN_KEYS = True
DEFAULT_VALUE =
DELIMS = ","\par
DEST_KEY =
FIELDS = "severity", "alm_no", "site_id", "alm_type","rsv1", "start_time", "end_ time","duration", "rsv2"\par
FORMAT =
LOOKAHEAD = 4096
MV_ADD = False
REGEX =
SOURCE_KEY = _raw
WRITE_META = False
What in the world is \par at the end of my DELIMS? I don't see it when I open the file. Is this causing the problem?
BTW, I am very new to this tool, and these are the first transforms.conf and props.conf I am editing. Everything else is brand new, i.e., no meddling.
Update after solving DELIMS issue: after I solved the \par mystery, I tried again and found that my field extractions are still not showing up. I used the test sourcetype command, and here is what I get:
C:\Program Files\Splunk\bin>splunk test sourcetype "C:\Documents and Settings\Sample\trial.csv"
Using logging configuration at C:\Program Files\Splunk\etc\log-cmdline.cfg.
INFO FileClassifierManager - AutoHeader: delim=',', score=1.000, count=17, mode=8.000, filename="trial.csv"
INFO FileClassifierManager - AutoHeader: filename="trial.csv", found headerline=[MINOR,56789,/aaa-bbb-bbb/tv-daop/Rkhkjkj #2/Shelf #2/jjj #1, FAIL, , 2010-06-24 21:57:46,2010-06-24 21:59:23,0 00:01:37,N/A]
INFO FileClassifierManager - AutoHeader: skipped saving. found exact transforms.conf entry, stanza_name="REPORT-AutoHeader" linked in props="csv-3", filename="trial.csv"
INFO FileClassifierManager - AutoHeader: changing sourcetype from="csv" to="csv-3" for filename="trial.csv"
PROPERTIES OF C:\Documents and Settings\Sample\trial.csv
Attr:BREAK_ONLY_BEFORE
Attr:BREAK_ONLY_BEFORE_DATE True
Attr:CHARSET AUTO
Attr:CHECK_FOR_HEADER true
Attr:DATETIME_CONFIG \etc\datetime.xml
Attr:KV_MODE none
Attr:LEARN_SOURCETYPE true
Attr:MAX_DAYS_AGO 2000
Attr:MAX_DAYS_HENCE 2
Attr:MAX_DIFF_SECS_AGO 3600
Attr:MAX_DIFF_SECS_HENCE 604800
Attr:MAX_EVENTS 256
Attr:MAX_TIMESTAMP_LOOKAHEAD 128
Attr:MUST_BREAK_AFTER
Attr:MUST_NOT_BREAK_AFTER
Attr:MUST_NOT_BREAK_BEFORE
Attr:REPORT-AutoHeader AutoHeader-2
Attr:SEGMENTATION indexing
Attr:SEGMENTATION-all full
Attr:SEGMENTATION-inner inner
Attr:SEGMENTATION-outer outer
Attr:SEGMENTATION-raw none
Attr:SEGMENTATION-standard standard
Attr:SHOULD_LINEMERGE False
Attr:TRANSFORMS
Attr:TRUNCATE 10000
Attr:is_valid True
Attr:maxDist 100
Attr:pulldown_type true
Attr:sourcetype csv-3
OK, I have set CHECK_FOR_HEADER to false, but it is ignored anyway.
It looks like having csv as the sourcetype is the issue. The auto-learning creates issues when it sees csv as the sourcetype and takes the stanza from the directory you mentioned. There is some good information at http://www.splunk.com/base/Documentation/4.1.3/Admin/Aboutdefaultfields
I'll make changes to CHECK_FOR_HEADER and the sourcetype and will post what I see.
Also, make sure that you end up seeing REPORT-<class> = test or your transform will not be used. (Right now it's missing.)
Try setting CHECK_FOR_HEADER = False. That's probably your issue. It sounds like you have fixed fields anyway, so it's better not to use it. (I've never had success with the auto-header feature.... Go look in etc/apps/learning/local/transforms.conf to see what "AutoHeader-2" is set up to do.) When you move to a more production-level situation, make sure you are actually assigning a sourcetype name (other than csv-3); this is especially important for delimited files, because you can only have one field-extraction setup per sourcetype.
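For illustration, a minimal sketch of what explicitly assigning a sourcetype might look like. The sourcetype name alarm_csv is hypothetical; the monitor path is adapted from the paths in this thread:

```ini
# inputs.conf (hypothetical monitor stanza assigning an explicit sourcetype)
[monitor://C:\Documents and Settings\Sample\trial.csv]
sourcetype = alarm_csv

# props.conf (the stanza now keys off the explicit sourcetype, not csv-3)
[alarm_csv]
CHECK_FOR_HEADER = false
REPORT-test = test
```

With an explicit sourcetype, the auto-header machinery no longer reassigns your events to a generated csv-N sourcetype, so your REPORT-test extraction actually applies.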
Thanks much for being so helpful! See above for props.conf issue.
Wow, very weird. Good to know it's not a Splunk bug, though. (You don't have to wait for the next release.) Feel free to reference my question about tracking down props.conf bugs (I put a link in one of my posts); I tried to collect a bunch of common mistakes that I made when I started out. Glad it's working for you now. Best of luck moving forward!
FINALLY found the issue!! Because of the strange characters appended to FIELDS and DELIMS, I was suspicious and opened the files with a different editor. Voila! I saw these weird characters and deleted them. Now the output of btool looks clean.
(This is really just another comment, but I need better formatting so I'm posting it as an "answer")
I copied your example to a test file on my system (Splunk 4.1.3 on Ubuntu Linux 8.04) to attempt to reproduce your issue. I saved your sample in a local file ("/tmp/weird_delims.log") and added the following entries in $SPLUNK_HOME/etc/system/local:
props.conf:
[weird_delims-too_small]
# This is the automatically assigned sourcetype (based on the name I gave the test input file)
REPORT-test = test
transforms.conf: (exactly the same as yours)
[test]
DELIMS = ","
FIELDS = "severity", "alm_no", "site_id", "alm_type","rsv1", "start_time", "end_time","duration", "rsv2"
Then I ran the following search:
| file /tmp/weird_delims.log | extract reload=T
Quick explanation: the file command lets me see your log file in Splunk without actually indexing it (it's like a "preview"). The extract reload=T tells Splunk to reload the props/transforms settings without restarting splunkd.
When looking at the fields, they all look correct. I'm not going to copy them all, but here are a few of them:
alm_no=56789
duration=0 00:01:37
rsv2=N/A
severity=MINOR
site_id=/aaa-bbb-bbb/tv-daop/Rkhkjkj #2/Shelf #2/jjj #1
I'm not seeing any of the weirdness that you are experiencing with your configs.
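To show why the delimited extraction should be unambiguous here, a plain comma split of your sample line (a rough Python sketch of what DELIMS="," does, ignoring any quote handling) yields exactly the values above:

```python
# Sketch of what a simple DELIMS="," split should yield on the sample event.
# The sample uses no quoting, so a naive split matches Splunk's delimited extraction.
sample = ("MINOR,56789,/aaa-bbb-bbb/tv-daop/Rkhkjkj #2/Shelf #2/jjj #1,"
          " FAIL, , 2010-06-24 21:57:46,2010-06-24 21:59:23,0 00:01:37,N/A")
fields = ["severity", "alm_no", "site_id", "alm_type", "rsv1",
          "start_time", "end_time", "duration", "rsv2"]

# Pair each field name with the corresponding comma-separated value
extracted = dict(zip(fields, sample.split(",")))
print(extracted["site_id"])   # /aaa-bbb-bbb/tv-daop/Rkhkjkj #2/Shelf #2/jjj #1
print(extracted["duration"])  # 0 00:01:37
```

Note that the dashes inside site_id survive intact; only the commas act as delimiters, which is why the garbled output in the original question pointed to a config problem rather than the data.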
Have you tried using btool to show your config entries, just to double-check that you don't have overlapping configs? You should also try running:
splunk test sourcetype "C:\Documents and Settings\Sample\name_of_your_file"
And see what all props settings come back. I suspect you have some kind of overlapping config that is causing problems for you.
You may find this helpful: What’s the best way to track down props.conf problems?
Update
I can reproduce the behavior you're seeing exactly by simply setting DELIMS = "ap,". I'm fairly convinced at this point that you have some kind of weird config-stanza leakage going on. btool is your friend. (If you accidentally set DELIMS to your list of fields, you see something similar.)
Please post the results of the command:
C:\Program Files\Splunk\bin\btool.exe transforms list test
Result of splunk test sourcetype is below. I know AutoHeader is doing something, but I'm not sure what. Still not seeing my custom fields in search.
I'm not sure what's going on with your "-" character in your event. Are you generating these CSV files, or is that some app outside of your control? (The reason I ask is that using some double quotes could make this easier to process)
BTW, you can still specify which timestamp to use with TIME_PREFIX, even if you have a CSV file, though it does look uglier with CSV files. Here is an example that would use your start_time field as the timestamp:
SHOULD_LINEMERGE = False
TIME_PREFIX = ^(?:[^,]*,){5}\s*
TIME_FORMAT= %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 256
Your existing timestamp format was incorrect. This one should work for you (notice the lowercase %m and %d in the date portion). You may need to play with the value of MAX_TIMESTAMP_LOOKAHEAD depending on how much data exists before your timestamps per line.
Update: Please understand that Splunk can only extract a single timestamp for an event. This timestamp is what is displayed to the left of the event in the search web interface. Splunk uses it to order how events are stored internally, and the value can be found in the _time field. Since your event has multiple timestamps in its raw text, you simply have to pick one to be the primary timestamp that Splunk uses. Keep in mind that this is different from defining custom fields which happen to contain timestamps, which is the case for your "start_time" and "end_time" custom-defined fields. (You can use the same actual timestamp for both an extracted field and for the event's actual timestamp.)
Also keep in mind that if you have commas inside your csv fields, or any kind of special quoting, then a more sophisticated regex would be needed.
As far as extracting your fields, the delimited approach should work for you. However, there are times where you need a regex-based approach. Something like this (it goes in your props.conf file):
EXTRACT-csv_fields = ^(?<severity>[^,]*),(?<alm_no>[^,]+),(?<site_id>[^,]*), ....
Yeah, this approach is very tedious and somewhat error prone, but it does give you ultimate control over how your fields are extracted.
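To show the named-group idea concretely, here is a hypothetical completion of that regex for all nine fields (the field names come from the FIELDS list above; the "...." in the props.conf snippet is left to you), tested in Python against the sample event:

```python
import re

# Hypothetical full version of the EXTRACT-csv_fields regex: nine named
# groups of [^,]* separated by literal commas, anchored at both ends.
pattern = re.compile(
    r"^(?P<severity>[^,]*),(?P<alm_no>[^,]*),(?P<site_id>[^,]*),"
    r"(?P<alm_type>[^,]*),(?P<rsv1>[^,]*),(?P<start_time>[^,]*),"
    r"(?P<end_time>[^,]*),(?P<duration>[^,]*),(?P<rsv2>[^,]*)$"
)

event = ("MINOR,56789,/aaa-bbb-bbb/tv-daop/Rkhkjkj #2/Shelf #2/jjj #1,"
         " FAIL, , 2010-06-24 21:57:46,2010-06-24 21:59:23,0 00:01:37,N/A")
m = pattern.match(event)
print(m.group("site_id"))             # /aaa-bbb-bbb/tv-daop/Rkhkjkj #2/Shelf #2/jjj #1
print(m.group("start_time").strip())  # 2010-06-24 21:57:46
```

Because each group matches "anything but a comma", dashes inside a field can never be mistaken for delimiters, which is exactly the control the regex approach buys you.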
Does this help?
Wow, that is very odd. What version are you running? Have you tried renaming your 'test' stanza to another name? (Perhaps there is a conflict somewhere.)
With all this insanity, I can't even conclude if it is just the "a" or if any other character has the same issue.
Per your suggestion, I was about to restate my question. Since strange things were happening, I thought I'd test it out thoroughly before posing the question. I can't explain what I am seeing: it is NOT the dash that is causing the problem, it is "a" (lower-case a). Insane! "a" acts as a delimiter. When I manually changed all my "a"s to "A"s, things worked fine. Am I missing something?
BTW, explicitly specifying a TIME_PREFIX and TIME_FORMAT is recommended by Splunk for performance reasons, and I've found it to be worth the effort. Especially since you have two dates, it's probably better to be explicit about it--you don't want Splunk to randomly pick which timestamp to use; you probably want it to always be the same. Just my 2 cents.
Try updating your question with what you now know. You may also want to change the title of your question to something like "Delimited field extraction not working when field contains dashes". Your question should get more attention this way. I'm wondering if this is a bug. Is the sample posted a literal example, or has it been tweaked for posting on the web? Nothing about "-" should be special in this case....
OK, that explains everything. Thanks for clearing up the internal vs. extracted time. Splunk extracts the internal timestamp just fine (I looked on the left). I was confused because my start_time and end_time were not extracted. Now the problem becomes entirely different. I removed everything about TIME_FORMAT from props.conf and kept transforms.conf.
a) Things work fine if there are no dashes in any fields (fields 1-5).
b) Extraction is absurd when there is a dash anywhere in the fields, so DELIMS is not doing the job.
I think there's some confusion on timestamps. I've added an update in the answer. The value of TIME_FORMAT is in no way used to extract the values of your start_time or end_time fields. TIME_FORMAT and TIME_PREFIX are only used to extract Splunk's internal timestamp associated with your event. (This could/should be "just working" out of the box, so perhaps I completely misunderstood where your problem was to begin with.)
Whoops, I had a typo (an extra space) in TIME_FORMAT. (I updated my answer.) And yes, case is very important here: %m means month, whereas %M means minute, and they are NOT interchangeable. Make sure you restart Splunk after your changes and then feed in new events, since these are index-time changes, not search-time changes; only newly indexed events will be affected... Splunk only supports a single timestamp per event; however, you can extract the other timestamps and do field manipulation with them, but you have to choose one to be your event's timestamp.
I have no control over the CSV file for now. I'll think about ways to change that in the future.