I've been racking my brain over this very simple field extraction.
My extraction (see the example below) has problems because my time format contains "-" and so do my other fields. I cannot specify the position of the timestamp since I have 2-3 timestamps in an event. What is the best way to extract these fields?
props.conf:
[source::C:\Documents and Settings\Sample]
TIME_FORMAT= %Y-%M-%D %H:%M:%S
CHECK_FOR_HEADER = false
REPORT-test = test
transforms.conf:
[test]
DELIMS = ","
FIELDS = "severity", "alm_no", "site_id", "alm_type","rsv1", "start_time", "end_time","duration", "rsv2"
Sample in input file:
MINOR,56789,/aaa-bbb-bbb/tv-daop/Rkhkjkj #2/Shelf #2/jjj #1, FAIL, , 2010-06-24 21:57:46,2010-06-24 21:59:23,0 00:01:37,N/A
Splunk search result
Severity=MINOR | alm_no=56789 | site_id=/aaa/ | start_time=-bbb-bbb/tv-d | end_time=o | duration=/Rkhkjkj #2/Shelf #2/jjj #1 | rsv2_par=FAIL
(Adding as answer since I cannot post this as comment)
Sorry for the confusion with the two postings; I was monitoring the other one. You may be right about the leakage!!! I don't know how this is happening. If I have my transforms exactly like above, I get a blank when I run the command. BUT - when the first line in transforms.conf is a blank line and I run the command, I get this: C:\Program Files\Splunk\bin>btool transforms list test
[test]
CAN_OPTIMIZE = True
CLEAN_KEYS = True
DEFAULT_VALUE =
DELIMS = ","\par
DEST_KEY =
FIELDS = "severity", "alm_no", "site_id", "alm_type","rsv1", "start_time", "end_ time","duration", "rsv2"\par
FORMAT =
LOOKAHEAD = 4096
MV_ADD = False
REGEX =
SOURCE_KEY = _raw
WRITE_META = False
What in the world is \par at the end of my DELIMS? I don't see it when I open the file. Is this causing the problem?
BTW, I am very new to this tool, and these are the first transforms.conf and props.conf I am editing. Everything else is brand new, i.e., no meddling.
Update after solving DELIMS issue: after I solved the \par mystery, I tried again and found that my field extractions are still not showing up. I used the test sourcetype command, and here is what I get:
C:\Program Files\Splunk\bin>splunk test sourcetype "C:\Documents and Settings\Sample\trial.csv"
Using logging configuration at C:\Program Files\Splunk\etc\log-cmdline.cfg.
INFO FileClassifierManager - AutoHeader: delim=',', score=1.000, count=17, mode=8.000, filename="trial.csv"
INFO FileClassifierManager - AutoHeader: filename="trial.csv", found headerline=[MINOR,56789,/aaa-bbb-bbb/tv-daop/Rkhkjkj #2/Shelf #2/jjj #1, FAIL, , 2010-06-24 21:57:46,2010-06-24 21:59:23,0 00:01:37,N/A]
INFO FileClassifierManager - AutoHeader: skipped saving. found exact transforms.conf entry, stanza_name="REPORT-AutoHeader" linked in props="csv-3", filename="trial.csv"
INFO FileClassifierManager - AutoHeader: changing sourcetype from="csv" to="csv-3" for filename="trial.csv"
PROPERTIES OF C:\Documents and Settings\Sample\trial.csv
Attr:BREAK_ONLY_BEFORE
Attr:BREAK_ONLY_BEFORE_DATE True
Attr:CHARSET AUTO
Attr:CHECK_FOR_HEADER true
Attr:DATETIME_CONFIG \etc\datetime.xml
Attr:KV_MODE none
Attr:LEARN_SOURCETYPE true
Attr:MAX_DAYS_AGO 2000
Attr:MAX_DAYS_HENCE 2
Attr:MAX_DIFF_SECS_AGO 3600
Attr:MAX_DIFF_SECS_HENCE 604800
Attr:MAX_EVENTS 256
Attr:MAX_TIMESTAMP_LOOKAHEAD 128
Attr:MUST_BREAK_AFTER
Attr:MUST_NOT_BREAK_AFTER
Attr:MUST_NOT_BREAK_BEFORE
Attr:REPORT-AutoHeader AutoHeader-2
Attr:SEGMENTATION indexing
Attr:SEGMENTATION-all full
Attr:SEGMENTATION-inner inner
Attr:SEGMENTATION-outer outer
Attr:SEGMENTATION-raw none
Attr:SEGMENTATION-standard standard
Attr:SHOULD_LINEMERGE False
Attr:TRANSFORMS
Attr:TRUNCATE 10000
Attr:is_valid True
Attr:maxDist 100
Attr:pulldown_type true
Attr:sourcetype csv-3
OK, I have set CHECK_FOR_HEADER to false, but it is ignored anyway.
It looks like having csv as the sourcetype is the issue. The auto-learning creates issues when it sees csv as the sourcetype and takes the stanza from the directory you mentioned. There is some good information at http://www.splunk.com/base/Documentation/4.1.3/Admin/Aboutdefaultfields
I'll make changes to CHECK_FOR_HEADER and the sourcetype and will post what I see.
Also, make sure that you end up seeing REPORT-<class> = test or your transform will not be used. (Right now it's missing.)
Try setting CHECK_FOR_HEADER = False. That's probably your issue. It sounds like you have fixed fields anyway, so it's better not to use it. (I've never had success with the auto-header feature.... Go look in etc/apps/learning/local/transforms.conf to see what "AutoHeader-2" is set up to do.) When you move to a more production-level situation, make sure you are actually assigning a sourcetype name (other than csv-3); this is especially important for delimited files, because you can only have one field-extraction setup per sourcetype.
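For illustration, a minimal sketch of what explicitly assigning a sourcetype might look like. The sourcetype name alarm_csv is hypothetical; the monitor path is adapted from the paths in this thread:

```ini
# inputs.conf (hypothetical monitor stanza assigning an explicit sourcetype)
[monitor://C:\Documents and Settings\Sample\trial.csv]
sourcetype = alarm_csv

# props.conf (the stanza now keys off the explicit sourcetype, not csv-3)
[alarm_csv]
CHECK_FOR_HEADER = false
REPORT-test = test
```

With an explicit sourcetype, the auto-header machinery no longer reassigns your events to a generated csv-N sourcetype, so your REPORT-test extraction actually applies.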
Thanks much for being so helpful! See above for props.conf issue.
Wow, very weird. Good to know it's not a Splunk bug, though. (You don't have to wait for the next release.) Feel free to reference my question about tracking down props.conf bugs (I put a link in one of my posts); I tried to collect a bunch of common mistakes that I made when I started out. Glad it's working for you now. Best of luck moving forward!
FINALLY found the issue!! Because of the strange characters appended to FIELDS and DELIMS, I was suspicious and opened the files with a different editor. Voila! I saw these weird characters and deleted them. Now the output of btool looks clean.
(This is really just another comment, but I need better formatting so I'm posting it as an "answer")
I copied your example to a test file on my system (Splunk 4.1.3 on Ubuntu Linux 8.04) to attempt to reproduce your issue. I saved your sample in a local file ("/tmp/weird_delims.log") and added the following entries in $SPLUNK_HOME/etc/system/local:
props.conf:
[weird_delims-too_small]
# This is the automatically assigned sourcetype (based on the name I gave the test input file)
REPORT-test = test
transforms.conf: (exactly the same as yours)
[test]
DELIMS = ","
FIELDS = "severity", "alm_no", "site_id", "alm_type","rsv1", "start_time", "end_time","duration", "rsv2"
Then I ran the following search:
| file /tmp/weird_delims.log | extract reload=T
Quick explanation: the file command lets me see your log file in Splunk without actually indexing it (it's like a "preview"). The extract reload=T tells Splunk to reload the props/transforms settings without restarting splunkd.
When looking at the fields, they all look correct. I'm not going to copy them all, but here are a few of them:
alm_no=56789
duration=0 00:01:37
rsv2=N/A
severity=MINOR
site_id=/aaa-bbb-bbb/tv-daop/Rkhkjkj #2/Shelf #2/jjj #1
I'm not seeing any of the weirdness that you are experiencing with your configs.
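To show why the delimited extraction should be unambiguous here, a plain comma split of your sample line (a rough Python sketch of what DELIMS="," does, ignoring any quote handling) yields exactly the values above:

```python
# Sketch of what a simple DELIMS="," split should yield on the sample event.
# The sample uses no quoting, so a naive split matches Splunk's delimited extraction.
sample = ("MINOR,56789,/aaa-bbb-bbb/tv-daop/Rkhkjkj #2/Shelf #2/jjj #1,"
          " FAIL, , 2010-06-24 21:57:46,2010-06-24 21:59:23,0 00:01:37,N/A")
fields = ["severity", "alm_no", "site_id", "alm_type", "rsv1",
          "start_time", "end_time", "duration", "rsv2"]

# Pair each field name with the corresponding comma-separated value
extracted = dict(zip(fields, sample.split(",")))
print(extracted["site_id"])   # /aaa-bbb-bbb/tv-daop/Rkhkjkj #2/Shelf #2/jjj #1
print(extracted["duration"])  # 0 00:01:37
```

Note that the dashes inside site_id survive intact; only the commas act as delimiters, which is why the garbled output in the original question pointed to a config problem rather than the data.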
Have you tried using btool to show your config entries, just to double-check that you don't have overlapping configs? You should also try running:
splunk test sourcetype "C:\Documents and Settings\Sample\name_of_your_file"
And see what all props settings come back. I suspect you have some kind of overlapping config that is causing problems for you.
You may find this helpful: What’s the best way to track down props.conf problems?
Update
I can reproduce the behavior you're seeing exactly by simply setting DELIMS = "ap,". I'm fairly convinced at this point that you have some kind of weird config-stanza leakage going on. btool is your friend. (If you accidentally set DELIMS to your list of fields, you see something similar.)
Please post the results of the command:
C:\Program Files\Splunk\bin\btool.exe transforms list test
Result of splunk test sourcetype is below. I know AutoHeader is doing something, but I'm not sure what. Still not seeing my custom fields in search.
I'm not sure what's going on with your "-" character in your event. Are you generating these CSV files, or is that some app outside of your control? (The reason I ask is that using some double quotes could make this easier to process)
BTW, you can still specify which timestamp to use with TIME_PREFIX, even if you have a CSV file, though it does look uglier with CSV files. Here is an example that would use your start_time field as the timestamp:
SHOULD_LINEMERGE = False
TIME_PREFIX = ^(?:[^,]*,){5}\s*
TIME_FORMAT= %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 256
Your existing timestamp format was incorrect. This one should work for you (notice the lowercase %m and %d in the date portion). You may need to play with the value of MAX_TIMESTAMP_LOOKAHEAD depending on how much data exists before your timestamps per line.
Update: Please understand that Splunk can only extract a single timestamp for an event. This timestamp is what is displayed to the left of the event in the search web interface. Splunk uses it to order how events are stored internally, and the value can be found in the _time field. Since your event has multiple timestamps in its raw text, you simply have to pick one to be the primary timestamp that Splunk uses. Keep in mind that this is different from defining custom fields which happen to contain timestamps, which is the case for your "start_time" and "end_time" custom-defined fields. (You can use the same actual timestamp for both an extracted field and for the event's actual timestamp.)
Also keep in mind that if you have commas inside your csv fields, or any kind of special quoting, then a more sophisticated regex would be needed.
As far as extracting your fields, the delimited approach should work for you. However, there are times where you need a regex-based approach. Something like this (it goes in your props.conf file):
EXTRACT-csv_fields = ^(?<severity>[^,]*),(?<alm_no>[^,]+),(?<site_id>[^,]*), ....
Yeah, this approach is very tedious and somewhat error prone, but it does give you ultimate control over how your fields are extracted.
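To show the named-group idea concretely, here is a hypothetical completion of that regex for all nine fields (the field names come from the FIELDS list above; the "...." in the props.conf snippet is left to you), tested in Python against the sample event:

```python
import re

# Hypothetical full version of the EXTRACT-csv_fields regex: nine named
# groups of [^,]* separated by literal commas, anchored at both ends.
pattern = re.compile(
    r"^(?P<severity>[^,]*),(?P<alm_no>[^,]*),(?P<site_id>[^,]*),"
    r"(?P<alm_type>[^,]*),(?P<rsv1>[^,]*),(?P<start_time>[^,]*),"
    r"(?P<end_time>[^,]*),(?P<duration>[^,]*),(?P<rsv2>[^,]*)$"
)

event = ("MINOR,56789,/aaa-bbb-bbb/tv-daop/Rkhkjkj #2/Shelf #2/jjj #1,"
         " FAIL, , 2010-06-24 21:57:46,2010-06-24 21:59:23,0 00:01:37,N/A")
m = pattern.match(event)
print(m.group("site_id"))             # /aaa-bbb-bbb/tv-daop/Rkhkjkj #2/Shelf #2/jjj #1
print(m.group("start_time").strip())  # 2010-06-24 21:57:46
```

Because each group matches "anything but a comma", dashes inside a field can never be mistaken for delimiters, which is exactly the control the regex approach buys you.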
Does this help?
Wow, that is very odd. What version are you running? Have you tried renaming your 'test' stanza to another name? (Perhaps there is a conflict somewhere.)
With all this insanity, I can't even conclude if it is just the "a" or if any other character has the same issue.
Per your suggestion, I was about to restate my question. Since strange things were happening, I thought I'd test it out thoroughly before posing the question. I can't explain what I am seeing: it is NOT the dash that is causing the problem, it is "a" (lower-case a). Insane! "a" acts as a delimiter. When I manually changed all my "a"s to "A"s, things worked fine. Am I missing something?
BTW, explicitly specifying a TIME_PREFIX and TIME_FORMAT is recommended by Splunk for performance reasons, and I've found it to be worth the effort. Especially since you have two dates, it's probably better to be explicit about it--you don't want Splunk to randomly pick which timestamp to use; you probably want it to always be the same. Just my 2 cents.
Try updating your question with what you now know. You may also want to change the title of your question to something like "Delimited field extraction not working when field contains dashes". Your question should get more attention this way. I'm wondering if this is a bug. Is the sample posted a literal example, or has it been tweaked for posting on the web? Nothing about "-" should be special in this case....
OK, that explains everything. Thanks for clearing up the internal vs. extracted time. Splunk extracts the internal timestamp just fine (I looked on the left). I was confused because my start_time and end_time were not extracted. Now the problem becomes entirely different. I removed everything about TIME_FORMAT from props.conf and kept transforms.conf.
a) Things work fine if there are no dashes in any fields (fields 1-5).
b) Extraction is absurd when there is a dash anywhere in the fields, so DELIMS is not doing the job.
I think there's some confusion on timestamps. I've added an update in the answer. The value of TIME_FORMAT is in no way used to extract the values of your start_time or end_time fields. TIME_FORMAT and TIME_PREFIX are only used to extract Splunk's internal timestamp associated with your event. (This could/should be "just working" out of the box, so perhaps I completely misunderstood where your problem was to begin with.)
Whoops, I had a typo (an extra space) in TIME_FORMAT. (I updated my answer.) And yes, case is very important here: %m means month, whereas %M means minute, and they are NOT interchangeable. Make sure you restart Splunk after your changes and then feed in new events, since these are index-time changes, not search-time changes; only newly indexed events will be affected... Splunk only supports a single timestamp per event; however, you can extract the other timestamps and do field manipulation with them, but you have to choose one to be your event's timestamp.
I have no control over the CSV file for now. I'll think about ways to change that in the future.