I have the following types of events in FIX format. This is what they look like in vi or emacs:
<code>M|219620|0|i|I|20100506-16:15:53.443|463|8=FIX.4.4^A9=440^A35=i^A50=FXSpot
M|219621|0|i|I|20100506-16:15:53.444|461|8=FIX.4.4^A9=438^A35=i^A50=FXSpot</code>
For the sake of simplicity, I have discarded the rest of the FIX message for this example. Note the ^A used as the delimiter between "fields".
After indexing the data in Splunk, the ^A becomes hex \x1 within Splunk Web and Splunk CLI.
<code>M|219620|0|i|I|20100506-16:15:53.443|463|8=FIX.4.4\x19=440\x135=i\x150=FXSpot
M|219621|0|i|I|20100506-16:15:53.444|461|8=FIX.4.4\x19=438\x135=i\x150=FXSpot</code>
My props.conf looks like this:
<code>[FIX]
SHOULD_LINEMERGE = false
KV_MODE = none
REPORT-all = get_all_fields</code>
My transforms.conf looks like this:
<code>[get_all_fields]
DELIMS = "\x1"
FIELDS = "a", "b", "c", "d"</code>
I have tried \x1, \\x1, and \x01. None of them extract the 4 "fields" in the example. What should the hex value be for DELIMS to properly break the fields? Is there a limitation where DELIMS can only take one character? I also tried using "", but that did not create any field extraction.
Yes, Splunk will replace unprintable characters with their C-style hex notation before indexing. That can be quite annoying, but then again, so is trying to search for unprintable characters. If you're curious, you can see a table of these conversions on the Wikipedia ASCII page; search down the page for the "Start of Heading" (SOH) character.
It seems like you have a fields-inside-a-field situation going on here, right?
You have fields delimited by a pipe (|), and then the 8th field (at least in your given example) contains an additional set of delimited fields. I'm not sure exactly how Splunk handles that. If you simply set your delimiter to the SOH character (\x1), then your first field would contain M|219620|0|i|I|20100506-16:15:53.443|463|8=FIX.4.4, when you probably only want it to contain 8=FIX.4.4. So simply getting your delimiter set properly isn't going to fully work.
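To make the nesting concrete, here is a minimal Python sketch (not Splunk configuration) comparing a split on the SOH character alone with a split on the pipe first. It assumes the raw event still contains the actual 0x01 byte; the variable names are made up for illustration.

```python
SOH = "\x01"  # the ^A / Start of Heading character

# Sample event from the question, rebuilt with a real SOH byte (assumption).
inner_part = SOH.join(["8=FIX.4.4", "9=440", "35=i", "50=FXSpot"])
event = "M|219620|0|i|I|20100506-16:15:53.443|463|" + inner_part

# Splitting on SOH only: the first piece drags the whole pipe-delimited prefix along.
naive = event.split(SOH)

# Two-level split: pipes first, then SOH inside the last outer field.
outer = event.split("|")
inner = outer[-1].split(SOH)

print(naive[0])
print(inner)
```

The first print shows the oversized first field described above; the second shows the four embedded fields on their own.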
I'm guessing it would make the most sense to first extract the outer set of fields using DELIMS="|" and then set up a secondary field extraction to pull out your embedded fields.
So, perhaps you would end up with something like this:
<code>[FIX]
SHOULD_LINEMERGE = false
KV_MODE = none
REPORT-outer_fields = get_outer_fields, get_inner_fields</code>
<code>[get_outer_fields]
DELIMS = "|"
FIELDS = "f1", "f2", "f3", "f4", "f5", "_f6", "f7", "inner_fields"

[get_inner_fields]
REGEX = (?:^|\x1)(?<a>.+)\x1(?<b>.+)\x1(?<c>.+)\x1(?<d>.+)$
SOURCE_KEY = inner_fields</code>
I think this should work. This does seem like a complicated scenario.
If the number of subfields is not constant (4), then you could use a multi-value field extraction like this. (That regex should work; it took me a few tries, but it seems to be the best solution I could come up with.)
<code>[get_inner_fields]
REGEX = (?=^|\x1)(?:\x1)?(?<my_fields>.+?)(?:\x1)?(?=$|\x1)
SOURCE_KEY = inner_fields
MV_ADD = True</code>
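The multi-value pattern can be sanity-checked outside Splunk. Below is a rough Python sketch, assuming the inner field holds real 0x01 bytes. Python's `re` module spells named groups `(?P<name>...)` and needs the two-digit form `\x01`, so the pattern is adapted and is not byte-for-byte what Splunk's regex engine would see.

```python
import re

SOH = "\x01"
# Stand-in for the inner_fields value from the outer extraction (assumption).
inner_fields = SOH.join(["8=FIX.4.4", "9=440", "35=i", "50=FXSpot"])

# Same shape as the REGEX above, in Python syntax.
pattern = re.compile(r"(?=^|\x01)(?:\x01)?(?P<my_fields>.+?)(?:\x01)?(?=$|\x01)")

# Each successive match plays the role of one MV_ADD value.
values = [m.group("my_fields") for m in pattern.finditer(inner_fields)]
print(values)
```

Each match stops just before the next SOH, so no value carries a delimiter with it, however many subfields there are.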
Another possible option (and I don't know the FIX format at all, so this may not work): if the 8 in 8=FIX.4.4 means something like 'fix_version_number', you could just write a bunch of extractions that use the leading number to map to different field names. For example, for "8" you could add something like this to your props file:
EXTRACT-fix_field_8 = (?:\||\x1|^)8=(?<fix_version_number>.*?)(?:\||\x1|$)
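As an illustration of the one-extraction-per-tag idea, here is a small Python sketch that applies the same pattern shape for several tag numbers. The mapping and the helper `extract_named` are hypothetical; only `fix_version_number` comes from the example above, and the other field names are invented for the sketch. It assumes the event contains real 0x01 bytes.

```python
import re

SOH = "\x01"
event = "M|219620|0|i|I|20100506-16:15:53.443|463|" + SOH.join(
    ["8=FIX.4.4", "9=440", "35=i", "50=FXSpot"]
)

# Hypothetical tag-number -> field-name mapping (names other than the first are made up).
TAG_NAMES = {"8": "fix_version_number", "35": "msg_type", "50": "sender_sub_id"}

def extract_named(text, tag_names=TAG_NAMES):
    """Run one regex per tag, mirroring one EXTRACT-... line per tag."""
    fields = {}
    for tag, name in tag_names.items():
        # A pipe, SOH, or start of text before the tag; the same set (or end) after the value.
        rx = r"(?:\||\x01|^)" + re.escape(tag) + r"=(.*?)(?:\||\x01|$)"
        m = re.search(rx, text)
        if m:
            fields[name] = m.group(1)
    return fields

print(extract_named(event))
```

The trade-off is one extraction per tag you care about, in exchange for readable field names.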
Another thought (which may make all of the above options simpler) would be to add a SEDCMD to your sourcetype to change all of the ^A characters into something more useful at index time. Maybe something like a comma? (You would probably want to find a character or sequence of characters not already being used in your events.)
Using a punctuation character like a comma also has the advantage of improving the way terms are segmented in your index, which will let you search on more of these embedded fields efficiently. For example, in your example event you can search for "8=FIX.4.4", but you can't search for "50=FXSpot" because it would be stored in the index as "150=FXSpot"; you would have to search with "*50=FXSpot" instead. Using a better punctuation character works around this problem.
One more option. Email Glenn and take a look at a custom search command he is using to handle FIX log processing. See his post here:
If you're just trying to substitute the SOH character, I was FINALLY able to do it after spending a ton of time, and it's a very simple solution. I may be reiterating what Lowell said, but hopefully this example saves someone else a ton of time. Additionally, the solution handles it at index time rather than at search time, so the events are easier to read for users who don't realize there's a SOH delimiter to deal with:
Edit $SPLUNK_HOME/etc/system/local/props.conf (on the indexer box if your search head and indexer are 2 different boxes) and add the following:
<code>[myfixsourcetype]
SEDCMD-stripsoh = s/\x01/ /g</code>
Then restart Splunk. Now any NEW FIX data will have the SOH character replaced with a space character. This will NOT affect FIX data that is already indexed in Splunk.
Note: Of course, "myfixsourcetype" needs to be replaced with the actual sourcetype name that your FIX data comes in as; otherwise Splunk has no way of identifying the data to apply the sed command to. See the props.conf spec for other data identifiers you can use (i.e. host or source).
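For anyone who wants to preview what the substitution does before touching an indexer, here is a minimal Python sketch of the same `s/\x01/ /g` replacement. It only imitates the sed expression on a sample string built from the question's example and does not involve Splunk.

```python
import re

SOH = "\x01"
# Sample FIX fragment with real SOH bytes between fields (assumption).
raw = SOH.join(["8=FIX.4.4", "9=440", "35=i", "50=FXSpot"])

# Equivalent of the sed expression s/\x01/ /g: every SOH becomes a space.
cleaned = re.sub(r"\x01", " ", raw)
print(cleaned)
```

After the replacement each tag=value pair is a separate whitespace-delimited token, which is what makes the events readable.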
FYI - I'm running Splunk on a RedHat Linux box.
FIX protocol field delimiter
Splunk 6, FIX 4.2
Another approach is to use the key value pair extractions defined in transforms.conf.
The short of it is that it's using a negative lookahead so that a value never matches across \x01.
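To register this extraction, add the following to the props.conf stanza for your sourcetype:
<code>REPORT-fields = fixkv</code>
And define the transform in transforms.conf:
<code>[fixkv]
REGEX = (\d+)=((?:(?!\x01).)+)
FORMAT = $1::$2</code>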
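To see what the key-value pattern captures, here is a short Python sketch that runs the same regex over a sample event and builds a dictionary, which roughly corresponds to what `FORMAT = $1::$2` does with the two capture groups. It assumes the event contains real 0x01 bytes; the sample event is taken from the original question.

```python
import re

SOH = "\x01"
event = "M|219620|0|i|I|20100506-16:15:53.443|463|" + SOH.join(
    ["8=FIX.4.4", "9=440", "35=i", "50=FXSpot"]
)

# Group 1: the numeric tag. Group 2: every following character that is not SOH.
kv_pattern = re.compile(r"(\d+)=((?:(?!\x01).)+)")

pairs = dict(kv_pattern.findall(event))
print(pairs)
```

The digits in the pipe-delimited prefix are skipped because none of them is followed directly by an equals sign.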
Hope this helps someone. If anyone has suggestions on how to make this one more efficient, please feel free to add.