Hi --
I'm having some trouble with search-time field extractions that I've set up in the Splunk Manager. My tab-separated input data looks like:
12345 some_junk prefix789 morejunk
and I'm trying to extract the "789" portion into a field. My field extraction regex looks like:
^(?<req_id>[^\t]*)\t[^\t]+\tprefix(?<some_id>[^\t]*)\t
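As a quick sanity check, the extraction can be reproduced with Python's re module (an illustration only; it uses Python's (?P<name>...) named-group syntax and [^\t]* for the first capture, since the fields are tab-separated):

```python
import re

# Reproduce the search-time extraction on the sample event.
pattern = re.compile(r"^(?P<req_id>[^\t]*)\t[^\t]+\tprefix(?P<some_id>[^\t]*)\t")

event = "12345\tsome_junk\tprefix789\tmorejunk"
m = pattern.match(event)
print(m.group("req_id"), m.group("some_id"))  # 12345 789
```

The regex itself behaves as expected: some_id captures only "789", without the literal prefix.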
This extraction does work, but I'm seeing some strange behavior when I use that some_id field as part of my search. For example, a search for:
sourcetype=mylog some_id=789
returns zero results. If I search for:
sourcetype=mylog some_id=*789
I get results. When I look at the field discovery panel and click on the some_id entry, it correctly shows 789 as the value of the some_id field (and not prefix789) in 100% of the results. If I do:
sourcetype=mylog | convert auto(some_id) | search some_id=789
I get results (same as the *789 search above). I checked the length of the extracted value for the some_id field:
sourcetype=mylog some_id=*789 | eval idlen=len(some_id)
which correctly shows 3.
So - my question is: why doesn't my original search for some_id=789 return any results, but my search for some_id=*789 does?
Is there a way to specify the convert auto() as part of the field extraction, so I don't have to include it in every search that uses some_id?
Thanks!
--Colin
I'm not 100% sure here (only perhaps 95%), but I'm going to take a guess that this has to do with your prefix, how the base index search is run and then field extractions are applied, and index-time segmentation.
If you take your first search
sourcetype=mylog some_id=789
and look at it through the search job inspector, you will see near the top the base search that is being run against Splunk's keyword index (search for LISPY). It will probably look something (but perhaps not exactly - I'm doing this from memory) like:
[ AND sourcetype::mylog 789 ]
This means that Splunk is searching the keyword index for instances of sourcetype::mylog and instances of 789. Because of the late binding of regex field extractions, the concept of the field some_id does not exist yet. All of the events that match the base search will be run through the regex, and then once the field some_id exists, they will be further processed to find instances of some_id=789.
Now, backing up to when these events were originally indexed, Splunk will do segmentation of the raw text of the event in order to find the keywords to store in the keyword index. Certain character classes are defined as delineating keywords. With the default indexing rules, an event of the form:
09/14/2012 15:00:00 foo bar baz12345 meh=quux
will produce keywords in the index similar to "09", "14", "2012", "15", "foo", "bar", "baz12345", "meh", "quux". (It may produce others like "09/14/12" and "15:00:00", but I don't necessarily entirely grok this part)
Now, looking specifically at baz12345 - there was no segmentation between baz and 12345. Only the "whole word" is stored in the keyword index. So, to find 12345 in the keyword index, I would need to search on *12345.
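To make the segmentation idea concrete, here is a toy tokenizer in Python; it is only an illustration of minor segmentation on common delimiter characters, not Splunk's actual segmenters.conf behavior:

```python
import re

# Toy minor segmentation: split raw event text on common delimiters
# (whitespace, /, :, =). Illustration only, not Splunk's real tokenizer.
def keywords(raw):
    return [t for t in re.split(r"[\s/:=]+", raw) if t]

toks = keywords("09/14/2012 15:00:00 foo bar baz12345 meh=quux")
print(toks)
# 'baz12345' is stored as one keyword; '12345' by itself is not,
# so a keyword-index lookup for '12345' needs a leading wildcard.
```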
Your additional searches after the | are able to take advantage of the fact that field extraction has already happened. A search of the form:
sourcetype=mylog | search some_id=789
will only check the keyword index for sourcetype=mylog, returning all events of that sourcetype. Splunk will then do field extraction, and the downstream search for some_id=789 will not go back to the keyword index again.
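The two behaviors can be sketched with a toy model in Python (again, just an illustration of the concept, not Splunk internals):

```python
import re

events = ["12345\tsome_junk\tprefix789\tmorejunk"]
extract = re.compile(r"prefix(?P<some_id>[^\t]*)")

# Base search: can only filter on whole keywords produced by
# index-time segmentation.
def keyword_match(raw, term):
    return term in re.split(r"\s+", raw)  # 'prefix789' is one keyword

# some_id=789 at base-search time looks up the bare keyword '789':
hits = [e for e in events if keyword_match(e, "789")]
print(len(hits))  # 0 -- 'prefix789' was never segmented into '789'

# '| search some_id=789' filters after the regex extraction has run:
hits = [e for e in events if extract.search(e).group("some_id") == "789"]
print(len(hits))  # 1
```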
Some things you can do to resolve this include:
1. Rewrite your searches in the form sourcetype=mylog | search some_id=789, as described above. This works, but at the cost of retrieving every event of that sourcetype before filtering.
2. Make some_id an indexed field. This could solve the issue, but at a not-small cost in terms of complexity and maintenance.
3. Change the segmentation rules so that Splunk indexes prefix789 as two different keywords, prefix and 789. Then 789 would appear alone in the keyword index, making it searchable.
Of options #2 and #3, I'm not sure which is the lesser of the two evils, and I would not make a decision to do either one w/o discussing with Splunk support.
dwaddle's answer is excellent. However, as of some time in 4.2-4.3, there is a better option than any of the 3 he has provided. You can configure fields.conf to make the extracted suffix searchable as its own field, with a setting as follows:
[some_id]
INDEXED_VALUE = prefix<VALUE>
This will override the default conversion of a field value that he has described above. So if you search for some_id=789, that will instead convert to a search for the string prefix789, which is what you want. See the fields.conf spec file for other options on INDEXED_VALUE.
I guess if the prefix is variable, you can set the indexed field value to *<VALUE> literally. This will be a much less efficient search than with a fixed prefix (it might be as bad as simply rewriting with | search ...), but it will at least work correctly and be less confusing.
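Conceptually, INDEXED_VALUE just rewrites the field test into the keyword the index actually contains before the base search runs; a minimal sketch of the idea (not Splunk's implementation):

```python
# Toy sketch of INDEXED_VALUE = prefix<VALUE>: substitute the searched
# field value into the template to get the keyword to look up.
def rewrite(template, value):
    return template.replace("<VALUE>", value)

print(rewrite("prefix<VALUE>", "789"))  # prefix789
```

With the fixed template, the base search can look up "prefix789" directly in the keyword index instead of falling back to a wildcard scan.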
It's safe to try the suggestion I make w/o consulting Splunk support. If it works for you, great. If it's bad (and I see no reason it should be, unless you've oversimplified your explanation, e.g., the prefix is not just a simple string), you can just remove the fields.conf entry.