Splunk Search

Field extractions, data types and convert()

cmurtaugh
Engager

Hi --

I'm having some trouble with search-time field extractions that I've set up in the Splunk Manager. My tab-separated input data looks like:

12345 some_junk prefix789 morejunk

and I'm trying to extract the "789" portion into a field. My field extraction regex looks like:

^(?<req_id>[^\t]*)\t[^\t]+\tprefix(?<some_id>[^\t]*)\t
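For reference, the same extraction can be sanity-checked outside Splunk. A minimal Python sketch (assuming the sample event above; note Python spells named groups `(?P<name>...)` where Splunk's PCRE accepts `(?<name>...)`):

```python
import re

# Sample tab-separated event, modeled on the line in the question.
event = "12345\tsome_junk\tprefix789\tmorejunk"

# Same extraction as above, translated to Python's named-group syntax.
pattern = re.compile(r"^(?P<req_id>[^\t]*)\t[^\t]+\tprefix(?P<some_id>[^\t]*)\t")

m = pattern.match(event)
print(m.group("req_id"))   # 12345
print(m.group("some_id"))  # 789
```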

This extraction does work, but I'm seeing some strange behavior when I use that some_id field as part of my search. For example, a search for:

sourcetype=mylog some_id=789

returns zero results. If I search for:

sourcetype=mylog some_id=*789

I get results. When I look at the field discovery panel and click on the some_id entry, it correctly shows 789 as the value of the some_id field (and not prefix789) in 100% of the results. If I do:

sourcetype=mylog | convert auto(some_id) | search some_id=789

I get results (same as the *789 search above). I checked the length of the extracted value for the some_id field:

sourcetype=mylog some_id=*789 | eval idlen=len(some_id)

which correctly shows 3.

So - my question is: why doesn't my original search for some_id=789 return any results, but my search for some_id=*789 does?

Is there a way to specify the convert auto() as part of the field extraction, so I don't have to include it in every search that uses some_id?

Thanks!

--Colin

1 Solution

dwaddle
SplunkTrust

I'm not 100% sure here (only perhaps 95%), but I'm going to take a guess that this has to do with your prefix, how the base index search is run and then field extractions are applied, and index-time segmentation.

If you take your first search

sourcetype=mylog some_id=789

and look at it through the search job inspector, you will see near the top the base search that is being run against Splunk's keyword index (search for LISPY). It will probably look something (but perhaps not exactly - I'm doing this from memory) like:

[ AND sourcetype::mylog 789 ]

This means that Splunk is searching the keyword index for instances of sourcetype::mylog and instances of 789. Because of the late binding of regex field extractions, the concept of the field some_id does not exist yet. All of the events that match the base search will be run through the regex, and then once the field some_id exists, they will be further processed to find instances of some_id=789.

Now, backing up to when these events were originally indexed, Splunk will do segmentation of the raw text of the event in order to find the keywords to store in the keyword index. Certain character classes are defined as delineating keywords. With the default indexing rules, an event of the type:

09/14/2012 15:00:00 foo bar baz12345 meh=quux

will produce keywords in the index similar to "09", "14", "2012", "15", "foo", "bar", "baz12345", "meh", "quux". (It may produce others like "09/14/2012" and "15:00:00", but I don't necessarily entirely grok this part.)

Now, looking specifically at baz12345 - there was no segmentation between baz and 12345. Only the "whole word" is stored in the keyword index. So, to find 12345 in the keyword index, I would need to search on *12345.
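A toy model of this segmentation makes the point concrete. The sketch below is only an illustration of the idea, not Splunk's actual segmenters.conf rules, which are more involved:

```python
import re

# Toy minor segmentation: split raw text on characters that, by
# assumption here, delimit keywords (whitespace, '/', ':', '=').
def keywords(raw):
    return [k for k in re.split(r"[\s/:=]+", raw) if k]

event = "09/14/2012 15:00:00 foo bar baz12345 meh=quux"
print(keywords(event))

# "baz12345" survives as a single keyword: there is no delimiter
# between "baz" and "12345", so a lookup for 12345 alone misses,
# while a trailing-wildcard match (*12345) hits.
print("12345" in keywords(event))                          # False
print(any(k.endswith("12345") for k in keywords(event)))   # True
```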

Your additional searches after the | are able to take advantage of the fact that field extraction has already happened. A search of the form:

sourcetype=mylog | search some_id=789

will only check the keyword index for sourcetype=mylog, returning all events of that sourcetype. Splunk will then do field extraction, and the downstream search for some_id=789 will not go back to the keyword index again.

Some things you can do to resolve this include:

  1. Rewrite your searches as you have shown. You may see reduced search performance, because Splunk cannot narrow the keyword-index scan to that given value.
  2. Make some_id an indexed field. This could solve the issue, but at a non-trivial cost in terms of complexity and maintenance.
  3. Reconfigure the index-time segmentation rules to see prefix789 as two different keywords, prefix and 789. Then 789 would appear alone in the keyword index, making it searchable.

Of options #2 and #3, I'm not sure which is the lesser of the two evils, and I would not make a decision to do either one w/o discussing with Splunk support.
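For option #2, an indexed field would look roughly like the following. This is a sketch under assumptions (stanza names and the mylog sourcetype are made up); it has to be deployed on the indexers and only affects data indexed after the change:

```ini
# props.conf -- apply the transform at index time (sourcetype name assumed)
[mylog]
TRANSFORMS-someid = extract_some_id

# transforms.conf -- write some_id into the index as a keyword
[extract_some_id]
REGEX = \tprefix([^\t]*)\t
FORMAT = some_id::$1
WRITE_META = true

# fields.conf -- tell the search head the field is indexed
[some_id]
INDEXED = true
```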


gkanapathy
Splunk Employee

dwaddle's answer is excellent. However, as of some time in 4.2-4.3, there is a better option than any of the three he has provided. You can configure fields.conf to account for the extracted suffix being a separate field, with a setting as follows:

[some_id]
INDEXED_VALUE = prefix<VALUE>

This will override the default conversion of a field value into a keyword search, as dwaddle described above. So if you look for some_id=789, that will instead be converted to a search for the string prefix789, which is what you want. See the spec file for fields.conf for other options on INDEXED_VALUE.

gkanapathy
Splunk Employee

I guess if the prefix is variable, you can set the indexed field value to *<VALUE>, literally. This will be a much less efficient search than with a fixed prefix (it might be as bad as simply rewriting with | search ...), but it will at least work correctly and be less confusing.
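That is, assuming the same field name as above, the fields.conf stanza would read:

```ini
[some_id]
INDEXED_VALUE = *<VALUE>
```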



gkanapathy
Splunk Employee

It's safe to try the suggestion I make w/o consulting Splunk support. If it works for you, great. If it's bad (and I see no reason it should be, unless you've oversimplified your explanation, e.g., the prefix is not just a simple string) you can just remove the fields.conf entry.

