Splunk Search

TERM and PREFIX cannot find string with two dashes

PavelP
Motivator

any ideas on TERM and PREFIX limitations with double dashes?

 

 

cat /tmp/test.txt
abc//xyz
abc::xyz
abc==xyz
abc@@xyz
abc..xyz
abc--xyz
abc$$xyz
abc##xyz
abc%%xyz
abc\\xyz
abc__xyz
search abc--xyz # works
TERM(abc--xyz) # doesn't work
TERM(abc*) # works
| tstats count by PREFIX(abc) # doesn't work for abc--xyz

 

 

 Both TERM and PREFIX work with other minor segmenters like dots or underscores. 
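For anyone who wants to reproduce this quickly, one way is to oneshot the test file into an index and compare the searches above (a sketch; the index and sourcetype names are just examples):

$SPLUNK_HOME/bin/splunk add oneshot /tmp/test.txt -index main -sourcetype test_tokens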

 

1 Solution

tscroggins
Influencer

Hi @PavelP,

This isn't an issue with TERM or PREFIX but with how Splunk indexes abc--xyz.

We can use walklex to list terms in our index:

| walklex index=main type=term
| table term

We'll find the following:

abc
abc##xyz
abc$$xyz
abc%%xyz
abc..xyz
abc//xyz
abc==xyz
abc@@xyz
abc\\xyz
abc__xyz
xyz

Note that abc--xyz is missing. Let's look at segmenters.conf. The default segmenter stanza is [indexing]:

[indexing]
INTERMEDIATE_MAJORS = false
MAJOR = [ ] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520 %5D %5B %3A %0A %2C %28 %29
MINOR = / : = @ . - $ # % \\ _

Note that -- is a major breaker, which is why abc--xyz is indexed only as the standalone terms abc and xyz. If we instead index abc-xyz with a single hyphen, we should find abc-xyz in the list of terms:

abc
abc##xyz
abc$$xyz
abc%%xyz
abc-xyz
abc..xyz
abc//xyz
abc==xyz
abc@@xyz
abc\\xyz
abc__xyz
xyz

If walklex returns a missing merged_lexicon.lex message, we can force optimization of the bucket(s) to generate the data, e.g.:

$SPLUNK_HOME/bin/splunk-optimize-lex -d $SPLUNK_HOME/var/lib/splunk/main/db/hot_v1_0
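As a side note, on a busy index the walklex output can be narrowed to the terms of interest; a sketch assuming the prefix argument is available in your version:

| walklex index=main type=term prefix=abc
| table term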

We can override major breakers in a custom segmenters.conf stanza and reference the stanza in props.conf. Ensure the segmenter name is unique and remove -- from the MAJOR setting:

# segmenters.conf

[tmp_test_txt]
INTERMEDIATE_MAJORS = false
MAJOR = [ ] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D %2520 %5D %5B %3A %0A %2C %28 %29
MINOR = / : = @ . - $ # % \\ _

# props.conf

[source::///tmp/test.txt]
SEGMENTATION = tmp_test_txt

Deploy props.conf and segmenters.conf to both search heads and search peers (indexers).
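If the data arrives under a sourcetype rather than from a monitored file, the same setting can be bound to a sourcetype stanza instead (the sourcetype name here is just an example):

# props.conf

[test_tokens]
SEGMENTATION = tmp_test_txt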

With the new configuration in place, events indexed from that point on are segmented with the custom stanza, and walklex should return abc--xyz in the list of terms:

abc
abc##xyz
abc$$xyz
abc%%xyz
abc--xyz
abc..xyz
abc//xyz
abc==xyz
abc@@xyz
abc\\xyz
abc__xyz
xyz

We can now use TERM and PREFIX as expected:

| tstats values(PREFIX(abc--)) as vals where index=main TERM(abc--*) by PREFIX(abc--)

abc--    vals
xyz      xyz
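A plain event search against the literal term should now hit the lexicon as well, e.g.:

index=main TERM(abc--xyz)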

 

As always, we should ask ourselves if changing the default behavior is both required and desired. Isolating the segmentation settings by source or sourcetype will help mitigate risk.


PavelP
Motivator

Hey folks, breaking news for the TERM/PREFIX enthusiasts! Brace yourselves – our TERM searches cannot find punycode-encoded domains!

🙂

xn--bcher-kva.de

https://en.m.wikipedia.org/wiki/Punycode
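To spell it out: with the default segmenters, the -- in the ACE prefix is a major breaker, so TERM(xn--bcher-kva.de) can never match; the domain is indexed roughly as the separate terms xn and bcher-kva.de. A possible workaround sketch (untested; the index name is just an example) is to filter on the indexed sub-token and then match the full string:

index=web TERM(bcher-kva.de) "xn--bcher-kva.de"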

 

0 Karma

tscroggins
Influencer

Now that is a valid use case for modifying segmentation; however, the impact is wide-reaching. You may also want to look at setting INTERMEDIATE_MAJORS = true, although that could result in a significant indexing performance impact.

Which access log formats and source types do you most commonly use?

PickleRick
SplunkTrust

+1 on that. The impact is limited to where you use the custom segmenter (you set it for a specific props stanza).

0 Karma

PavelP
Motivator

Affected are tstats/TERM/PREFIX and accelerated DM searches. This isn't limited to punycode domains; any value with continuous hyphens may be affected. Consider usernames, user-agents, URL paths and queries, file names, and file paths – the range of affected fields is extensive.

The implications extend to premium apps like Enterprise Security, heavily reliant on accelerated DMs. Virtually every source and sourcetype could be impacted, including commonly used ones like firewall, endpoint, windows, proxy, etc.

Here are a couple of examples to illustrate the issue:

  1. Working URL: hp--community.force.com
  2. Path: /tmp/folder--xyz/test-----123.txt, c:\Windows\Temp\test---abc\abc--123.dat
  3. Username: admin--haha
  4. User-Agent: Mozilla/5.0--findme
0 Karma

PavelP
Motivator

Please consider upvoting a new Splunk idea to get it more attention: https://ideas.splunk.com/ideas/EID-I-2226

0 Karma

tscroggins
Influencer

A custom segmenter has merit in this case, but globally, folks will recommend tagging events with an appropriate add-on (or custom configuration) and using an accelerated Web or other data model to find matching URLs, hostnames, etc.
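For illustration, the kind of search meant here runs against an accelerated CIM Web data model rather than against raw terms; a sketch (untested, and assuming the Web data model is populated and accelerated):

| tstats count from datamodel=Web where Web.url="*xn--*" by Web.url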

0 Karma

PavelP
Motivator

Affected are tstats/TERM/PREFIX and accelerated DM searches. This isn't limited to punycode domains; any domain with continuous hyphens may be affected. Consider usernames, user-agents, URL paths and queries, file names, and file paths – the range of affected fields is extensive.

The implications extend to premium apps like Enterprise Security, heavily reliant on accelerated DMs. Virtually every source and sourcetype could be impacted, including commonly used ones like firewall, endpoint, windows, proxy, etc.

Here are a couple of examples to illustrate the issue:

  1. Working URL: https://hp--community.force.com
  2. Path: /tmp/folder--xyz/test-----123.txt, c:\Windows\Temp\test---abc\abc--123.dat
  3. Username: admin--haha
  4. User-Agent: Mozilla/5.0--findme
0 Karma

tscroggins
Influencer

I missed your comment re: accelerated data models earlier. The field values should be available at search time, either from _raw or tsidx, and then stored in the summary index. Off the top of my head, I don't know if the segmenters impact INDEXED_EXTRACTIONS = w3c, but they shouldn't impact transforms-based indexed extractions or search-time field extractions from other source types.
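For example, a search-time extraction pulls the full value from _raw regardless of how the event was segmented in the lexicon; a minimal sketch with hypothetical index, sourcetype, and field names:

index=main sourcetype=test_tokens
| rex field=_raw "(?<token>\w+--\w+)"
| stats count by token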

0 Karma

PavelP
Motivator

Affected are tstats/TERM/PREFIX searches and accelerated DM searches. I haven't conducted a thorough check yet, but it seems that searches on accelerated DM may overlook fields with double dashes. This isn't limited to punycode domains; any field value with continuous hyphens may be affected. Consider usernames, user-agents, URL paths and queries, file names, and file paths – the range of affected fields is extensive.

The implications extend to premium apps like Enterprise Security, heavily reliant on accelerated DMs. Virtually every source and sourcetype could be impacted, including commonly used ones like firewall, endpoint, windows, proxy, etc.

Here are a couple of examples to illustrate the issue:

  1. Working URL: https://hp--community.force.com
  2. Path: /tmp/back--door/test-----backdoor.txt, c:\Windows\Temp\back--door\test---backdoor.exe
  3. Username: admin--backdoor
  4. User-Agent: Mozilla/5.0--backdoor
0 Karma


PickleRick
SplunkTrust

Nice one. I even checked the specs for segmenters.conf and, while I noticed the single dash as a minor segmenter, I completely missed the double dash. (Though it is "hidden" relatively far into the default declaration and surrounded by all those other entities.)

tscroggins
Influencer

Somewhere in Splunk history, there's a developer who did the lexicographically correct thing knowing it would stymie future Splunkers. Let's raise a glass to the double oblique hyphen (thanks, Wikipedia)!

0 Karma

PickleRick
SplunkTrust

Double oblique hyphen is U+2E17 and looks like this: ⸗

0 Karma

tscroggins
Influencer

It was just a Wikipedia joke: "In Latin script, the double hyphen is a punctuation mark that consists of two parallel hyphens. It was a development of the earlier double oblique hyphen ...." I'm assuming an early developer analyzed a suitable corpus of log content and determined a double hyphen or long dash should be considered a major breaker.

0 Karma

PickleRick
SplunkTrust

Well, a double hyphen is really a poor man's approximation of an em dash or en dash, and I don't recall seeing them outside of TeX sources, so I was pretty surprised to find it in segmenters.conf.

Anyway, punctuation is not part of the script. Many languages using Latin script use (slightly) different punctuation systems, and languages using different scripts (like Cyrillic) use very similar punctuation 😛

But we're drifting heavily off-topic.

0 Karma

isoutamo
SplunkTrust
Nice explanation and nice way to get values to work with tstats!

PickleRick
SplunkTrust

That is interesting. I haven't had the opportunity to test it, but if that's the case, it sounds like support case material.

See my other reply.

0 Karma