Remove segmenters in search (lispy) for Norwegian characters

hettervik
Builder

Hi folks. Whenever you run a search in Splunk you can review the lispy in search.log. For example, if I search for my own username in the main index, the search would look like this index=main hettervi while the lispy would look like this [AND index::main hettervi]. However, when I use the Norwegian characters æ, ø and å, the words get segmented in the lispy. For example, if I search for the (fictional) Norwegian name "Hælgøvoll" the search would look like this index=main hælgøvoll, but the lispy would look like this [AND index::main h lg voll æ ø]. See the problem?

I've looked through the documentation for segmenters.conf, but as far as I can see there is no mention of Norwegian characters. Anyone got any tips for how to unlist the Norwegian characters as breakers, both at index time and in search time?

hettervik
Builder

I've looked into the case some more. An interesting observation is that searching for TERM(hælgøvoll) or TERM(h*lg*voll) gives no results. This led me to believe that the Norwegian characters æ, ø and å are defined as major breakers. However, if this were the case, they wouldn't be listed in the lispy as shown in my initial question. The only explanation I can come up with for the observed behavior is that there are some "hidden" major breakers before and after the Norwegian characters æ, ø and å. I'm not sure whether my assumption is correct, or whether this is a bug or a feature.
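
To make the "hidden breakers" hypothesis concrete, here is a small Python sketch. It is not Splunk code and says nothing certain about how Splunk segments internally. It simply assumes that every non-ASCII character is cut out as its own token, and that the lispy lists the unique tokens in sorted order. The function name model_lispy_tokens is made up for illustration.

```python
import re


def model_lispy_tokens(term: str) -> list[str]:
    """Toy model of the hypothesis, NOT Splunk's real segmenter.

    Assumption: each non-ASCII character behaves as if it had a breaker
    on both sides, so it becomes a token of its own, while runs of ASCII
    letters, digits and underscores stay together.
    """
    lowered = term.lower()
    # Either a run of ASCII word characters, or one single non-ASCII character.
    tokens = re.findall(r"[0-9a-z_]+|[^\x00-\x7f]", lowered)
    # Assumption: the lispy shows each token once, in sorted order.
    return sorted(set(tokens))


if __name__ == "__main__":
    print(" ".join(model_lispy_tokens("hælgøvoll")))
```

Under these assumptions the model produces the terms h, lg, voll, æ and ø for hælgøvoll, which are the same terms that show up in the lispy from my initial question. That is consistent with the hypothesis, but it does not prove that Splunk works this way.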

ddrillic
Ultra Champion

@hettervi, you need to look at the encoding. UTF-8, for example, as an implementation of Unicode, covers all known languages.

A good place to start is the Splunk documentation topic "Configure character set encoding".

hettervik
Builder

Perhaps. I'll look into it, though the problem isn't that the characters aren't supported, it is that the search head segments the searched words whenever the said characters occur. As far as I know, the generated lispy for a search isn't sourcetype dependent.

malvidin
Communicator

Yes, it appears that most (if not all) non-ASCII characters are major breakers.

The lispy I see for a simple search for тестирование is:

[ AND index::main а в е и н о р с т ]

This is a bigger issue if the data is ingested in ASCII JSON format.

{"data": "\u0442\u0435\u0441\u0442\u0438\u0440\u043e\u0432\u0430\u043d\u0438\u0435"}

If the data above is ingested, a search for data="тестирование" or "тестирование" will not find it. An initial search term like "u04*" must be included. A similar issue occurs when the raw JSON includes an escaped newline: a string like {"data": "line_one\nline_two"} cannot be found with a search for "line_two".
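
The escaping problem can be reproduced outside Splunk with a short Python sketch. It relies on json.dumps, which escapes non-ASCII characters by default, and on a deliberately simplified tokenizer that splits on every character that is not an ASCII letter, digit or underscore. That tokenizer is an assumption for illustration only, not Splunk's actual major/minor breaker rules, and the function name raw_ascii_tokens is hypothetical.

```python
import json
import re


def raw_ascii_tokens(value: str) -> tuple[str, list[str]]:
    """Serialize a value as ASCII-only JSON and list the ASCII tokens in it.

    Assumption: tokens are runs of ASCII letters, digits and underscores.
    This is a simplification and not Splunk's real segmentation.
    """
    # ensure_ascii=True is the default: non-ASCII becomes \uXXXX escapes,
    # and a real newline becomes the two characters backslash and n.
    raw = json.dumps({"data": value})
    tokens = re.findall(r"[0-9A-Za-z_]+", raw)
    return raw, tokens


if __name__ == "__main__":
    raw, tokens = raw_ascii_tokens("тестирование")
    print(raw)     # the escaped event text as it would sit in the raw data
    print(tokens)  # 'data' followed by tokens such as 'u0442', 'u0435', ...

    raw, tokens = raw_ascii_tokens("line_one\nline_two")
    print(raw)
    print(tokens)  # the escape's 'n' sticks to the following word
```

In this model the raw text never contains the word тестирование, only tokens that start with u04, which is why a term like "u04*" can match. Likewise the escaped newline leaves a token nline_two rather than line_two.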
