Monitoring Splunk

What are the merged_lexicon.lex files in my buckets?

Chris_R_
Splunk Employee

What is the purpose of these files? Some get to be quite large:

1179300775 Feb 9 16:45 merged_lexicon.lex

1 Solution

gkanapathy
Splunk Employee

These files are part of the search index. They are mostly used to support typeahead. I would not consider them large. They are usually quite a bit smaller than the .tsidx files that constitute the main part of the index.

the_wolverine
Champion

If you don't need typeahead and are looking to save some space on your Splunk partition, deleting these files can save you about 10% on your total index size.

the_wolverine
Champion

Apparently they can take up anywhere from 5% to 20% of the total index size.

jrodman
Splunk Employee

I think there's been some optimization to the merged_lexicon files. They're currently under 5% for me.
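
If you want to check the share on your own system rather than rely on these estimates, a small script along these lines could add it up. This is only a sketch, not a Splunk tool: `lexicon_share` and `report` are hypothetical names, and the directory you pass in is assumed to be the top of one index's bucket tree. It only reads file sizes and does not delete anything.

```python
import os


def lexicon_share(index_dir: str) -> tuple[int, int]:
    """Return (bytes in merged_lexicon.lex files, total bytes) under index_dir."""
    lex_bytes = 0
    total_bytes = 0
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            total_bytes += size
            if name == "merged_lexicon.lex":
                lex_bytes += size
    return lex_bytes, total_bytes


def report(index_dir: str) -> str:
    """Format the share of merged_lexicon.lex files as a percentage."""
    lex, total = lexicon_share(index_dir)
    pct = 100.0 * lex / total if total else 0.0
    return f"merged_lexicon.lex: {lex} of {total} bytes ({pct:.1f}%)"
```

Calling `print(report(path))` with the path to an index directory prints one summary line for that tree.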

jrodman
Splunk Employee

In a bit more detail, a tsidx file consists of two parts: a lexicon and a set of postings. The lexicon is a list of terms in alphabetical order, each followed by a pointer to its posting list. The posting list maps that term to the events (in the rawdata files) that contain it.

So essentially you have something like this:

tsidx file 1:
lexicon:  a  b  c
          |  |  |
          |  |  +-+
          |  ++   |
          v   v   v
postings: 2 4|1 5|2

tsidx file 2: (smaller)
lexicon:  d
          |
          v
postings: 2 8

The lexicon tells us what terms exist and the postings tell us where to find them. However, we have to look in every tsidx file to find out all the terms. So if there are 20 tsidx files and you type in 'gromblhyozorktooks', which doesn't exist, splunkd has to open all 20 tsidx files to figure out you're crazy.
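
To make that cost concrete, here is a toy Python sketch of the lookup. It is only a model under simplifying assumptions: each tsidx file is represented as a sorted term list plus a parallel list of posting lists, and the names `TsidxFile`, `lookup`, and `search_all` are made up for illustration. It does not reflect the real on-disk format.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class TsidxFile:
    """Toy stand-in for one tsidx file (not the real on-disk format)."""
    lexicon: list[str]          # terms in sorted order
    postings: list[list[int]]   # postings[i] = event ids containing lexicon[i]

    def lookup(self, term: str) -> list[int]:
        # Binary search the sorted lexicon, then follow the "pointer".
        i = bisect_left(self.lexicon, term)
        if i < len(self.lexicon) and self.lexicon[i] == term:
            return self.postings[i]
        return []


def search_all(files: list[TsidxFile], term: str) -> tuple[list[list[int]], int]:
    """Return the per-file hits and how many files had to be consulted."""
    opened = 0
    hits = []
    for f in files:
        opened += 1              # every file is consulted, hit or miss
        hits.append(f.lookup(term))
    return hits, opened


# The two files from the diagram above.
file1 = TsidxFile(["a", "b", "c"], [[2, 4], [1, 5], [2]])
file2 = TsidxFile(["d"], [[2, 8]])

hits, opened = search_all([file1, file2], "gromblhyozorktooks")
print(hits, opened)   # [[], []] 2
```

Even for a term that exists nowhere, the loop touches every file before it can report a miss.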

The merged_lexicon.lex is just a file that contains all the lexicons, which are much smaller than the full tsidx files. It looks more like this:

a b c d 

This allows typeahead to answer its questions much more quickly (what terms exist), and allows negative lookups to fail much faster. The typical case for this is that some buckets have your term, and some do not, so the merged lexicon allows buckets to be completely ruled out much faster.
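
A rough Python sketch of why that helps. This is an illustration, not Splunk's implementation: a bucket's merged lexicon is modeled as one sorted list of terms, and `build_merged_lexicon`, `has_term`, and `typeahead` are hypothetical helper names.

```python
from bisect import bisect_left


def build_merged_lexicon(lexicons: list[list[str]]) -> list[str]:
    """Union the per-file term lists into one sorted, de-duplicated list."""
    return sorted(set().union(*lexicons))


def has_term(merged: list[str], term: str) -> bool:
    """One binary search answers: does any tsidx file here contain this term?"""
    i = bisect_left(merged, term)
    return i < len(merged) and merged[i] == term


def typeahead(merged: list[str], prefix: str, limit: int = 10) -> list[str]:
    """Return up to `limit` terms that start with `prefix`, in sorted order."""
    i = bisect_left(merged, prefix)
    out = []
    while i < len(merged) and len(out) < limit and merged[i].startswith(prefix):
        out.append(merged[i])
        i += 1
    return out


merged = build_merged_lexicon([["a", "b", "c"], ["d"]])
print(merged)                                  # ['a', 'b', 'c', 'd']
print(has_term(merged, "gromblhyozorktooks"))  # False
```

In this model a miss costs a single search of one list, so the bucket can be ruled out without opening any of its tsidx files, and the same sorted list serves prefix queries for typeahead.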

pradeepkr13
Engager

Isn't what you just described a bloomfilter file, and not a lexicon?
