What is the purpose of these files? Some get to be quite large:
1179300775 Feb 9 16:45 merged_lexicon.lex
These files are part of the search index. They are mostly used to support typeahead. I would not consider them large. They are usually quite a bit smaller than the .tsidx files that constitute the main part of the index.
If you don't need typeahead and are looking to save some space on your Splunk partition, deleting these files can save you about 10% on your total index size.
Apparently they can take up anywhere from 5% to 20%.
I think there's been some optimization to the merged_lexicon files. They're currently under 5% for me.
In a bit more detail, a tsidx file consists of two parts: a lexicon and a set of postings. The lexicon is a list of terms in alphabetical order, each followed by a pointer to its posting list. The posting list for a term records which events (in the rawdata files) contain that term.
So essentially you have something like this:
tsidx file 1:
  lexicon:   a     b     c
             |     |     |
             v     v     v
  postings:  2 4 | 1 5 | 2

tsidx file 2 (smaller):
  lexicon:   d
             |
             v
  postings:  2 8
The lexicon tells us what terms exist and the postings tell us where to find them. However, we have to look in every tsidx file to find out all the terms. So if there are 20 tsidx files and you type in 'gromblhyozorktooks', which doesn't exist, splunkd has to open all 20 tsidx files to figure out you're crazy.
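To make the cost concrete, here is a minimal Python sketch of the layout described above. It is a toy model only, not Splunk's actual on-disk format: the `TsidxFile` class and `search_all` function are hypothetical names invented for illustration, and the data mirrors the diagram.

```python
from bisect import bisect_left


class TsidxFile:
    """Toy model of one tsidx file: a sorted lexicon plus one posting list per term."""

    def __init__(self, postings_by_term):
        # Lexicon: terms in alphabetical order.
        self.lexicon = sorted(postings_by_term)
        # Postings: event numbers for each term, in the same order as the lexicon.
        self.postings = [postings_by_term[t] for t in self.lexicon]

    def lookup(self, term):
        # Binary search the lexicon, then follow the "pointer" to the postings.
        i = bisect_left(self.lexicon, term)
        if i < len(self.lexicon) and self.lexicon[i] == term:
            return self.postings[i]
        return []


def search_all(files, term):
    """Without a merged lexicon, every file must be opened, even for a miss."""
    opened = 0
    hits = {}
    for name, tsidx in files.items():
        opened += 1
        events = tsidx.lookup(term)
        if events:
            hits[name] = events
    return hits, opened


files = {
    "tsidx1": TsidxFile({"a": [2, 4], "b": [1, 5], "c": [2]}),
    "tsidx2": TsidxFile({"d": [2, 8]}),
}

print(search_all(files, "b"))
print(search_all(files, "gromblhyozorktooks"))
```

In this sketch the nonexistent term still costs one lookup per file, which is the 20-file problem described above in miniature.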
The merged_lexicon.lex is just a single file that contains all the lexicons, which are much smaller than the postings. It looks more like this:
a b c d
This allows typeahead to answer its questions much more quickly (what terms exist), and allows negative lookups to fail much faster. The typical case for this is that some buckets have your term, and some do not, so the merged lexicon allows buckets to be completely ruled out much faster.
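As a rough illustration of why this helps, the sketch below builds one merged, sorted term list and uses it for both an existence check and a prefix (typeahead) query. It is an assumption-laden toy, not how splunkd implements it: `build_merged_lexicon`, `term_exists`, and `typeahead` are hypothetical helper names, and the real file format is not modeled.

```python
from bisect import bisect_left


def build_merged_lexicon(lexicons):
    """Combine the per-file lexicons into one sorted, de-duplicated term list."""
    return sorted(set().union(*lexicons))


def term_exists(merged, term):
    """One binary search answers 'does this term exist anywhere?'"""
    i = bisect_left(merged, term)
    return i < len(merged) and merged[i] == term


def typeahead(merged, prefix, limit=10):
    """Return up to `limit` terms that start with `prefix`, in sorted order."""
    i = bisect_left(merged, prefix)
    matches = []
    while i < len(merged) and merged[i].startswith(prefix) and len(matches) < limit:
        matches.append(merged[i])
        i += 1
    return matches


# Lexicons of the two example tsidx files from the diagram.
merged = build_merged_lexicon([["a", "b", "c"], ["d"]])

print(merged)
print(term_exists(merged, "gromblhyozorktooks"))
print(typeahead(merged, "b"))
```

Here a miss is decided by a single search over one small list instead of a lookup in every tsidx file, which is the "fail faster" behavior described above.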
Isn't what you just described a bloom filter file rather than a lexicon?