Solved: Search capabilities of splunk - How powerful is it really?

wajihullahbaig · 03-22-2012

I am new to Splunk, just three days in. I have been using Lucene to index and search raw data, both fielded and unfielded, and I am very impressed with Lucene's search performance. I was hoping the experienced community here could guide me on a few of Splunk's capabilities, specifically in comparison with what I already know about Lucene, and not limited to search.

How does Splunk handle stop words, i.e. very common words such as "a", "the", "is", which we can supply manually to Lucene?
Does Splunk perform wildcard searches, proximity searches, and regex searches? I know it can do fielded searches.
What optimizations does it apply to its indexes, especially compression?
Is it possible to do fuzzy and synonym-based searches in Splunk?

I know this is a lengthy question, but I would definitely like to hear some points from people experienced with Splunk.

Thank you.

Stephen_Sorkin · 03-22-2012

This is potentially a very long discussion of the differences between Splunk, which seeks to index time-series, machine-generated data, and Lucene, which was originally designed to index human-generated text documents. We can begin with your questions.

Splunk has no notion of stop words. By default, Splunk indexes all keywords found in events, as defined by the segmentation rules.
Splunk provides wildcard searches and phrase searches, but the index doesn't provide native proximity searches or regex searches. For those, we rely on subsequent commands in the search processing pipeline.
Splunk aggressively compresses the rawdata we store, and we spend a lot of effort to make the indexes as small as possible, by means of explicit compression and other low-footprint data structures. Typically, you can expect the rawdata to be about 10% of the size of the original data and the indexes to be 20-40% of the size of the original data, depending on entropy. Together, Splunk typically requires 30-50% of the size of the original raw data as storage.
The index itself doesn't provide synonym support, since that's fundamentally a problem for human text. However, we provide an analogous concept in eventtypes, which can be used to represent meaningful classes of queries, including synonyms.
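
To make the storage figures above concrete, here is a rough back-of-the-envelope sketch in Python. It only applies the ratios quoted in this answer (rawdata at roughly 10% of the original size, indexes at 20-40%); the function name and parameters are made up for illustration and are not part of any Splunk tool or API. Real ratios depend on the entropy of your data, so treat the result as an estimate, not a sizing guarantee.

```python
def estimate_splunk_storage(original_gb, rawdata_ratio=0.10,
                            index_ratio_range=(0.20, 0.40)):
    """Estimate on-disk storage from the rule-of-thumb ratios in this thread.

    original_gb        -- size of the original, uncompressed data in GB
    rawdata_ratio      -- assumed compressed rawdata size as a fraction of the original
    index_ratio_range  -- assumed (low, high) index size as a fraction of the original

    Returns (rawdata_gb, total_low_gb, total_high_gb).
    """
    if original_gb < 0:
        raise ValueError("original_gb must be non-negative")

    rawdata_gb = original_gb * rawdata_ratio
    index_low_gb = original_gb * index_ratio_range[0]
    index_high_gb = original_gb * index_ratio_range[1]

    # Total storage is the compressed rawdata plus the index files.
    return rawdata_gb, rawdata_gb + index_low_gb, rawdata_gb + index_high_gb


if __name__ == "__main__":
    rawdata, low, high = estimate_splunk_storage(100)
    print(f"rawdata: {rawdata:.1f} GB, total: {low:.1f}-{high:.1f} GB")
```

For 100 GB of original data this reproduces the 30-50% range stated above: about 10 GB of rawdata plus 20-40 GB of indexes.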

