Why is Hunk slower than Splunk when it has more nodes?

jaredlaney
Contributor

(Hunk 6.2 - Am I doing something wrong?)

I can store an equivalent amount of data in Splunk with three indexers and it is much faster than using Hunk with 10 data nodes and the same amount of data.

1 Solution

hsesterhenn
Path Finder

Hi,

there is a very simple explanation. HUNK and Splunk Enterprise are two very different things. 🙂

They have a common UI and almost the same search language, but the storage tier is index-based for Splunk Enterprise and HDFS for HUNK.

Splunk Enterprise indexes every event: it splits events into tokens (words), creates a dictionary entry for each token, and stores the token together with a pointer to the raw event it was taken from.
http://wiki.splunk.com/Community:HowIndexingWorks

If you search for, e.g., "error", Splunk Enterprise just does a lookup in the dictionary and extracts the matching events from the gzipped bucket.
So it's very easy to do a "needle in a haystack" search.
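
To make the dictionary idea concrete, here is a minimal sketch in Python. It is only an illustration of an inverted index in general, not Splunk's actual on-disk format; the function names and sample events are made up for this example.

```python
from collections import defaultdict

def build_index(events):
    """Map each lowercased token to the positions of the events containing it."""
    index = defaultdict(set)
    for pos, event in enumerate(events):
        for token in event.lower().split():
            index[token].add(pos)
    return index

def search(index, events, term):
    """Look the term up in the dictionary and fetch only the matching events."""
    return [events[pos] for pos in sorted(index.get(term.lower(), ()))]

# Hypothetical sample events, invented for this sketch.
events = [
    "2015-09-10 10:00:01 INFO service started",
    "2015-09-10 10:00:05 ERROR disk full",
    "2015-09-10 10:00:09 INFO request served",
]
index = build_index(events)
matches = search(index, events, "error")
print(matches)
```

The search touches only the dictionary entry for "error" and the one event it points to; the other events are never read.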

HUNK works with the Hadoop MapReduce framework. Data is stored on Hadoop data nodes (comparable to Splunk indexers), which provide an interface to the HDFS file system. If you want to search the data, you usually use a programming language like Java (or other frameworks built on top of the original Hadoop Java libraries) and create so-called "MapReduce" jobs, which are run by the data nodes.

This is a very powerful concept because the data is not moved to the compute nodes; instead, the computation is executed where the data is stored.

There is a disadvantage if you just search for a single word like "error". HDFS does not use an index per se. There are add-ons to Hadoop that provide a SQL interface, but usually your MapReduce jobs have to do a raw search and scan every single event, whether or not it contains the word "error". It's like a brute-force attack to get a password... 🙂

So instead of using an index and reading just the 5 events with "error", as Splunk Enterprise does, HUNK has to search your 1 TB of data spread over 10 Hadoop data nodes to return those 5 events... this takes much more compute power and time.
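
As a rough sketch of that brute-force behaviour (plain Python standing in for the map step, not real Hadoop or HUNK code; the splits and function names are invented for illustration), note how every event is read even though only one matches:

```python
def scan_split(split, term):
    """Map step: read every event in one split, keep those containing the term."""
    hits = [event for event in split if term in event.lower()]
    return hits, len(split)

def brute_force_search(splits, term):
    """Run the map step on every split and merge the results."""
    all_hits, events_read = [], 0
    for split in splits:
        hits, read = scan_split(split, term)
        all_hits.extend(hits)
        events_read += read
    return all_hits, events_read

# Hypothetical data: three splits, as if stored on three data nodes.
splits = [
    ["INFO started", "INFO ok"],
    ["ERROR disk full", "INFO ok"],
    ["INFO ok", "INFO done"],
]
hits, events_read = brute_force_search(splits, "error")
print(hits, events_read)
```

Adding nodes lets the splits be scanned in parallel, but the total amount of data read stays the same, which is why a rare-term search does not benefit the way an index lookup does.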

So in the end you are comparing apples and pears if you ask "why is HUNK so much slower than Splunk"... they are for different use cases. The really interesting thing is that you can combine Splunk indexes and HUNK virtual indexes: search data in real time with Splunk, and use the data stored on HDFS for analytics use cases.

There are hundreds of articles about Hadoop. I like this video on YouTube:
https://www.youtube.com/watch?v=xYnS9PQRXTg
It's about 2 years old but a good start to understand basic concepts.

There are nice videos regarding HUNK, e.g.:
http://de.splunk.com/view/SP-CAAAM99

Hope this helps a little bit.

Greetings,

Holger

Ledion_Bitincka
Splunk Employee

All great comments. I just want to highlight that Hunk is not optimized for search (for many reasons) but rather for analytics workloads; in many cases the analytics performance can be comparable to that of Splunk. That said, if your use case is primarily search, why not send the data to Splunk indexers?

There are also ways to organize and lay out the data hierarchically to further optimize Hunk performance, e.g.:
/some/base/path/host=abc/20150910/...
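
To illustrate why such a layout helps, here is a small Python sketch of pruning by path. It is not Hunk's implementation, only the general idea: if a field value is encoded in the directory name, whole directories can be skipped without reading any data. The file name `events.log`, the `host=xyz` directory, and the function names are assumptions made up for this example.

```python
def fields_from_path(path):
    """Collect field=value pairs that appear as path segments."""
    fields = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            fields[key] = value
    return fields

def prune(paths, **wanted):
    """Keep only paths whose embedded fields do not contradict the filter."""
    kept = []
    for path in paths:
        fields = fields_from_path(path)
        # A path with no value for a field cannot be ruled out, so it is kept.
        if all(fields.get(key, value) == value for key, value in wanted.items()):
            kept.append(path)
    return kept

# Hypothetical file listing following the layout above.
paths = [
    "/some/base/path/host=abc/20150910/events.log",
    "/some/base/path/host=xyz/20150910/events.log",
]
kept = prune(paths, host="abc")
print(kept)
```

A search restricted to host=abc then only has to scan the files under that one directory.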

jaredlaney
Contributor

@Ledion - Would you have to separate every vix into appropriate hdfs/s3 paths?

Ledion_Bitincka
Splunk Employee

No, just one. We automatically recognize field=value pairs in paths, and you can look at the docs here for how to configure Hunk to recognize timestamps. Also, you can look at this blog post for a bit more info.

kschon_splunk
Splunk Employee

Holger's answer is completely correct. A few other things to keep in mind:

--There is overhead to starting up MR jobs, meaning that they may take quite some time to be queued, distributed, and completed apart from the actual work they do. This may be less than a minute, but it might be much longer if other jobs are already queued. So Hunk queries cannot be as interactive as fast Splunk Enterprise queries are.

--In some cases, queries that take a long time to run on Splunk Enterprise may be faster on Hunk, especially if many nodes are available. This depends on the query (is it a rare term search, or something that requires reading all the data?), how the indexes are managed (are you using Report Acceleration, or Data Model Acceleration?), and the format of the data (is it in a splittable format, and what is the replication factor?).
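
A back-of-the-envelope way to see both points is a fixed startup cost plus a scan that is split across nodes. The Python sketch below is a toy model only; the function name and every number in it (30 seconds of startup, 0.1 GB/s per node) are made-up assumptions, not measurements of Hunk or Hadoop.

```python
def estimated_seconds(data_gb, nodes, gb_per_sec_per_node, startup_sec):
    """Fixed job startup cost plus a full scan split evenly across nodes."""
    return startup_sec + data_gb / (nodes * gb_per_sec_per_node)

# All values below are hypothetical, chosen only to show the shape of the trade-off.
small = estimated_seconds(data_gb=1, nodes=10, gb_per_sec_per_node=0.1, startup_sec=30)
large = estimated_seconds(data_gb=1000, nodes=10, gb_per_sec_per_node=0.1, startup_sec=30)
more_nodes = estimated_seconds(data_gb=1000, nodes=40, gb_per_sec_per_node=0.1, startup_sec=30)
print(small, large, more_nodes)
```

In this model the small query is almost entirely startup overhead, so extra nodes barely help it, while the large full-scan query shrinks roughly in proportion to the number of nodes.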

jaredlaney
Contributor

I don't think we have a problem with that. Right now, we're only running a few select jobs in dev and mostly one at a time. Good point, though.

jaredlaney
Contributor

Holger-

Thanks for the response. Is there a Hunk interface for HBase or Cassandra so the data can be indexed?

tvu_splunk
Splunk Employee

We support both as external resource streaming libraries.
http://docs.splunk.com/Documentation/Hunk/latest/Hunk/StreamingLibraries

There's also an app on Splunkbase for Cassandra:
https://splunkbase.splunk.com/app/2668/

hsesterhenn
Path Finder

I haven't tried it myself but maybe because HBase provides a SQL interface the DB Connect App might help?!?!

Looks like an option:

http://answers.splunk.com/answers/173460/does-hunk-read-data-from-hbase.html

HTH,

Holger
