
How to improve Hunk performance when accessing Hive tables with many small ORC files?

tsunamii
Path Finder

We are seeing excessively slow performance when accessing Hive tables with many small ORC files.

We are looking for ways to improve performance. From what we can see, Hunk is causing Hadoop to create many thousands of mappers, because each individual ORC file causes Hadoop to create a separate task for that file. It can take hours for even one panel on a dashboard to populate.

Does Hunk use the CombineFileInputFormat API? It seems this would reduce the number of mappers generated to complete a search.

1 Solution

rdagan_splunk
Splunk Employee

Hunk does not determine how many mappers are going to run. Hunk submits the job, and Hadoop determines how many mappers (map task attempts) run. As you are seeing, Hadoop creates a new map task for each ORC file.
A few options to fix this issue (a provider configuration sketch follows this list):
1) Ask the people who create the ORC files to make them larger (for example, 127 MB per file or larger).
2) Lower the maxsplits setting. By default vix.splunk.search.mr.maxsplits = 10000, which means Hunk processes up to 10,000 ORC files per job. Lowering this value, say to 5,000, will create more jobs, but each job will process fewer files, which lowers the overhead of the individual Hunk map jobs.
3) You can set any Hadoop client flag as long as you prefix it with vix. (for example, vix.mapreduce.job.jvm.numtasks = 100).
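For reference, these settings go in the virtual index provider stanza of indexes.conf on the Hunk search head. The sketch below is a minimal, assumed example: the provider name, paths, and namenode host are placeholders, and the only lines specific to this issue are the two flags discussed above.

[provider:MyHadoopProvider]
vix.family = hadoop
# Placeholder paths and hostnames - replace with your environment's values
vix.env.JAVA_HOME = /usr/java/latest
vix.env.HADOOP_HOME = /usr/lib/hadoop
vix.fs.default.name = hdfs://namenode:8020
vix.splunk.home.hdfs = /user/splunk/hunk
# Option 2: cap the number of splits (ORC files) processed per MapReduce job
vix.splunk.search.mr.maxsplits = 5000
# Option 3: any Hadoop client property passes through when prefixed with vix.
vix.mapreduce.job.jvm.numtasks = 100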


samnik60
New Member

Hunk does have control over the number of mappers: the InputFormat it uses determines the number of splits, which in turn determines the number of mappers. Judging from the Splunk source code, it does not have CombineFileInputFormat support, so unless Hunk adopts it in its code, we won't be getting this feature.

Hunk should seriously consider adding this feature, as small files are unavoidable when we batch-load at short intervals. The small-file problem has already been solved in Hadoop with CombineFileInputFormat and in Hive with CombineHiveInputFormat (see the illustration below). It should be simple and optimal for Hunk to adopt this.
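For context, this is roughly how a plain Hive session uses CombineHiveInputFormat to pack many small ORC files into fewer, larger splits. It is only an illustration of the Hive-side behavior referred to above, not a setting Hunk is known to honor, and the table name and split size are placeholder example values.

-- Enable combining of small files into larger input splits (Hive-side example)
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Upper bound on the combined split size: 256 MB (example value)
SET mapreduce.input.fileinputformat.split.maxsize=268435456;
-- my_small_orc_table is a placeholder table name
SELECT count(*) FROM my_small_orc_table;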

Preparing the data as larger files up front is not a good option.


