Knowledge Management

Hunk - Data Model Acceleration - Parquet files getting deleted

prvnks
New Member

I was trying out datamodel acceleration with Hunk (latest version). This is how my datamodel.conf looks:

cat etc/apps/search/local/datamodels.conf
[LVSMC]
acceleration = 1
acceleration.earliest_time = -1d
acceleration.hunk.compression_codec = snappy
acceleration.hunk.dfs_block_size = 134217728
acceleration.hunk.file_format = parquet
acceleration.manual_rebuilds = 0

It starts accelerating. But, the parquet-snappy files get deleted after collecting for around 10-20 mins. Suddenly, the parquet files disappears. May be summary creation is dropping this newly created files.

$ date;/usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
Wed Jul  6 08:46:56 PDT 2016
0  0  /abcd/SplunkMR/datamodel/70F888CB-CA73-4A97-B54F-6B0ACA9A4E7E_DM_search_test
$ date;/usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
Wed Jul  6 09:04:40 PDT 2016
2.5 G  7.4 G  /abcd/SplunkMR/datamodel/70F888CB-CA73-4A97-B54F-6B0ACA9A4E7E_DM_search_test
$ date;/usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
Wed Jul  6 09:05:47 PDT 2016
75.4 M  226.2 M  /abcd/SplunkMR/datamodel/70F888CB-CA73-4A97-B54F-6B0ACA9A4E7E_DM_search_test

I tried to play around with other options. It did not help.

acceleration.max_time
acceleration.backfill_time
acceleration.manual_rebuilds
acceleration.max_concurrent

Pls note that our Hunk would require around 8 hours to process entire day’s data when no other queries are fired. I don’t know how to catch up and make Hunk accelerate datamodel for 1 day data. Is there some switch that I can use to retain the parquet-snappy files? I tried to adjust earliest_time and backfill_time(much shorter than earliest_time). It did not help.

Pls let me know where it could be going wrong.

0 Karma

hsesterhenn_spl
Splunk Employee
Splunk Employee

Hi,

very old stuff but might be still a current problem...

Have you ever tried to switch the file format from "parquet" to "orc"?

parquet-hive-bundle-1.6.0.jar is f-uped...
https://issues.apache.org/jira/browse/PARQUET-246

Looks like they fixed it in 1.8.0 which has never been shipped by Splunk Core...

I have done my tests with Hadoop DMA using ORC...

Worth a try?

Good luck,

Holger

0 Karma

rdagan_splunk
Splunk Employee
Splunk Employee

Here is the Expected behavior – Do you see this behavior or something different?
Every 5 minutes update the DMA (Data Model Acceleration)
Every 30 minutes delete all the DMA files that are no longer valid ..
See details: http://docs.splunk.com/Documentation/Splunk/6.4.1/Knowledge/Acceleratedatamodels — After you enable acceleration for a data model
Look at the delete action we do every 30 minutes.

0 Karma

kschon_splunk
Splunk Employee
Splunk Employee

In your limits.conf file, try setting maintenance_period to something larger than the default, which is 1800 (i.e. 30 min). That default seems like it would explain the 10-20 minute lifespan you're seeing. If you change it to, say, 5400 (i.e. 90 min) do the DM files last longer? This won't fix your problem, but will help narrow down what is happening.

0 Karma
Get Updates on the Splunk Community!

More Ways To Control Your Costs With Archived Metrics | Register for Tech Talk

Tuesday, May 14, 2024  |  11AM PT / 2PM ET Register to Attend Join us for this Tech Talk and learn how to ...

.conf24 | Personalize your .conf experience with Learning Paths!

Personalize your .conf24 Experience Learning paths allow you to level up your skill sets and dive deeper ...

Threat Hunting Unlocked: How to Uplevel Your Threat Hunting With the PEAK Framework ...

WATCH NOWAs AI starts tackling low level alerts, it's more critical than ever to uplevel your threat hunting ...