
Hunk - Data Model Acceleration - Parquet files getting deleted

prvnks
New Member

I was trying out data model acceleration with Hunk (latest version). This is how my datamodels.conf looks:

cat etc/apps/search/local/datamodels.conf
[LVSMC]
acceleration = 1
acceleration.earliest_time = -1d
acceleration.hunk.compression_codec = snappy
acceleration.hunk.dfs_block_size = 134217728
acceleration.hunk.file_format = parquet
acceleration.manual_rebuilds = 0

It starts accelerating, but the parquet-snappy files get deleted after collecting for around 10-20 minutes. Suddenly the parquet files disappear. Maybe summary creation is dropping these newly created files.

$ date;/usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
Wed Jul  6 08:46:56 PDT 2016
0  0  /abcd/SplunkMR/datamodel/70F888CB-CA73-4A97-B54F-6B0ACA9A4E7E_DM_search_test
$ date;/usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
Wed Jul  6 09:04:40 PDT 2016
2.5 G  7.4 G  /abcd/SplunkMR/datamodel/70F888CB-CA73-4A97-B54F-6B0ACA9A4E7E_DM_search_test
$ date;/usr/bin/hadoop fs -du -h /abcd/SplunkMR/datamodel
Wed Jul  6 09:05:47 PDT 2016
75.4 M  226.2 M  /abcd/SplunkMR/datamodel/70F888CB-CA73-4A97-B54F-6B0ACA9A4E7E_DM_search_test

I also tried playing around with other options, but it did not help:

acceleration.max_time
acceleration.backfill_time
acceleration.manual_rebuilds
acceleration.max_concurrent

Please note that our Hunk deployment needs around 8 hours to process an entire day's data when no other queries are running. I don't know how to catch up and make Hunk accelerate the data model for one day of data. Is there some switch I can use to retain the parquet-snappy files? I tried adjusting earliest_time and backfill_time (much shorter than earliest_time), but it did not help.
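
For reference, one of the variants I tried looked roughly like this (illustrative values, not my exact settings):

[LVSMC]
acceleration = 1
acceleration.earliest_time = -1d
# backfill window kept much shorter than earliest_time
acceleration.backfill_time = -4h
acceleration.manual_rebuilds = 0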

Please let me know where it could be going wrong.


hsesterhenn_spl
Splunk Employee

Hi,

This is very old stuff, but it might still be a current problem...

Have you ever tried to switch the file format from "parquet" to "orc"?

parquet-hive-bundle-1.6.0.jar is broken:
https://issues.apache.org/jira/browse/PARQUET-246

It looks like they fixed it in 1.8.0, which has never been shipped with Splunk Core...

I have done my tests with Hadoop DMA using ORC...
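
In your datamodels.conf stanza that would just be a change to the file format setting, roughly like this (keep the rest of your settings; you may want to double-check which compression codecs your Hunk version supports for ORC):

[LVSMC]
acceleration.hunk.file_format = orc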

Worth a try?

Good luck,

Holger


rdagan_splunk
Splunk Employee

Here is the expected behavior. Do you see this behavior or something different?

Every 5 minutes, the DMA (Data Model Acceleration) summaries are updated.
Every 30 minutes, all DMA files that are no longer valid are deleted.

See the details in http://docs.splunk.com/Documentation/Splunk/6.4.1/Knowledge/Acceleratedatamodels, in the section "After you enable acceleration for a data model", and note the delete action we run every 30 minutes.
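
If you want to check whether the deletions line up with that 30-minute maintenance cycle, you can watch the modification times on the summary directory, for example (using the path from your output):

$ date; /usr/bin/hadoop fs -ls /abcd/SplunkMR/datamodel/70F888CB-CA73-4A97-B54F-6B0ACA9A4E7E_DM_search_test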


kschon_splunk
Splunk Employee

In your limits.conf file, try setting maintenance_period to something larger than the default, which is 1800 (i.e. 30 minutes). That default seems like it would explain the 10-20 minute lifespan you're seeing. If you change it to, say, 5400 (i.e. 90 minutes), do the DM files last longer? This won't fix your problem, but it will help narrow down what is happening.
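
A minimal sketch of that change in limits.conf (I believe maintenance_period sits under the [auto_summarizer] stanza, but please verify against the limits.conf spec for your version):

[auto_summarizer]
# default is 1800 seconds (30 min); raise it to see if the DM files survive longer
maintenance_period = 5400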

0 Karma