Getting Data In

Splunk archive app: need advice on script to clean up Hadoop data

tsunamii
Path Finder

We are now using Splunk archiving. I understand that there is no built-in mechanism to delete Splunk data once it has been archived to Hadoop. I would like to write a general script that deletes archived data based on date (e.g. we might want to delete data older than 60 days).

Here is a sample archived directory listing; I want to identify and delete the directories that are older than n days. The directory names contain timestamps. Would I recurse down to the directory that contains the journal.gz, e.g. 1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/journal.gz, check the timestamps 1440973083 and 1439867820, and, if both are older than n days ago, delete that directory and its files (db_1440973083_1439867820_1/journal.gz etc.)? A rough sketch of the check I have in mind follows the listing below. Please advise.

    drwx------ - splunk splunk 0 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000 
    drwx------ - splunk splunk 0 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000 
    drwx------ - splunk splunk 0 2015-09-12 22:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1 
    -rw------- 3 splunk splunk 117 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/archive.content-md5_9ff9fb525c137adf5aac9184b62a22f2.receipt 
    -rw------- 3 splunk splunk 0 2015-09-12 22:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/archive.valid 
    -rw------- 3 splunk splunk 14002 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/bucket-metadata.seq 
    -rw------- 3 splunk splunk 2507794 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/journal.gz
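
To be concrete, here is roughly the check I have in mind (just a sketch in Python; the bucket name is copied from the listing above and the 60-day cutoff is only an example):

    # Sketch: pull both epoch timestamps out of the bucket directory name
    # (db_<latest>_<earliest>_<id>) and compare them against an n-day cutoff.
    import time

    bucket_dir = "db_1440973083_1439867820_1"   # taken from the listing above
    max_age_days = 60                           # example cutoff
    cutoff = time.time() - max_age_days * 86400

    latest, earliest = (int(x) for x in bucket_dir.split("_")[1:3])
    if latest < cutoff and earliest < cutoff:
        print("would delete %s and its files (journal.gz etc.)" % bucket_dir)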

kschon_splunk
Splunk Employee

Yes, that sounds basically correct. The timestamps in the directory name are [latest time]_[earliest time], so you only need to check the first one. Note that these times refer to the events in that bucket. The date the bucket was archived might have been significantly later. Also, you may want to delete any higher-level directories that are empty after you delete the buckets, both to conserve HDFS inodes, and to make Hunk split-generation marginally faster.
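
For illustration, the overall flow could look something like the untested sketch below. It shells out to the hdfs CLI (hdfs dfs -ls -R, -rm -r, -rmdir); the archive root and the 60-day cutoff are placeholders taken from your example, not fixed values.

    # Untested sketch: delete archived buckets whose latest event time (the first
    # epoch in the db_<latest>_<earliest>_<id> name) is older than the cutoff,
    # then remove any parent directories left empty.
    import subprocess, time

    ARCHIVE_ROOT = "/projects/splunk/archive"   # placeholder archive root
    MAX_AGE_DAYS = 60                           # placeholder cutoff
    cutoff = time.time() - MAX_AGE_DAYS * 86400

    listing = subprocess.check_output(
        ["hdfs", "dfs", "-ls", "-R", ARCHIVE_ROOT], universal_newlines=True)

    parents = set()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 8 or not fields[0].startswith("d"):
            continue                            # keep only directory entries
        path = fields[-1]
        name = path.rsplit("/", 1)[-1]
        if not name.startswith("db_"):
            continue                            # only bucket directories
        latest_time = int(name.split("_")[1])   # latest event time in the bucket
        if latest_time < cutoff:
            subprocess.check_call(["hdfs", "dfs", "-rm", "-r", "-skipTrash", path])
            parents.add(path.rsplit("/", 1)[0])

    # -rmdir only removes empty directories, so it is safe to run over every
    # parent we touched; repeat upward if you also want to prune grandparents.
    for parent in sorted(parents, reverse=True):
        subprocess.call(["hdfs", "dfs", "-rmdir", parent])

The -skipTrash flag deletes the data outright instead of moving it to the HDFS trash directory; drop it if you prefer that safety net.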
