Getting Data In

Splunk archive app: need advice on script to cleanup Hadoop data

tsunamii
Path Finder

We are now using Splunk archiving. I understand that there is no mechanism to delete the Hadoop Splunk data that has been archived. I would like to write a general script for deletion based on date (e.g. might want to delete data more than 60 days.)

Here is a sample archived directory with the timestamps and identify directories to be deleted that are older than n days. There are timestamps on the directory names. Would I recurse down to the directory that has the journal.gz e.g. 1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/journal.gz and then check the timestamps of 1440973083 and 1439867820 and if these were OLDER than n days ago delete the directory and files: db_1440973083_1439867820_1/journal.gz or what? Please advise.

    drwx------ - splunk splunk 0 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000 
    drwx------ - splunk splunk 0 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000 
    drwx------ - splunk splunk 0 2015-09-12 22:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1 
    -rw------- 3 splunk splunk 117 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/archive.content-md5_9ff9fb525c137adf5aac9184b62a22f2.receipt 
    -rw------- 3 splunk splunk 0 2015-09-12 22:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/archive.valid 
    -rw------- 3 splunk splunk 14002 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/bucket-metadata.seq 
    -rw------- 3 splunk splunk 2507794 2015-09-12 21:38 /projects/splunk/archive/036F3BB1-486F-4225-B461-521174F8B918/1440460800_1437696000/1440460800_1437696000/db_1440973083_1439867820_1/journal.gz

kschon_splunk
Splunk Employee
Splunk Employee

Yes, that sounds basically correct. The timestamps in the directory name are [latest time]_[earliest time], so you only need to check the first one. Note that these times refer to the events in that bucket. The date the bucket was archived might have been significantly later. Also, you may want to delete any higher-level directories that are empty after you delete the buckets, both to conserve HDFS inodes, and to make Hunk split-generation marginally faster.

Get Updates on the Splunk Community!

Introducing the 2024 SplunkTrust!

Hello, Splunk Community! We are beyond thrilled to announce our newest group of SplunkTrust members!  The ...

Introducing the 2024 Splunk MVPs!

We are excited to announce the 2024 cohort of the Splunk MVP program. Splunk MVPs are passionate members of ...

Splunk Custom Visualizations App End of Life

The Splunk Custom Visualizations apps End of Life for SimpleXML will reach end of support on Dec 21, 2024, ...