We are going through a compliance review and I need to verify the integrity of our data. How would I go about doing this on a virtual index? We are using Hadoop to read the data from S3 in AWS.
You really can't use the Splunk Data Integrity feature with Hadoop virtual index data. Splunk did not process the data through its own indexing pipeline, so the Data Integrity features of that pipeline don't apply. You would have to build your own file hashing and signing solution for HDFS and make sure it checks all of your compliance boxes.
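To give a rough idea of what a home-grown hashing and signing approach could look like, here is a minimal Python sketch. It is not a Splunk or Hadoop feature. It assumes the archived files are reachable through a local path (for example, copied down or mounted), and it uses an HMAC with a shared secret as a stand-in for signing; whether that is acceptable depends on your compliance requirements. The function names (`hash_file`, `build_manifest`, `sign_manifest`, `verify`) are made up for illustration.

```python
import hashlib
import hmac
import json
from pathlib import Path


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hash_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def sign_manifest(manifest, key):
    """HMAC-SHA256 over the sorted JSON form of the manifest.

    Assumption: a shared secret key stands in for a real signing scheme.
    """
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(root, manifest, signature, key):
    """Return a list of problems; an empty list means everything matched."""
    if not hmac.compare_digest(sign_manifest(manifest, key), signature):
        return ["manifest signature mismatch"]
    current = build_manifest(root)
    problems = []
    for name, expected in manifest.items():
        if name not in current:
            problems.append(f"missing: {name}")
        elif current[name] != expected:
            problems.append(f"changed: {name}")
    extras = sorted(set(current) - set(manifest))
    problems.extend(f"unexpected: {name}" for name in extras)
    return problems
```

You would build and sign the manifest when the data is archived, keep the manifest and signature somewhere outside the hashed directory, and run `verify` later to confirm nothing was altered, removed, or added.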
Well, the data did go through Splunk first. We roll our data off to AWS S3 as our archiving solution and use Analytics for Hadoop to read it through Splunk.
Are you aware of any file hashing components for HDFS?
Oh, well, that changes things a little bit! If the data went through Splunk's Data Integrity feature as it was indexed, before it wound up in HDFS, then the hashes made when the bucket rolled from hot to warm should still be valid on the copy of the bucket in S3. The question becomes whether there is a way to check those, and I'm honestly not sure.