Solved: Re: Good way to ingest a pile of historical data?

chris_jepeway · ‎05-20-2019

I've got a fair amount of historical data I need to index as part of migrating to a newly built cluster.

Is the approach described in the Community Wiki entry "Understanding Buckets", under "Another example, this time with historical data", still the way to go?

That is, should I load the historical data into a separate index to prevent the "bucket spread" I might expect from mixing current data and old data into the same index?

Or is it safe to just load data from a year ago (say) into the same index where I'm loading my current, real-time logs?

sduff_splunk · ‎05-20-2019

Depending on the volume, you are probably better off storing the old and new data separately.

If you ingest the old and new data into the same index at the same time, the buckets that are created will span a very wide range between their earliest and latest event timestamps. When you run a search, Splunk first uses those timestamps to decide whether a particular bucket is relevant. Because the time range is wide, these buckets will often look relevant, so further checks are needed to see whether they actually contain matching events. If the time range were narrower, the buckets that aren't needed would be discarded immediately, so search performance would improve.
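To make the pruning argument concrete, here is a rough sketch in Python. It is a toy model, not Splunk's actual implementation: the `Bucket` class, the `candidate_buckets` helper, the bucket names, and the timestamps are all made up for illustration. It only shows that a bucket whose time range covers a whole year can't be ruled out by timestamp alone, while narrow buckets can.

```python
from dataclasses import dataclass

DAY = 86400
NOW = 1_558_310_400  # arbitrary reference time in epoch seconds (assumption)


@dataclass
class Bucket:
    name: str      # hypothetical bucket label
    earliest: int  # timestamp of the oldest event in the bucket
    latest: int    # timestamp of the newest event in the bucket


def candidate_buckets(buckets, search_start, search_end):
    """Return names of buckets whose time range overlaps the search window."""
    return [
        b.name
        for b in buckets
        if b.earliest <= search_end and b.latest >= search_start
    ]


# Old and new data mixed into one index: one bucket spans a full year.
mixed = [Bucket("mixed_0", NOW - 365 * DAY, NOW)]

# Old and new data kept apart: each bucket covers a narrow range.
separate = [
    Bucket("historical_0", NOW - 365 * DAY, NOW - 364 * DAY),
    Bucket("current_0", NOW - 1 * DAY, NOW),
]

# A one-day search window roughly six months back.
window = (NOW - 180 * DAY, NOW - 179 * DAY)

print(candidate_buckets(mixed, *window))     # the wide bucket must be opened
print(candidate_buckets(separate, *window))  # nothing needs to be opened
```

In the mixed case the wide bucket overlaps the search window even if it holds no events from that day, so it still has to be checked. In the separated case both buckets are skipped on timestamps alone.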

This also impacts the rolling of buckets to Frozen. A bucket will only roll when the newest event in it is older than the retention period. So if you mix new events with old events, the old events cannot roll as long as the bucket they are stored in also contains a recent event.
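The same point as a small Python sketch. Again this is a simplified model under assumptions: `can_freeze` is a hypothetical helper, the 90-day retention value is invented, and only time-based retention (the kind set with `frozenTimePeriodInSecs` in indexes.conf) is modelled.

```python
DAY = 86400
NOW = 1_558_310_400        # arbitrary reference time in epoch seconds (assumption)
RETENTION_SECS = 90 * DAY  # made-up retention period for illustration


def can_freeze(bucket_latest, now=NOW, retention_secs=RETENTION_SECS):
    """A bucket is eligible only when its NEWEST event is past retention."""
    return now - bucket_latest > retention_secs


# Bucket holding year-old events mixed with an event from yesterday.
mixed_bucket_latest = NOW - 1 * DAY

# Bucket holding only historical events, the newest 364 days old.
historical_bucket_latest = NOW - 364 * DAY

print(can_freeze(mixed_bucket_latest))       # False: the recent event pins the bucket
print(can_freeze(historical_bucket_latest))  # True: every event is past retention
```

The year-old events in the mixed bucket stay on disk until the newest event in that bucket ages out, which is why keeping historical data in its own index makes retention behave more predictably.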

pruthvikrishnap · ‎05-20-2019

Hi Chris,

I would suggest indexing it the regular way, unless the amount of data is very large and would impact your license. The one thing I would take care over is which timestamp Splunk is using. Make sure Splunk is not using the index time, which would mess up the data that is currently being indexed.
In this scenario all the data will first hit a hot bucket and, based on the bucket settings, the historical data will roll over to the appropriate buckets accordingly.
