We are trying to export data from Splunk to Hadoop. The source index is not populated in real time, and its events are timestamped with the time of the event rather than the time they arrive. Because the data reaches Splunk with a delay, we want each export to cover a time window that is offset from the export time, so that the data for that window is already indexed in Splunk and nothing is lost in the export.
We are on Splunk Enterprise v7.2.4.2 and Hadoop Connect v1.2.5, and we want to export data to Cloudera (cdh5.15.1).
What we have tried is to schedule an hourly export and force the offset time window in the SPL with earliest and latest. The configured export parameters are:
SPL:
index=data_index_to_export earliest=-7h@h latest=-6h@h
| “some transformations”
| fields *, host, sourcetype, source, date*, field1, field2, field3, field4, field5, field6, field7, field8
Format: csv
Fields: field1, field2, field3, field4, field5, field6, field7, field8
HDFS Cluster: cloudera_cluster
HDFS base path: cloudera_base_path
Partition by: Date and Hour
Export from: 13/03/2019
Export frequency: Every hour
Parallel searches: 1
Compression level: 9
The schedule was configured before 13:00, so the first execution ran at 13:00.
In the export job we see that Hadoop Connect selects the time range of events to export by adding the following at the beginning of the SPL:
(_indextime=15524316* OR _indextime=15524317* OR _indextime=15524318* OR _indextime=15524319* OR _indextime=1552432* OR _indextime=1552433* OR _indextime=1552434* OR _indextime=1552435* OR _indextime=1552436* OR _indextime=1552437* OR _indextime=1552438* OR _indextime=1552439* OR _indextime=155244* OR _indextime=155245* OR _indextime=155246* OR _indextime=1552470* OR _indextime=1552471* OR _indextime=1552472* OR _indextime=1552473* OR _indextime=1552474* OR _indextime=1552475* OR _indextime=1552476* OR _indextime=1552477* OR _indextime=15524780* OR _indextime=15524781* OR _indextime=15524782* OR _indextime=155247830* OR _indextime=155247831* OR _indextime=155247832* OR _indextime=155247833* OR _indextime=1552478340 OR _indextime=1552478341 OR _indextime=1552478342 OR _indextime=1552478343 OR _indextime=1552478344 OR _indextime=1552478345 OR _indextime=1552478346 OR _indextime=1552478347 OR _indextime=1552478348)
The earliest=-7h@h latest=-6h@h from our SPL is then applied on top of this filter, restricting the 13-hour range (from 13/03/2019 00:00:00 to 13/03/2019 13:00:00) to 1 hour.
So the first execution of the schedule works fine and exports the desired hour of data, because the 1-hour window falls inside the 13-hour range selected by Hadoop Connect.
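To double-check which range the generated filter covers, the wildcard terms can be expanded into epoch bounds. This is only a minimal Python sketch: it assumes 10-digit epoch seconds, looks only at the first and last terms of the filter above, and uses a helper name (`prefix_bounds`) made up for illustration.

```python
from datetime import datetime, timezone

def prefix_bounds(term, width=10):
    """Return the (lowest, highest) epoch second matched by a term like '15524316*'.

    Hypothetical helper: assumes epoch seconds are `width` digits long and that
    the only wildcard is a trailing '*'.
    """
    digits = term.rstrip("*")
    pad = width - len(digits)
    low = int(digits + "0" * pad)   # wildcard digits all 0
    high = int(digits + "9" * pad)  # wildcard digits all 9
    return low, high

# First and last terms of the _indextime filter added by Hadoop Connect.
first_term = "15524316*"
last_term = "1552478348"

start = prefix_bounds(first_term)[0]
end = prefix_bounds(last_term)[1]

print(start, datetime.fromtimestamp(start, tz=timezone.utc))
print(end, datetime.fromtimestamp(end, tz=timezone.utc))
```

The bounds come out as 2019-03-12 23:00:00 UTC and 2019-03-13 11:59:08 UTC. If the server runs on UTC+1 (an assumption), that is 13/03/2019 00:00:00 to 12:59:08 local time, which agrees with the 13-hour range described above. Note that the generated terms filter on _indextime.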
The problem comes with the second and subsequent executions. Hadoop Connect then selects only the next hour (13/03/2019 13:00:00 to 13/03/2019 14:00:00), while our SPL still adds earliest=-7h@h latest=-6h@h. We are therefore asking for a time range that lies outside the range already selected by Hadoop Connect, and consequently no data is exported.
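The mismatch can be reproduced with plain date arithmetic. The Python sketch below only models the ranges as described in this post; it is not how Hadoop Connect computes them internally. The function names are hypothetical and the datetimes are naive local times.

```python
from datetime import datetime, timedelta

def spl_window(run_time, earliest_hours=7, latest_hours=6):
    """Mimic earliest=-7h@h latest=-6h@h: go back N hours, then snap to the hour."""
    def snap(t):
        return t.replace(minute=0, second=0, microsecond=0)
    return (snap(run_time - timedelta(hours=earliest_hours)),
            snap(run_time - timedelta(hours=latest_hours)))

def overlaps(a, b):
    """True if the half-open ranges [a0, a1) and [b0, b1) share any instant."""
    return a[0] < b[1] and b[0] < a[1]

day = datetime(2019, 3, 13)
# (execution time, range selected by Hadoop Connect as described in the post)
runs = [
    (day.replace(hour=13), (day, day.replace(hour=13))),
    (day.replace(hour=14), (day.replace(hour=13), day.replace(hour=14))),
]

for run_time, connect_range in runs:
    window = spl_window(run_time)
    print(run_time.time(), window[0].time(), window[1].time(),
          overlaps(window, connect_range))
```

For the 13:00 run the SPL window is 06:00 to 07:00 and falls inside the Hadoop Connect range. For the 14:00 run the window is 07:00 to 08:00, while Hadoop Connect only selects 13:00 to 14:00, so the two ranges no longer intersect.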
My question is: how can we schedule a periodic export that takes 1 hour of data, deferred X hours from the export execution time?