I have a script that collects data and then outputs it to a directory.
The script is run by Splunk every hour, and the directory is continuously indexed.
After the script gets the data, it checks it against a temp file to make sure there are no duplicate events and writes only new events into the indexed directory.
The problem is that the data is still indexed even though the file is not being written to.
What commands in Python will make Splunk index the script data directly?
I do not have any print lines; the only thing I do is open and read files and output to files.
The usual behavior of scripted inputs in Splunk is to index the STDOUT from the script. If you adapt your script to output only the new events to STDOUT, you shouldn't get that data duplication.
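As a rough illustration of that suggestion, here is a minimal sketch in Python. The function name `emit_new_events` and the idea of passing in the previously seen events as a list are assumptions made for the example, not anything from your script; how you store the previous run's events is up to you.

```python
import sys


def emit_new_events(collected, previously_seen, out=None):
    """Write events that were not seen before to `out`, one per line.

    `collected` and `previously_seen` are iterables of event strings.
    `out` defaults to sys.stdout, which is what a scripted input would
    normally have indexed. Returns the list of new events.
    (Hypothetical helper for illustration only.)
    """
    if out is None:
        out = sys.stdout
    seen = set(previously_seen)
    new_events = []
    for event in collected:
        if event not in seen:
            seen.add(event)  # also suppresses repeats within this run
            new_events.append(event)
            out.write(event + "\n")
    out.flush()
    return new_events
```

With something like this, the script would no longer need to write into a monitored directory at all; only the lines written to STDOUT would be candidates for indexing.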
I have two folders called 'temp' and 'data'. When the script runs, it collects some data and then compares it with the contents of the 'temp' folder. If there is a difference, it makes a list of the differences and outputs them to the 'data' folder. It then writes all of the collected data into 'temp' to represent the latest record. If no differences are found, nothing is written to the 'temp' or 'data' folder.
'temp' is just a temp folder but 'data' is the folder that is being continuously indexed.
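To make the scenario easier to follow, the workflow described above could be sketched roughly like this. The function name `write_new_events`, the snapshot filename `latest.txt`, the one-event-per-line format, and the choice to write each batch of differences to a new, uniquely named file are all assumptions for illustration; the actual script may differ on every one of these points.

```python
import os
import uuid


def write_new_events(collected, temp_dir, data_dir):
    """Compare `collected` with the last snapshot and write only the differences.

    Returns the path of the file written to `data_dir`, or None when
    nothing changed (in which case neither folder is touched).
    """
    os.makedirs(temp_dir, exist_ok=True)
    os.makedirs(data_dir, exist_ok=True)
    snapshot_path = os.path.join(temp_dir, "latest.txt")  # hypothetical name

    # Load the previous run's events, if a snapshot exists.
    previous = set()
    if os.path.exists(snapshot_path):
        with open(snapshot_path) as f:
            previous = {line.rstrip("\n") for line in f}

    differences = [event for event in collected if event not in previous]
    if not differences:
        return None  # no differences: write nothing to 'temp' or 'data'

    # Assumption: each batch goes to a brand-new file in the indexed folder,
    # so files already in 'data' are never rewritten.
    out_path = os.path.join(data_dir, "events_%s.log" % uuid.uuid4().hex)
    with open(out_path, "w") as f:
        for event in differences:
            f.write(event + "\n")

    # Record everything collected this run as the latest snapshot.
    with open(snapshot_path, "w") as f:
        for event in collected:
            f.write(event + "\n")

    return out_path
```

In this sketch only the 'data' folder would need to be monitored; the snapshot in 'temp' exists purely for the comparison.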
You missed my point, which was a suggestion to re-work your script expressly to write its results to STDOUT.
In any event, let's figure out why it's grabbing your events. Does Splunk know about the location where the temp files are written? Could it be indexing those files?
When you write the new outputs, are you writing the whole thing out to a new file? Reusing an existing file?
When you say "The problem is that the data is still indexed even though the file is not being written to," what file do you mean? Can you give us some sample filenames to make the scenario easier to follow?
The issue is that I don't write to STDOUT anywhere, yet the data is still being indexed.