Getting Data In

How to remove duplicate data and how to prevent having duplicate data?

edrivera3
Builder

Hi

I have many configuration text files that basically look like this:
Owner Name: AAAAA AAAAA
Product Name: AAAA AAAA
Product ID: NNNNN-NN Serial ID: NN-NN-NN-NNNNN

Sometimes there is a change in the product ID or serial ID, and I want to index the new change but not keep the old event. Basically, I want to replace the old configuration file with the new one.

I tried the inputs.conf below because some files were not getting indexed due to their similarity. Everything was fine until I found out that every time there is a change in the configuration text file, the file is indexed again but does not replace the old one. So now I have multiple configuration files with the same source, which is a problem.

[Monitor://Some directory]
index = my_index
sourcetype = my_sourcetype
crcSalt = <SOURCE>
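
For comparison, if the only reason for crcSalt is that files with similar beginnings were being skipped, the same stanza could lengthen the CRC instead of salting it with the source path. This is only a sketch; initCrcLength is a standard inputs.conf setting, and the value below is purely illustrative:

[Monitor://Some directory]
index = my_index
sourcetype = my_sourcetype
initCrcLength = 1024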

(1) Right now I need to delete all events that already have a newer version, based on _indextime.
(2) I need a new inputs.conf setup that will prevent this behavior.

1 Solution

edrivera3
Builder

This solution comes basically from woodcock's idea.

I eliminated the

| sort - indextime

because it wasn't running correctly for me.
I replaced the _raw field with two fields from my data plus the indextime field.

... | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | search index=* NOT [ ... | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | dedup source | fields field1, field2, indextime ]
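
Spelled out end to end (a sketch only: the index and sourcetype names are placeholders, and field1/field2 stand in for the two fields that identify a configuration file), the deletion search would look something like this; verify the results before appending | delete:

index=my_index sourcetype=my_sourcetype | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | search NOT [ search index=my_index sourcetype=my_sourcetype | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | dedup source | fields field1, field2, indextime ] | delete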



sloshburch
Splunk Employee

Have you tried not deleting old results but instead just searching for the latest results? Something like:

... | eval _time=(_indextime) | stats latest(*) by source
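
Filled in with placeholder index and sourcetype names, and with AS * so the output fields keep their original names, this sketch would be:

index=my_index sourcetype=my_sourcetype | eval _time=_indextime | stats latest(*) AS * by source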


woodcock
Esteemed Legend

OK, try this then:

... NOT [... | sort - _indextime | dedup source | fields _raw]

If this looks correct (has only the bad stuff), then it should be safe to pipe this to | delete.
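
With the base search written out (index and sourcetype names are placeholders; a sketch of the suggestion above, not a tested fix), the whole thing would be along these lines:

index=my_index sourcetype=my_sourcetype NOT [ search index=my_index sourcetype=my_sourcetype | sort - _indextime | dedup source | fields _raw ]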

edrivera3
Builder

Your idea is very interesting, but it didn't work and I am not sure why.

The subsearch provides all the good data, so the Boolean operator NOT should eliminate the good data from the total data, leaving only the bad data.


woodcock
Esteemed Legend

What is the result of this search?

... | sort - _indextime | dedup source | fields _raw | format

It should have one field, called search, containing a list of _raw values joined by OR.


edrivera3
Builder

This search provided all the right data. If I look in the Statistics tab I get a table with one row:

_raw        search
            NOT ()

woodcock
Esteemed Legend

That's all I've got. Play around and see if you can make it work and update this Q&A with what you find.


richgalloway
SplunkTrust

The first part of the query can be as simple as index=foo sourcetype=my_sourcetype.

---
If this reply helps you, Karma would be appreciated.

richgalloway
SplunkTrust

Try this instead.

index=foo sourcetype=my_sourcetype | eval oldest=relative_time(now(),"-1d@d") | where _indextime<oldest

Adjust the arguments to relative_time as needed.
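
For example, to target events indexed more than an hour ago rather than more than a day ago, only the time-modifier string changes (index and sourcetype names remain placeholders):

index=foo sourcetype=my_sourcetype | eval oldest=relative_time(now(),"-1h") | where _indextime<oldest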

---
If this reply helps you, Karma would be appreciated.

edrivera3
Builder

I don't see how this solves my problem. Could you elaborate on your solution?
now() = the time when the search started
oldest = 1 day before the search started

Basically I need a solution that provides the same results as woodcock's solution but without using eventstats.


woodcock
Esteemed Legend

This should be the opposite of dedup:

... | eventstats max(_indextime) AS latestIndexTime by source | where _indextime<latestIndexTime

Then you just pipe that to delete by adding this:

... | delete
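
Written out with a placeholder base search and a quick review table before any deletion (a sketch; check that the listed events really are the stale copies first):

index=my_index sourcetype=my_sourcetype | eventstats max(_indextime) AS latestIndexTime by source | where _indextime<latestIndexTime | table source _time _indextime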

edrivera3
Builder

Your command is perfect for selecting the events, but I encountered the following error when I added the delete command:
Error in 'delete' command: This command cannot be invoked after the non-streaming command 'eventstats'.
The search job has failed due to an error. You may be able to view the job in the Job Inspector.

I am going to retry running the command, but for some reason it takes a long time to run.


edrivera3
Builder

I got the same error. My roles are can_delete, user, and power.


woodcock
Esteemed Legend

My solution will not work then, because evidently the use of eventstats precludes the use of delete (which, IMHO, is definitely a bug).


richgalloway
SplunkTrust

Splunk does not have an "index this only if it's not already indexed" feature. The performance of such a feature would probably be poor. Nor will Splunk replace or update anything already indexed.
You can remove duplicate data (or any data) by piping a search to the delete command.

---
If this reply helps you, Karma would be appreciated.

edrivera3
Builder

OK. That's too bad. But how do I make Splunk delete events that have a newer version? I know about the delete command, but I haven't been successful in selecting the appropriate data. With the command below I've been able to see the indextime and which sources have more than one file indexed:
... | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | stats count by source | where count > 1
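
A small extension of that check (a sketch; index and sourcetype names are placeholders) also lists the index times of each duplicated source, which makes it easier to see which copy is stale:

index=my_index sourcetype=my_sourcetype | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | stats count values(indextime) AS indextimes by source | where count > 1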


edrivera3
Builder

The only way I have found is deleting one file at a time, which is very inefficient.


richgalloway
SplunkTrust

You need to have permission to use the delete command. That's the best way to remove events from Splunk.

---
If this reply helps you, Karma would be appreciated.

edrivera3
Builder

I have permission to use the delete command; the problem is that I don't know how to select the appropriate data for deletion.


woodcock
Esteemed Legend

See my Answer!
