Getting Data In

How to remove duplicate data and how to prevent having duplicate data?

edrivera3
Builder

Hi

I have many configuration text files that basically look like this:
Owner Name: AAAAA AAAAA
Product Name: AAAA AAAA
Product ID: NNNNN-NN Serial ID: NN-NN-NN-NNNNN

Sometimes there is a change in the product ID or serial ID, and I want to index the new change but not keep the old event. Basically, I want to replace the old configuration file with the new one.

I tried the inputs.conf below because some files were not getting indexed due to their similarity. Everything was fine until I found out that every time there is a change in the configuration text file, the file is indexed again but does not replace the old one. So now I have multiple configuration files with the same source, which is a problem.

[Monitor://Some directory]
index = my_index
sourcetype = my_sourcetype
crcSalt = <SOURCE>
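
For comparison, if the only reason for crcSalt is that files with similar beginnings were being skipped, the same stanza could lengthen the CRC instead of salting it with the source path. This is only a sketch; initCrcLength is a standard inputs.conf setting, and the value below is purely illustrative:

[Monitor://Some directory]
index = my_index
sourcetype = my_sourcetype
initCrcLength = 1024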

(1) Right now I need to delete all events that already have a newer version, based on _indextime.
(2) I need a new inputs.conf setup that will prevent this behavior.

1 Solution

edrivera3
Builder

This solution comes basically from woodcock's idea.

I eliminated the

| sort - indextime

because it wasn't running correctly for me.
I replaced the _raw field with two fields from my data plus the indextime field.

... | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | search index=* NOT [ ... | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | dedup source | fields field1, field2, indextime ]
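
Spelled out end to end (a sketch only: the index and sourcetype names are placeholders, and field1/field2 stand in for the two fields that identify a configuration file), the deletion search would look something like this; verify the results before appending | delete:

index=my_index sourcetype=my_sourcetype | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | search NOT [ search index=my_index sourcetype=my_sourcetype | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | dedup source | fields field1, field2, indextime ] | delete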



sloshburch
Splunk Employee

Have you tried not deleting old results but instead just searching for the latest results? Something like:

... | eval _time=(_indextime) | stats latest(*) by source
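
Filled in with placeholder index and sourcetype names, and with AS * so the output fields keep their original names, this sketch would be:

index=my_index sourcetype=my_sourcetype | eval _time=_indextime | stats latest(*) AS * by source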


woodcock
Esteemed Legend

OK, try this then:

... NOT [... | sort - _indextime | dedup source | fields _raw]

If this looks correct (has only the bad stuff), then it should be safe to pipe this to | delete.
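
With the base search written out (index and sourcetype names are placeholders; a sketch of the suggestion above, not a tested fix), the whole thing would be along these lines:

index=my_index sourcetype=my_sourcetype NOT [ search index=my_index sourcetype=my_sourcetype | sort - _indextime | dedup source | fields _raw ]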

edrivera3
Builder

Your idea is very interesting, but it didn't work and I am not sure why.

The subsearch provides all the good data, so the Boolean operator NOT should eliminate the good data from the total data, leaving only the bad data.


woodcock
Esteemed Legend

What is the result of this search?

... | sort - _indextime | dedup source | fields _raw | format

It should have one field, called search, containing a list of _raw values joined by OR.


edrivera3
Builder

This search provided all the right data. If I look in the Statistics tab I get a table with one row:

_raw        search
            NOT ()

woodcock
Esteemed Legend

That's all I've got. Play around and see if you can make it work and update this Q&A with what you find.


richgalloway
SplunkTrust

The first part of the query can be as simple as index=foo sourcetype=my_sourcetype.

---
If this reply helps you, Karma would be appreciated.

richgalloway
SplunkTrust

Try this instead.

index=foo sourcetype=my_sourcetype | eval oldest=relative_time(now(),"-1d@d") | where _indextime<oldest

Adjust the arguments to relative_time as needed.
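
For example, to target events indexed more than an hour ago rather than more than a day ago, only the time-modifier string changes (index and sourcetype names remain placeholders):

index=foo sourcetype=my_sourcetype | eval oldest=relative_time(now(),"-1h") | where _indextime<oldest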

---
If this reply helps you, Karma would be appreciated.

edrivera3
Builder

I don't see how this solves my problem. Could you elaborate on your solution?
now() = the time when the search started
oldest = 1 day before the search started

Basically I need a solution that provides the same results as woodcock's solution but without using eventstats.


woodcock
Esteemed Legend

This should be the opposite of dedup:

... | eventstats max(_indextime) AS latestIndexTime by source | where _indextime<latestIndexTime

Then you just pipe that to delete by adding this:

... | delete
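
Written out with a placeholder base search and a quick review table before any deletion (a sketch; check that the listed events really are the stale copies first):

index=my_index sourcetype=my_sourcetype | eventstats max(_indextime) AS latestIndexTime by source | where _indextime<latestIndexTime | table source _time _indextime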

edrivera3
Builder

Your command is perfect for selecting the events, but I encountered the following error when I added the delete command:
Error in 'delete' command: This command cannot be invoked after the non-streaming command 'eventstats'.
The search job has failed due to an error. You may be able to view the job in the Job Inspector.

I am going to retry running the command, but for some reason it takes a long time to run.


edrivera3
Builder

I got the same error. My roles are can_delete, user, and power.


woodcock
Esteemed Legend

My solution will not work then, because evidently the use of eventstats precludes the use of delete (which, IMHO, is definitely a bug).


richgalloway
SplunkTrust

Splunk does not have an "index this only if it's not already indexed" feature. The performance of such a feature would probably be poor. Nor will Splunk replace or update anything already indexed.
You can remove duplicate data (or any data) by piping a search to the delete command.

---
If this reply helps you, Karma would be appreciated.

edrivera3
Builder

OK. That's too bad. But how do I make Splunk delete events that have a newer version? I know about the delete command, but I haven't been successful in selecting the appropriate data. With the command below I've been able to see the indextime and which sources have more than one file indexed:
... | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | stats count by source | where count > 1
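
A small extension of that check (a sketch; index and sourcetype names are placeholders) also lists the index times of each duplicated source, which makes it easier to see which copy is stale:

index=my_index sourcetype=my_sourcetype | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | stats count values(indextime) AS indextimes by source | where count > 1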


edrivera3
Builder

The only way I have found is deleting one file at a time, which is very inefficient.


richgalloway
SplunkTrust

You need to have permission to use the delete command. That's the best way to remove events from Splunk.

---
If this reply helps you, Karma would be appreciated.

edrivera3
Builder

I have permission to use the delete command; the problem is that I don't know how to select the appropriate data for deletion.


woodcock
Esteemed Legend

See my Answer!
