Getting Data In

Why is huge duplicate and unwanted data being indexed into Splunk?

santosh11
New Member

Dear All,

We are getting a huge amount of duplicate and unwanted data into Splunk, and query performance is being affected. Below is the scenario:

We are using a heavy forwarder (HF) to push the data into Splunk Cloud.

This is an example of duplicate data:

source type A: 1, AA
Source Type A: 1, AA

This is an example of unwanted data:

source type A: 1, AA
Source Type A: 1, AB

Here the second event was updated to AB, so we no longer need the first one (AA).

Because of this, Splunk scans 200,000,000 events and only 15,000,000 of them are useful.

Can someone suggest a better way to maintain the data in the index?

Regards,
Santosh

1 Solution

gcusello
SplunkTrust

Hi santosh11,
Splunk ingests all the data in the monitored files; if you have duplicated data in your files, it cannot analyze the data before ingestion.
If you can find a regex to identify the unwanted data (e.g. you know you want to delete all the events where "Source Type" is written with an uppercase "S" and "T"), you can filter them out before indexing, but you cannot check whether the data was already indexed.
It's mainly a problem of license consumption.
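
If you go that way, the filtering is done with props.conf and transforms.conf on the HF. A minimal sketch (the sourcetype stanza, transform name and regex are only placeholders to adapt to your data):

props.conf
[your_sourcetype]
TRANSFORMS-drop_unwanted = drop_uppercase_source_type

transforms.conf
# events whose raw text matches REGEX are routed to the nullQueue and never indexed
[drop_uppercase_source_type]
REGEX = ^Source Type
DEST_KEY = queue
FORMAT = nullQueue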

If you have too many events in your searches and you want to speed them up, you could schedule a search (e.g. every hour) that extracts only the records you want at search time and saves them in a summary index, which you can then use for your quick searches.
For duplicated data this is easier; e.g. you could schedule a search like this:

index=my_index
| eval first_field=lower(first_field)
| dedup first_field
| table _time first_field second_field third_field
| collect index=my_summary_index
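
Then your quick searches run against the summary index instead of the raw data, for example (index and field names as in the sketch above):

index=my_summary_index
| table _time first_field second_field third_field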

For unwanted data, you have to find a rule (one or more regexes) to filter the events and create a scheduled search like the one above.
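
For instance, assuming the unwanted events are the ones whose raw text starts with "Source Type" in uppercase (again, the regex is only an example to adapt), the scheduled search could be:

index=my_index
| regex _raw!="^Source Type"
| table _time first_field second_field third_field
| collect index=my_summary_index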

Bye.
Giuseppe
