Getting Data In

Why is huge duplicate and unwanted data being indexed into Splunk?

santosh11
New Member

Dear All,

We are getting a huge amount of duplicate and unwanted data into Splunk, and query performance is being affected. Below is the scenario:

We are using a heavy forwarder (HF) to push the data into Splunk Cloud.

This is an example of duplicate data:

source type A: 1, AA
Source Type A: 1, AA

This is an example of unwanted data:

source type A: 1, AA
Source Type A: 1, AB

Here the second event was updated to AB, so we no longer need the first one (AA).

Because of this, Splunk scans 200,000,000 events and only 15,000,000 of them are useful.

Can someone suggest a better way to maintain the data in the index?

Regards,
Santosh

1 Solution

gcusello
SplunkTrust

Hi santosh11,
Splunk ingests all the data in the monitored files; if you have duplicated data in your files, it cannot analyze the data before ingestion.
If you can find a regex to identify the unwanted data (e.g. you know you want to delete all the events where "Source Type" is written with an uppercase "S" and "T"), you can filter them out before indexing, but you cannot check whether the data was already indexed.
It's mainly a problem of license consumption.
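
If you go that way, the filtering is done with props.conf and transforms.conf on the HF. A minimal sketch (the sourcetype stanza, transform name and regex are only placeholders to adapt to your data):

props.conf
[your_sourcetype]
TRANSFORMS-drop_unwanted = drop_uppercase_source_type

transforms.conf
# events whose raw text matches REGEX are routed to the nullQueue and never indexed
[drop_uppercase_source_type]
REGEX = ^Source Type
DEST_KEY = queue
FORMAT = nullQueue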

If you have too many events in your searches and you want to speed them up, you could schedule a search (e.g. every hour) that extracts only the records you want at search time and saves them in a summary index, which you can then use for your quick searches.
For duplicated data this is easier; e.g. you could schedule a search like this:

index=my_index
| eval first_field=lower(first_field)
| dedup first_field
| table _time first_field second_field third_field
| collect index=my_summary_index
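
Then your quick searches run against the summary index instead of the raw data, for example (index and field names as in the sketch above):

index=my_summary_index
| table _time first_field second_field third_field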

For unwanted data, you have to find a rule (one or more regexes) to filter the events and create a scheduled search like the one above.
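
For instance, assuming the unwanted events are the ones whose raw text starts with "Source Type" in uppercase (again, the regex is only an example to adapt), the scheduled search could be:

index=my_index
| regex _raw!="^Source Type"
| table _time first_field second_field third_field
| collect index=my_summary_index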

Bye.
Giuseppe
