Reporting

How to exclude duplicated events when accelerating a data model

luhadia_aditya
Path Finder

How should I add '| dedup' as one of the constraints of the data model?

We have a data model with a sourcetype as its base constraint, plus other fields from which we generate statistical reports via tstats searches. This sourcetype contains duplicated data.

I want to filter out the duplicated events in the data model so that the generated summaries, and the statistics reported from them, are accurate (something like '| dedup _raw' right before the 'stats' command in an ordinary search).
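For reference, this is roughly what the ad-hoc equivalent looks like today (the index, sourcetype, and field names here are placeholders, not our real ones):

    index=main sourcetype=my_sourcetype
    | dedup _raw
    | stats count by host

What I am after is a way to get the same dedup effect inside the data model itself, so that tstats searches against the accelerated summaries return deduplicated counts.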

Please suggest some ideas. Thanks!


Richfez
SplunkTrust

I don't have an answer to the precise question you asked, but I'd investigate why you have duplicate data in the first place. I know that won't help with the historical data, but it seems like the right answer here.

Is there information lacking in the logs that makes events appear duplicated? Are you ingesting a set of logs twice? Do two hosts both report the same information?


luhadia_aditya
Path Finder

Well, I have found the root cause of the duplication and resolved it.

To sum up the question: the issue persisted for a month, so we have duplicated data for that period. Users regularly generate reports on this data, and the reported stats are inaccurate because of the dupes. These reports come from the accelerated summaries created by a data model.
Now, how can I exclude the duplicated events from the data model summaries so that the stats are accurate?
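For illustration, the user reports are roughly of this shape (the model name is a placeholder); since the summaries were built from the duplicated events, the counts come back inflated:

    | tstats summariesonly=true count from datamodel=MyModel by host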

Thanks for your concern and response.


Richfez
SplunkTrust

I'm glad to hear you've got it straightened out now.

I think you have a couple of options. d and bwooden do a far better job of summarizing some of them in this answer, though I'd caution: TEST TEST TEST before doing some of those! Remember, you should be 100% sure the results of that search are really what you want to delete before you ever even enable the ability to USE delete. 😉

Anyway, if that's helpful, please upvote that very thorough tag-teamed answer to give them some credit for it.

Your idea of including a dedup would probably work really well, except it would be a big performance hit all the time. But if you only need it for a short while, until that data ages out of the system, then maybe that's the easy way to go.
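If you do try it, the only place a piped command like dedup can live in a data model is the base search of a root search dataset, so something roughly like this as the dataset's search (index and sourcetype made up):

    index=main sourcetype=my_sourcetype
    | dedup _raw

One caveat, though: as far as I know, Splunk can only accelerate root search datasets that are built entirely from streaming commands, and I don't believe dedup qualifies, so test whether the model still accelerates with that in place.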


luhadia_aditya
Path Finder

Thanks a lot for the pointer, Rich.

I had already considered the scenarios they presented, and that's the reason I wanted the dedup incorporated into the data model itself rather than having the dupes deleted.

I would highly appreciate any idea for incorporating dedup into the data model itself; the performance impact is acceptable, since it would apply only to a month's worth of data.

As far as I know, there is no way, which is why I'm asking all the Splunk Ninjas! Thanks once again!
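P.S. The stopgap I am considering in the meantime is to bypass the summaries for the affected month and dedup at report time with the datamodel command, roughly like this (the model and dataset names are placeholders):

    | datamodel MyModel My_Root_Dataset search
    | dedup _raw
    | stats count by host

It is much slower than tstats against the accelerated summaries, but at least the numbers come out right.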
