Solved: Remove duplicate values from a multivalue field

systemjack · ‎03-27-2012

I have an mvfield like contract="C53124 C53124 C67943" and I want to end up with unique values like contract="C53124 C67943".

The field is the result of a lookup table matching multiple contracts to a given tracking id in the summary result set, and duplicates are caused because there's also a contract_line component in the lookup (ex. C53124 line 1 and line 2 both map to tracking id X).

The purpose is to later use mvexpand on contract and not get unnecessary duplicate records since we don't care about contract_line.

We are currently using a command backed by a Python script to do this for all fields (thanks Vincent), but I'm wondering if there's a way to do it pure splunk within a query for one particular field.

I've looked around a bunch in splunkbase and looked over all the search commands and functions with no luck so I'm hoping you all can help.

systemjack · ‎03-27-2012

I think I figured it out.

Here's an updated version as a macro with an improved generic rex:

[dedup_mv_field(1)]
args = field_name
definition = | nomv $field_name$ | rex field=$field_name$ mode=sed "s/\b([^\n]+)\n(?=(.|\n)*\1)//g" | makemv tokenizer="([^\n]*)(\n)?" $field_name$

On a side note, strange that I still haven't figured out how to do a newline in a splunk string literal (could have used delim instead of tokenizer) and it's also strange that the makemv default behavior isn't a compliment of nomv so something like the following would work:

| nomv somemvfield | `tweak_the_field` | makemv somemvfield.

View solution in original post

cphair · ‎09-16-2016

It doesn't look like you need another answer here, but be aware that as of 6.2, there is an mvdedup command that does exactly what you need: eval contract=mvdedup(contract). See http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/CommonEvalFunctions.

mrpaul · ‎09-16-2016

Hurrah! Having it built in is the best answer yet. Good enhancement.

mrpaul · ‎09-16-2016

Thanks! Your solution was super helpful systemjack!

However, I ran into a few challenges that broke your solution:
- data that contains characters that might be word boundaries within the text. For example, if the data is "C-53124 C-53124 C-67943", the hyphens form word boundaries.

- data that is a subset of another item. EG: "C-53124 C-53124 C-67943 C-53124567" (C-53124567 contains C-53124)

In my case, I was also working with text that was already a single string, separated by semicolons, with duplicates. Therefore, I could skip the nomv step. I also could use delim rather than tokenizer, since I had a simple delimiter. Here is what I used, building off your work:

rex field=field_name mode=sed "s/((^|;)[^;]+);(?=.*\1(;|$))/;/g"  | makemv delim=";" field_name

I spent a long time working though your regex, so here is an explanation for folks who aren't regex junkies. At a high level, the deduplication approach is to search for a string, followed by anything, followed by that string again, and trash the first instance of the string. We do that through all possible matches, so only the final instance remains.

In more detail:

Search for a string that doesn't contain our delimiter:

[^;]+
Search for that string either at the beginning of the line or with the delimiter in front of it, and with the delimiter after it:

((^|;)[^;]+);
With the double parentheses, what is in the first parenthesis is saved as \1 and what is in the inner parenthesis is saved as \2.
We then look ahead for anything, followed by our string (\1), followed by either a delimiter or the end of the line. (?= performs a "positive lookahead"):

(?=.*\1(;|$))
In the replace portion, we throw away everything, except what was in the positive lookahead, and replace it with a ;

/;/
And we do all this globally - anywhere the pattern will match. This addresses multiple duplicate items, triplicates, etc.

g

I suppose, one could create two macros out of this, for strings and mv's, and handle any delimeter: (I have not tested this)

[string_dedup(2)]
args = field_name, delimiter
definition = rex field=$field_name$ mode=sed "s/((^|$delimiter$)[^$delimiter$]+)$delimiter$(?=.*\1($delimiter$|$))/$delimiter$/g"

[mv_dedup(2)]
args = field_name, delimiter
definition = nomv $field_name$ | `string_dedup($field_name$, $delimiter$)` | makemv delim="$delimiter$" $field_name$

systemjack · ‎03-27-2012

There can be any number of contract lines per contract and any number of contracts per tracking id. I've wondered myself if any set of contracts would always be adjacent in the field but am too newb at splunk to know.

systemjack · ‎03-27-2012

I think I figured it out.

Here's an updated version as a macro with an improved generic rex:

[dedup_mv_field(1)]
args = field_name
definition = | nomv $field_name$ | rex field=$field_name$ mode=sed "s/\b([^\n]+)\n(?=(.|\n)*\1)//g" | makemv tokenizer="([^\n]*)(\n)?" $field_name$

On a side note, strange that I still haven't figured out how to do a newline in a splunk string literal (could have used delim instead of tokenizer) and it's also strange that the makemv default behavior isn't a compliment of nomv so something like the following would work:

| nomv somemvfield | `tweak_the_field` | makemv somemvfield.

cphair · ‎03-27-2012

Is it always the first two entries of the mvfield that are duplicates, or do you need to remove an arbitrary number of duplicates from the field?

systemjack · ‎03-27-2012

Using dedup after the mvexpand for some combination of trackingid and contract would be the easiest answer. Unfortunately our data is kinda squirly and there's no set of fields we could dedup on where we can be sure we wouldn't lose something.

Remove duplicate values from a multivalue field

Announcing Scheduled Export GA for Dashboard Studio

Extending Observability Content to Splunk Cloud

More Control Over Your Monitoring Costs with Archived Metrics GA in US-AWS!