Splunk Search

Remove duplicate values from a multivalue field

systemjack
Explorer

I have an mvfield like contract="C53124 C53124 C67943" and I want to end up with unique values like contract="C53124 C67943".

The field is the result of a lookup table matching multiple contracts to a given tracking id in the summary result set, and duplicates are caused because there's also a contract_line component in the lookup (ex. C53124 line 1 and line 2 both map to tracking id X).

The purpose is to later use mvexpand on contract and not get unnecessary duplicate records since we don't care about contract_line.

We are currently using a command backed by a Python script to do this for all fields (thanks Vincent), but I'm wondering if there's a way to do it pure splunk within a query for one particular field.

I've looked around a bunch in splunkbase and looked over all the search commands and functions with no luck so I'm hoping you all can help.

1 Solution

systemjack
Explorer

I think I figured it out.

Here's an updated version as a macro with an improved generic rex:

[dedup_mv_field(1)]
args = field_name
definition = | nomv $field_name$ | rex field=$field_name$ mode=sed "s/\b([^\n]+)\n(?=(.|\n)*\1)//g" | makemv tokenizer="([^\n]*)(\n)?" $field_name$

On a side note, strange that I still haven't figured out how to do a newline in a splunk string literal (could have used delim instead of tokenizer) and it's also strange that the makemv default behavior isn't a compliment of nomv so something like the following would work:

| nomv somemvfield | `tweak_the_field` | makemv somemvfield.

View solution in original post

cphair
Builder

It doesn't look like you need another answer here, but be aware that as of 6.2, there is an mvdedup command that does exactly what you need: eval contract=mvdedup(contract). See http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/CommonEvalFunctions.

mrpaul
Explorer

Hurrah! Having it built in is the best answer yet. Good enhancement.

0 Karma

mrpaul
Explorer

Thanks! Your solution was super helpful systemjack!

However, I ran into a few challenges that broke your solution:
- data that contains characters that might be word boundaries within the text. For example, if the data is "C-53124 C-53124 C-67943", the hyphens form word boundaries.

- data that is a subset of another item. EG: "C-53124 C-53124 C-67943 C-53124567" (C-53124567 contains C-53124)

In my case, I was also working with text that was already a single string, separated by semicolons, with duplicates. Therefore, I could skip the nomv step. I also could use delim rather than tokenizer, since I had a simple delimiter. Here is what I used, building off your work:

rex field=field_name mode=sed "s/((^|;)[^;]+);(?=.*\1(;|$))/;/g"  | makemv delim=";" field_name

I spent a long time working though your regex, so here is an explanation for folks who aren't regex junkies. At a high level, the deduplication approach is to search for a string, followed by anything, followed by that string again, and trash the first instance of the string. We do that through all possible matches, so only the final instance remains.

In more detail:

  1. Search for a string that doesn't contain our delimiter:

    [^;]+

  2. Search for that string either at the beginning of the line or with the delimiter in front of it, and with the delimiter after it:

    ((^|;)[^;]+);

  3. With the double parentheses, what is in the first parenthesis is saved as \1 and what is in the inner parenthesis is saved as \2.

  4. We then look ahead for anything, followed by our string (\1), followed by either a delimiter or the end of the line. (?= performs a "positive lookahead"):

    (?=.*\1(;|$))

  5. In the replace portion, we throw away everything, except what was in the positive lookahead, and replace it with a ;

    /;/

  6. And we do all this globally - anywhere the pattern will match. This addresses multiple duplicate items, triplicates, etc.

    g

I suppose, one could create two macros out of this, for strings and mv's, and handle any delimeter: (I have not tested this)

[string_dedup(2)]
args = field_name, delimiter
definition = rex field=$field_name$ mode=sed "s/((^|$delimiter$)[^$delimiter$]+)$delimiter$(?=.*\1($delimiter$|$))/$delimiter$/g"

[mv_dedup(2)]
args = field_name, delimiter
definition = nomv $field_name$ | `string_dedup($field_name$, $delimiter$)` | makemv delim="$delimiter$" $field_name$
0 Karma

systemjack
Explorer

There can be any number of contract lines per contract and any number of contracts per tracking id. I've wondered myself if any set of contracts would always be adjacent in the field but am too newb at splunk to know.

systemjack
Explorer

I think I figured it out.

Here's an updated version as a macro with an improved generic rex:

[dedup_mv_field(1)]
args = field_name
definition = | nomv $field_name$ | rex field=$field_name$ mode=sed "s/\b([^\n]+)\n(?=(.|\n)*\1)//g" | makemv tokenizer="([^\n]*)(\n)?" $field_name$

On a side note, strange that I still haven't figured out how to do a newline in a splunk string literal (could have used delim instead of tokenizer) and it's also strange that the makemv default behavior isn't a compliment of nomv so something like the following would work:

| nomv somemvfield | `tweak_the_field` | makemv somemvfield.

cphair
Builder

Is it always the first two entries of the mvfield that are duplicates, or do you need to remove an arbitrary number of duplicates from the field?

systemjack
Explorer

Using dedup after the mvexpand for some combination of trackingid and contract would be the easiest answer. Unfortunately our data is kinda squirly and there's no set of fields we could dedup on where we can be sure we wouldn't lose something.

Get Updates on the Splunk Community!

.conf24 | Registration Open!

Hello, hello! I come bearing good news: Registration for .conf24 is now open!   conf is Splunk’s rad annual ...

Splunk is officially part of Cisco

Revolutionizing how our customers build resilience across their entire digital footprint.   Splunk ...

Splunk APM & RUM | Planned Maintenance March 26 - March 28, 2024

There will be planned maintenance for Splunk APM and RUM between March 26, 2024 and March 28, 2024 as ...