Splunk Search

Remove duplicate keywords (values) returned from field

pgadhari
Builder

Hi All,

I am splitting a Description field with "space" using Split command and generating list of keywords ( doing sort of text analytics) and doing the stats count by keywords to find top count of keywords in the data. The problem is, there is an "order" keyword which is coming twice in the Description, but I want to take the unique count of the keyword or remove the duplicate so that I can get the proper count of that "keyword" presence in that specific ticket Description field. Please help me how can I do that ? Following is my query, INC00001 is specific ticket number:

Description field has this text --> Order #1111111 Order date #06/25/13 client needs to process return but the invoice hasn't dropped yet

index=abc source="abc.csv" INC0001 | eval words=split(Description," ")| stats count by words

Getting following output from this query :

Order 2
but 1
date 1
dropped 1
hasn't 1
invoice 1
needs 1
process 1
return 1
the 1
to 1
yet 1

I want the query, that should count Order as one event instead of 2, as it is part of same ticket Description. I tried doing dedup but that is not working. I have to do this for ~15K ticket events for whatever top keyword counts I will be getting. But first I am trying to get it worked for single ticket event. Please help, this is quite urgent.

Regards
Pankaj

0 Karma

sundareshr
Legend

Try this

index=abc source="abc.csv" INC0001 | makemv Description delim=" " | mvexpand Description | stats dc(Description)  as count  by Description

pgadhari
Builder

Hi,

I tried that command, but this is good for some number of incidents. In my data I have 25000 events (tickets) and I want to run this for all those events, it is getting stuck saying memory error. Also, after that I have to count the keywords and sort it by count so that i can show the maximum number of occurrences of those keywords (sort of text analytics in my data) i.e. i have to count for common keywords and want to show top 50 keywords with their count. Whether we can do that by adding some search to this query ?

*command.mvexpand: output will be truncated at 134000 results due to excessive memory usage. Memory threshold of 500MB has been reached *

Regards
PG

0 Karma

richgalloway
SplunkTrust
SplunkTrust

Try ... | stats distinct_count by words.

---
If this reply helps you, Karma would be appreciated.

pgadhari
Builder

I used distinct_count by not getting the answer in the way i want. It is returning the table for all the fields in the data. Attaching the screenshot of the same. I want the answer in the simple way as below, so that I can sort on count for all the ticket data and show the top occurrences of the keywords. Somehow i cannot attach the screenshot due to less karma points.

words distinct_count(Assignment group) distinct_count(Category) distinct_count(Description) distinct_count(Location) distinct_count(Month _ Year) distinct_count(Name) distinct_count(Number) distinct_count(Open_Date_Time) distinct_count(Priority) distinct_count(SRT Held) distinct_count(Short description) distinct_count(Status) distinct_count(Task Type) distinct_count(asset_id_1) distinct_count(current_ticket_state) distinct_count(date_hour) distinct_count(date_mday) distinct_count(date_minute) distinct_count(date_month) distinct_count(date_wday) distinct_count(date_year) distinct_count(date_zone) distinct_count(eventtype) distinct_count(host) distinct_count(index) distinct_count(linecount) distinct_count(owner_name_1) distinct_count(punct) distinct_count(source) distinct_count(sourcetype) distinct_count(splunk_server) distinct_count(splunk_server_group) distinct_count(system_user_1) distinct_count(tag) distinct_count(tag::eventtype) distinct_count(time_submitted)
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1

06/22/16client 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1

1032844299 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1

Order 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1
but 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1
date 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1
dropped 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1
hasn't 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1
invoice 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1
needs 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1
process 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1
return 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1
the 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1
to 1 1 1 1 1 1 1 1

I want in below format :

words count
Order 1
the 1
date 1

0 Karma

richgalloway
SplunkTrust
SplunkTrust

... | stats distinct_count(Description) by words should do it.

---
If this reply helps you, Karma would be appreciated.

pgadhari
Builder

Hi Rich

Thanks for that, it is working. Now, actually I have got the dc of words and I am counting the total occurrence of that word. Now, I have to filter the words that are not relevant like "the,is,for,that"... etc...

I am putting :

index=abc source="abc.csv" |eval words=split(Description," ") | stats distinct_count(Description) as count by words | sort - count | search words!="to" AND words!="is" AND words!="" AND words!="in" AND words!="the" AND words!="not" ....

words count
for 3565
and 3294
SAP 3178
on 2897
a 2546
of 2327

I am using search words!="keyword" to remove the keywords not wanted, but this will create a long list of the keywords and for me also it will take more time. Is there any way that I use multiple keywords in search without using != & AND operators by specifying them in the search command syntax or you can suggest some way for doing it faster? I have lot of keywords being returned from 25000 events :-). Please advise.

Regards
PG

0 Karma

richgalloway
SplunkTrust
SplunkTrust

A slightly easier construct is ... | search NOT (words="to" OR words="is" OR words="in" OR words="the"...) | ..., but that's probably not the kind of shortcut you're looking for. If I were doing this, I'd look try a lookup table of words to ignore. I'll have to think about it a while before offering a specific suggestion, however.

---
If this reply helps you, Karma would be appreciated.
0 Karma

pgadhari
Builder

Hi Rich,

I got it. Actually, I have created a list of keywords now that needs to be searched, there are some 200 to 250 words. I want to put it in lookup table and search the occurrence of each keyword in each event and need to find out the total count and percentage occurring in the raw event. Also I want to make sure there should not be duplicates while searching.. I mean if "order" word is coming twice or thrice in one event, I should count it as disctinct one.

The only challenge I am getting is comparing each keyword from the lookup file and looking for it in the raw events and showing the count and percentage. can you help me in that query please.

Regards
PG

0 Karma
Get Updates on the Splunk Community!

Index This | I am a number, but when you add ‘G’ to me, I go away. What number am I?

March 2024 Edition Hayyy Splunk Education Enthusiasts and the Eternally Curious!  We’re back with another ...

What’s New in Splunk App for PCI Compliance 5.3.1?

The Splunk App for PCI Compliance allows customers to extend the power of their existing Splunk solution with ...

Extending Observability Content to Splunk Cloud

Register to join us !   In this Extending Observability Content to Splunk Cloud Tech Talk, you'll see how to ...