finding approximate percentage difference / similarity between 2 text fields

Sukisen1981
Champion

I am trying to automatically categorize events based on the text differences/matches between two events for the same field.
For example, I have a field called ticket summary, and an existing data set that looks something like this:
Ticket summary = "application down. can not log in to application" (say at time 01/01/2017)
Ticket summary = "application not working. click on customer id link and then application just hangs" (say at time 01/03/2017)
...
Now I get a new ticket summary that looks like this:
Ticket summary = "issues when i click the customer id link, nothing happens after i click it"

My goal is to have the new ticket summary linked to its closest match, the entry at 01/03/2017.
Ultimately, I want to upload all historical ticket summaries through a CSV input and use that historical data to categorize real-time/new events, assigning each new event a category based on its nearest or exact match among the historical ticket summaries.
Is this possible?

Sukisen1981
Champion

Hi - sorry for responding late.
I ended up finding a different way, probably more suited to what I need: I used the JellyFisher app and am using the Jaro-Winkler distance to compare the similarity between the user input and the historical data.
The initial results look good, but I need to train the model for a few more months with more data... I will keep this forum updated 🙂
Many thanks for your help.

aljohnson_splun
Splunk Employee

Hey @Sukisen1981 - have you tried using TFIDF + clustering to group your events?
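
For reference, here's a minimal sketch of that approach using the Machine Learning Toolkit (MLTK), assuming MLTK is installed and the historical summaries sit in a lookup. The lookup, model, and field names below are made up, and the exact names of the fields TFIDF emits may differ, so check them before the KMeans step:

| inputlookup historical_tickets.csv
| fit TFIDF ticket_summary max_features=200 into ticket_tfidf_model
| fit KMeans ticket_summary_tfidf* k=10 into ticket_kmeans_model
| table ticket_summary cluster

New events could then be scored with | apply ticket_tfidf_model | apply ticket_kmeans_model, which tags each one with the cluster field that KMeans produces.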

DalJeanis
SplunkTrust

This is by no means perfect, but it can at least give you a reasonable number or two to play with. It looks for the longest run of matching letters in the first and last names, then assigns two different values: one based on how much of the actual names went into the sent email address, the other based on what percentage of the sent email address came from the actual names.

| makeresults 
| eval mydata="dow.jones@finecompany.com,djones2753@gmail.com george.plimpton@finecompany.com,paper.lion@cantplayworth.beans mickey.mouse@disney.com,minniemouse6969@polkadotdress.edu"
| makemv mydata 
| mvexpand mydata 
| rex field=mydata "(?<workemail>[^,]*),(?<sendemail>.*)"
| rename COMMENT as "The above just creates test data"

| rename COMMENT as "assign a record number so we can put it back together. grab the first and last names, dropping domains.  Also drop all numbers out of the sent email address"
| streamstats count as recno
| rex field=workemail "^(?<firstname>[^\.]*)\.(?<lastname>[^@]*)"
| rex field=sendemail "^(?<email>[^@]*)"
| rex field=email mode=sed "s/\d//g"

| rename COMMENT as "spread the record into two, one for first name and one for last name."
| eval whichname=mvappend("first","last") 
| mvexpand whichname 
| eval thename=if(whichname="first",firstname,lastname)

| rename COMMENT as "further spread the data into one record for each letter in the name and each letter in the email being tested."
| eval emailfan=mvrange(1,len(email)+1)
| eval namefan=mvrange(1,len(thename)+1)
| mvexpand emailfan 
| mvexpand namefan

| rename COMMENT as "test whether each letter is the same."
| eval emailletter=substr(email,emailfan,1)
| eval nameletter=substr(thename,namefan,1)
| eval matchletter=if(emailletter=nameletter,1,0)

| rename COMMENT as "calculate the offset (positive or negative) so that we can check for runs of matching letters, which will all have the same offset."
| eval offset=emailfan-namefan
| sort 0 recno whichname offset namefan

| rename COMMENT as "set flags for each group that either matches or not, then calculate the group number and group size"
| streamstats current=f last(matchletter) as matchprev by recno whichname offset
| eval changematch=case(isnull(matchprev),1, matchletter=matchprev,0, true(),1)
| streamstats sum(changematch) as groupno by recno whichname offset 
| eventstats count as groupsize by recno groupno whichname offset

| rename COMMENT as "get rid of all unmatched groups, then calculate just the largest match group for each of the first and last names"
| rename COMMENT as "finally, sum the two numbers and compare to the length of the email for a rough match number."
| where matchletter=1
| stats max(groupsize) as maxmatch first(workemail) as workemail first(sendemail) as sendemail first(firstname) as firstname first(lastname) as lastname first(email) as email by recno whichname
| eval namescore=round(100*maxmatch/len(if(whichname="first",firstname,lastname)))
| stats sum(maxmatch) as totalmatch, sum(namescore) as namescore first(workemail) as workemail first(sendemail) as sendemail first(firstname) as firstname first(lastname) as lastname first(email) as email by recno
| eval matchscore = round(100*totalmatch/len(email))
| table recno workemail sendemail firstname lastname email matchscore namescore

Richfez
SplunkTrust

I believe something like this is possible.

The command you'll want is likely to be cluster. Vary what you set its t (similarity threshold) option to and see what effect it has on your clustering.
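
Something along these lines, for example; the index and field names here are placeholders for your data:

index=tickets
| cluster t=0.6 showcount=true field=ticket_summary labelfield=cluster_label
| table cluster_label cluster_count ticket_summary

By default cluster keeps one representative event per cluster; add labelonly=true to keep every event and just tag it with its label.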

Happy Splunking,
Rich

Sukisen1981
Champion

Many thanks for your reply.
Well, I did try the cluster command, and it does look similar to what I want.
There is one issue, however: the cluster command splits events into clusters based on the t value.
But how do I know which of the clusters the new ticket summary falls into without manually checking the cluster definitions formed by the output of the cluster command?

The new ticket summary is something an L1 analyst will input via a static text field or HTML area.
The aim is to form the clustering based on 2 parameters: the historical data AND a good match with the new ticket summary. The new ticket summary is, of course, not present in the historical data CSV.
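
The closest thing I can think of is to append the new summary to the historical data and cluster the combined set, something like this sketch (the lookup name, field names, and t value are all invented):

| inputlookup historical_tickets.csv
| eval origin="historical"
| append
    [| makeresults
    | eval ticket_summary="issues when i click the customer id link, nothing happens after i click it", origin="new"]
| cluster t=0.5 field=ticket_summary labelonly=true labelfield=cluster_label
| rename COMMENT as "keep only the cluster that contains the new summary"
| eventstats sum(eval(if(origin="new",1,0))) as has_new by cluster_label
| where has_new > 0
| table cluster_label origin ticket_summary

But that re-clusters everything for every new ticket, so I am hoping there is a better way.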

Richfez
SplunkTrust

You are right, cluster sort of answers the wrong half of the problem, doesn't it? Well, it may be a part of whatever answer we ultimately think of, but it's not enough by itself.

This will take more thinking. A few random thoughts:

If you broke each event up into space-delimited strings, you could score the similarity between events by the percentage of strings they have in common. I know how to split _raw up like that, but the comparison part isn't something I'm aware Splunk can do (at least on my first think), or if it can, it's not obviously simple. Also, keeping track of all this may involve some work too.
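
Here is a rough sketch of that counting idea in pure SPL, just to show the mechanics; the two summaries are hard-coded and the tokenizing is deliberately crude:

| makeresults
| eval old_summary="application not working. click on customer id link and then application just hangs"
| eval new_summary="issues when i click the customer id link, nothing happens after i click it"
| rename COMMENT as "lowercase each summary, squash punctuation to spaces, and split into deduplicated tokens"
| eval old_tokens=mvdedup(split(trim(replace(lower(old_summary), "[^a-z0-9]+", " ")), " "))
| eval new_tokens=mvdedup(split(trim(replace(lower(new_summary), "[^a-z0-9]+", " ")), " "))
| eval old_blob=" " . mvjoin(old_tokens, " ") . " "
| rename COMMENT as "fan out one row per token of the new summary and test membership in the old one"
| mvexpand new_tokens
| eval matched=if(like(old_blob, "% " . new_tokens . " %"), 1, 0)
| stats sum(matched) as common count as total
| eval pct_overlap=round(100 * common / total, 1)

Pointed at a lookup of historical summaries instead of a hard-coded old_summary (with the stats split by some record id), that would give one overlap percentage per historical ticket, and the highest score wins.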

Of course, now you get into another situation: differing words that should be counted as more or less the same. I had started writing a custom search command that would implement a fuzzing algorithm (Soundex and Double Metaphone) on strings. These algorithms were originally used in genealogy, I think (to help match up last names that were written in different ways), but they are now used in various areas of study. Once everything ELSE works (which is the hard part, IMO), something like that could be used to make the matching a lot better, so it can ignore most misspellings and the like.

So, in the meantime, I'm going to move this answer to a comment on your question, which will toss your question back into the unanswered bucket. Hopefully I'll think of something and have time to outline it, or someone else will come up with something in the meantime.
