For example, if I have a username of bsmith843 in a field returned by one search, and bsmiths845 in a field from another search, is there any way to gauge the similarity between the two strings? I know I can use wildcards or regex to try to match the strings, but if I can't match every one, I would like to know how similar they are.
And from even further in the future...
There is an app on Splunkbase that supports Levenshtein distance, Damerau-Levenshtein distance, Jaro distance, Jaro-Winkler, match rating comparison, and Hamming distance comparisons, plus a number of phonetic algorithms, including Soundex. It is called JellyFisher. Here is a sample Levenshtein distance evaluation using this app:
... | jellyfisher levenshtein_distance(sourcetype,source)
What would be returned here is an integer, according to this description of Levenshtein distance.
Each of the JellyFisher functions returns its result in a field named after the function (e.g., levenshtein_distance, damerau_levenshtein_distance, soundex).
Here is a link to the JellyFisher app.
Here is a mocked-up use of it:
| makeresults
| eval foo="kitten", bar="smitten"
| jellyfisher levenshtein_distance(foo, bar)
| table foo bar levenshtein_distance
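To give a sense of what that integer means, here is a minimal pure-Python sketch of Levenshtein distance. It is not the app's code; the function name `levenshtein` is made up for illustration, and it only shows the standard dynamic-programming calculation.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits that turn a into b."""
    # prev[j] is the distance between the previous prefix of a and the first j chars of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i chars of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "smitten"))        # 2
print(levenshtein("bsmith843", "bsmiths845"))  # 2
```

A lower number means the strings are closer; 0 means they are identical.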
Python's difflib module does something very close to this. Its SequenceMatcher.ratio() method returns a number between 0 and 1 based on the similarity of two sequences.
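As a quick illustration outside of Splunk, the ratio for the two usernames from the question can be checked like this. The helper name `similarity` is only for this sketch.

```python
import difflib

def similarity(a, b):
    # ratio() is 2*M/T, where M is the number of matched characters
    # and T is the total number of characters in both strings
    return difflib.SequenceMatcher(None, a, b).ratio()

print(round(similarity("bsmith843", "bsmiths845"), 3))  # 0.842
print(similarity("bsmith843", "bsmith843"))             # 1.0
```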
Here is a really quick example of an app named "fieldcompare" which contains a single Python search command. The app is made up of the following files:
$SPLUNK_HOME/etc/apps/fieldcompare/bin/fieldcompare.py:
import sys
import difflib
import splunk.Intersplunk

# Check whether Splunk is probing for command info (getinfo) or executing the command
(isgetinfo, sys.argv) = splunk.Intersplunk.isGetInfo(sys.argv)
args, kwargs = splunk.Intersplunk.getKeywordsAndOptions()

if isgetinfo:
    # streaming, generating, retevs, reqsop, preop
    splunk.Intersplunk.outputInfo(True, False, False, False, None)

(results, dummyresults, settings) = splunk.Intersplunk.getOrganizedResults()

# Field names come from the search arguments, with defaults
field1_name = kwargs.get("field1", "field1")
field2_name = kwargs.get("field2", "field2")
output_field = kwargs.get("result", "ratio")

try:
    for result in results:
        try:
            f1 = result[field1_name]
            f2 = result[field2_name]
        except KeyError:
            # If either field is missing, simply ignore
            continue
        sm = difflib.SequenceMatcher(None, f1, f2)
        result[output_field] = sm.ratio()
    splunk.Intersplunk.outputResults(results)
except Exception as e:
    splunk.Intersplunk.generateErrorResults("Unhandled exception: %s" % (e,))
$SPLUNK_HOME/etc/apps/fieldcompare/default/commands.conf:
[fieldcompare]
filename = fieldcompare.py
supports_getinfo = true
$SPLUNK_HOME/etc/apps/fieldcompare/metadata/default.meta:
[commands/fieldcompare]
access = read : [ * ], write : [ admin ]
export = system
[scripts/fieldcompare.py]
access = read : [ * ], write : [ admin ]
export = system
In the example shown above, the search command and app are called "fieldcompare", but you can use any name you want.
Here is a usage example:
... | fieldcompare field1=first_field field2=compare_field result=output | eval percent=round(100*output,2) | sort - percent
Be sure to look over the Custom search commands docs page for additional details about how to set this up within your Splunk environment.
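To see what that pipeline computes without a Splunk instance, here is a rough offline sketch that applies the same ratio, percent, and descending sort to a list of dicts. The function name `compare_rows` and the sample rows are invented for illustration, and placing unscored rows last is a choice of this sketch rather than a statement about how Splunk sorts.

```python
import difflib

def compare_rows(rows, field1, field2):
    """Score each row that has both fields, then sort by percent, highest first."""
    for row in rows:
        if field1 in row and field2 in row:
            ratio = difflib.SequenceMatcher(None, row[field1], row[field2]).ratio()
            row["percent"] = round(100 * ratio, 2)
    # Rows without a score are placed last (an assumption of this sketch)
    return sorted(rows, key=lambda r: r.get("percent", -1.0), reverse=True)

rows = [
    {"first_field": "bsmith843", "compare_field": "bsmiths845"},
    {"first_field": "jdoe1", "compare_field": "jdoe1"},
    {"first_field": "alice"},  # compare_field missing, so no score is added
]
for row in compare_rows(rows, "first_field", "compare_field"):
    print(row)
```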
I used this script, but it's throwing the error "Error in 'script': Getinfo probe failed for external search command 'fieldcompare'". Any suggestions?
Yes, this can be done using a custom search script and one of the many Python modules that can compare strings. You can take a look at http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison which discusses using the Levenshtein distance as a measure. With more detail about your use case, I could suggest how to structure a search and custom command, but this should be enough to start with.
I bring to you a message from the future! Nimsh wrote a Levenshtein custom command at some point: https://splunkbase.splunk.com/app/1898/