Comments and answers for "Similarity "score" of similar values of fields>"
https://answers.splunk.com/answers/519385/similarity-score-of-similar-values-of-fields.html
The latest comments and answers for the question "Similarity "score" of similar values of fields>"

Comment by oclumbertruck on oclumbertruck's comment
https://answers.splunk.com/comments/540361/view.html
Yes, we have made some progress, and I want you to know that the content provided helped tremendously - both from a technical and a non-technical perspective.
What we basically did was take the minimum value for each attribute, and then use that as an origin or offset in a Gaussian decay function. Then, on a per-attribute level, we are able to evaluate how "close" a particular attribute is to "perfect", i.e. the best-in-class item for a given feature. By then multiplying these per-feature scores together, we are able to get a score for the combined features.
In our scenario, all the features are essentially bad-things-to-begin-with, so the larger the number, the worse off the item is. With the decay, we are able to see how things compare by feature and drive scores into the ground as they become more and more irrelevant to the "perfect" score.
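A minimal Python sketch of that scoring scheme, assuming hypothetical attribute names and decay widths (the `scales` values are a tuning choice, not something given in the thread):

```python
import math

def gaussian_decay(value, origin, scale):
    # Score in (0, 1]: 1.0 when value equals the origin (best in class),
    # decaying toward 0 as the value moves away from it.
    return math.exp(-((value - origin) ** 2) / (2 * scale ** 2))

def combined_score(item, origins, scales):
    # Multiply the per-attribute scores together; one very "off" attribute
    # drives the combined score into the ground.
    score = 1.0
    for attr, value in item.items():
        score *= gaussian_decay(value, origins[attr], scales[attr])
    return score

# Hypothetical data: attributes are bad-things-to-begin-with, lower is better.
items = {
    "a": {"errors": 2, "latency": 100},
    "b": {"errors": 10, "latency": 400},
}
attrs = ["errors", "latency"]
# Origin = minimum observed value per attribute, as described above.
origins = {a: min(item[a] for item in items.values()) for a in attrs}
scales = {"errors": 5.0, "latency": 150.0}  # decay widths: a tuning choice

scores = {name: combined_score(item, origins, scales)
          for name, item in items.items()}
```

An item holding the minimum on every attribute scores exactly 1.0; everything else decays toward 0 the further it sits from best-in-class on each feature.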
So, not entirely what we had set out for, but a valid workaround that seems to fit our needs.
Thanks again for your help and insight.
oclumbertruck, Mon, 22 May 2017 16:07:42 GMT

Comment by DalJeanis
https://answers.splunk.com/comments/529451/view.html
@oclumbertruck - have you made any progress on this one?
DalJeanis, Thu, 18 May 2017 18:06:56 GMT

Comment by DalJeanis on DalJeanis's answer
https://answers.splunk.com/comments/520885/view.html
Here's an interesting semi-standard approach - using some arbitrary yardstick of stdevs, calculate whether or not the items are similar on each measurement they share in common, and assign by fiat that they are not similar on any measurement that they do not share. Then calculate the Sørensen–Dice (DSC) coefficient for the two items.
https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
Let's say that dog has 24 measurements, cat has 26, they share 20 items, and allowing readings within 1 stdev to be considered similar, they are found similar on 12 measurements.
DSC = 2*12 / (24+26) = 24/50 = 48% similar
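A minimal Python sketch of this stdev-thresholded Dice calculation, using toy measurement names and assumed population stdevs (not the dog/cat numbers above):

```python
def dice_similarity(measures_a, measures_b, pop_stdevs, threshold=1.0):
    # Count shared measurements whose gap is within `threshold` stdevs;
    # measurements the two items do not share count as dissimilar by fiat.
    shared = set(measures_a) & set(measures_b)
    similar = sum(
        1 for m in shared
        if abs(measures_a[m] - measures_b[m]) <= threshold * pop_stdevs[m]
    )
    return 2 * similar / (len(measures_a) + len(measures_b))

# Toy data: "dog" has 3 measurements, "cat" has 4, sharing 2.
dog = {"m1": 1.0, "m2": 5.0, "m3": 9.0}
cat = {"m1": 1.5, "m2": 9.0, "m4": 2.0, "m5": 3.0}
pop_stdevs = {"m1": 1.0, "m2": 1.0}
# m1 differs by 0.5 stdevs (similar); m2 differs by 4 stdevs (not).
score = dice_similarity(dog, cat, pop_stdevs)  # 2*1 / (3+4)
```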
Or, you could also reverse the process and ask: how many stdevs of variance do I need to allow as similar in order for the two items to be considered 50% (or some other yardstick) similar? This is only slightly more complicated to calculate - in the case above, it would be the 12.5th-13th smallest number of stdevs found among all the measurements.
DalJeanis, Wed, 12 Apr 2017 16:36:47 GMT

Answer by DalJeanis
https://answers.splunk.com/answering/520882/view.html
Answering this question in the general (as opposed to answering it for a specific application) requires roughly two semesters of graduate statistics.
Basically, you have to define and measure Similarity, which also requires that you define and measure Difference, or Variability. All of which requires some kind of scoring methodology, which **usually** would be determined in conjunction with understanding what the underlying measures are.
As a first, awfully simplistic way of looking at the question, you could take the measures that the two items have in common, and calculate the stdev for the entire population of items on each measure, and then calculate how many stdevs away from each other the two are. You could initially do that in terms of z score or percentile or whatever... the "right" choice will have to use successive approximation until the answers are coming in sensibly based on reference items you KNOW to be similar and items you KNOW to be different. The only requirement is that all the measures are scored the same, relative to their baselines (which is why you use zscore or stdevs or percentile rather than gross score difference).
Any measures that the two do NOT have in common must be treated as differences, and assigned some arbitrarily high distance/zscore/percentile.
Their gross geometric difference score then becomes the square root of the sum of the squares of their differences... which may yield some information or may not.
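A minimal Python sketch of this geometric difference score, with hypothetical measure names and an assumed penalty value for the non-shared measures:

```python
import math

def zscore_distance(item_a, item_b, pop_stdevs, penalty=3.0):
    # Euclidean distance in z-score space. Any measure the two items do
    # not share is treated as a difference and assigned an arbitrarily
    # high z-score gap (penalty), per the approach above.
    total = 0.0
    for m in set(item_a) | set(item_b):
        if m in item_a and m in item_b:
            diff = (item_a[m] - item_b[m]) / pop_stdevs[m]  # gap in stdevs
        else:
            diff = penalty
        total += diff ** 2
    return math.sqrt(total)

# Hypothetical items: one shared measure, two non-shared ones.
dog = {"weight": 20.0, "speed": 30.0}
cat = {"weight": 4.0, "agility": 9.0}
pop_stdevs = {"weight": 8.0, "speed": 5.0, "agility": 2.0}
d = zscore_distance(dog, cat, pop_stdevs)
```

The penalty value is the arbitrary choice the answer describes; raising it makes missing measures dominate the score.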
----------
One of the basic problems with the strategy behind your source data is that you've ALREADY extracted various statistical information which identifies relationships between things, but then you've deleted the metaknowledge that relates those statistics to each other. Assuming items were different models of car, you have twelve numbers that represent, in no particular order and in no particular standard of measurement, the car's wheel base, miles per gallon, horsepower, weight of car, number of passengers, recommended mileage for first maintenance, sticker price, overall length, turning radius, number of cylinders, customer satisfaction rating, number of such cars produced and sold per year, and so on.
A proper treatment analyzing differences between car models would have to be cognizant of which variables were expected to move together. Smaller cars get better gas mileage, so as weight drops, MPG goes up while length and wheel base drop. A car that violates this rule is likely to be an outlier of some sort, and "different" from those that track the rule. However, following the rule does not make two cars at different points on the curve "similar" to each other; they are just exemplars of their portion of the weight-performance curve.
DalJeanis, Wed, 12 Apr 2017 16:08:42 GMT