Refine your search:

I have a search which returns the result as frequency table:


uploads    frequency
  0           6
  1           4
  2           1
  5           1
Basically, 6 users have uploaded 0 times, 4 users uploaded 1 time, and so on.

Now I want to compute stats such as the mean, median, and mode. Is there an easier way to compute stats directly on a frequency table such as this? (ie mean should be 0.916666, and mode is 0) This seems to be something that is required quite often, yet I haven't found a way to do it.

I understand there's way to calculate stats on the data before I turn it into a frequency table, the problem is that the query is rather complex (because in order to include the upload == 0 case, I had to do a lot of extra work). In other words, the underlying logs have no events when there are no uploads, but to calculate accurate statistics, the 0 events must be taken into consideration.

asked 28 Jun '11, 14:55

jyzhang's gravatar image

jyzhang
3112
accept rate: 0%


2 Answers:

If you're frequency table is generated by you running stats count, and then appending the other items, you can do it most efficiently with:

source=mysourcedata 
| sistats mean(uploads) median(uploads) mode(uploads) by uploads
| append 
  [ stats count as psrsvd_sm_uploads
    | eval psrsvd_ct_uploads=      <number of entries with 0 uploads>
    | eval psrsvd_gc=psrsvf_ct_uploads
    | eval psrsvd_nc_uploads=psrsvd_ct_uploads
    | eval psrsvd_rd_uploads="0e+00"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_sm_uploads=0
    | eval psrsvd_ss_uploads=0
    | eval psrsvd_v=1
    | eval psrsvd_vm_uploads="0;"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_vt_uploads=0
    | eval uploads=0
 | stats mean(uploads) median(uploads) mode(uploads)

or, to get back the above frequency table:

source=mysourcedata 
| sistats mean(uploads) median(uploads) mode(uploads) by uploads
| append 
  [ stats count as psrsvd_sm_uploads
    | eval psrsvd_ct_uploads=      <number of entries with 0 uploads>
    | eval psrsvd_gc=psrsvf_ct_uploads
    | eval psrsvd_nc_uploads=psrsvd_ct_uploads
    | eval psrsvd_rd_uploads="0e+00"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_sm_uploads=0
    | eval psrsvd_ss_uploads=0
    | eval psrsvd_v=1
    | eval psrsvd_vm_uploads="0;"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_vt_uploads=0
    | eval uploads=0
 | stats count as frequency by uploads

if you want, you can generate the subsearch into a file with outputcsv or outputlookup, then fetch it back in and append it.

link

answered 28 Jun '11, 18:39

gkanapathy's gravatar image

gkanapathy ♦
36.8k81228
accept rate: 41%

Note the above will let you calculate mean(), stdev(), count(), and any percentile, along with a few other functions. Adding max(), min(), first(), last(), etc aren't that hard, but left as an exercise.

(28 Jun '11, 18:47) gkanapathy ♦

No easy way to compute all of these in a single simple query.

mean:

... | stats 
        sum(uploads) as totaluploads 
        sum(frequency) as totalcount 
    | eval mean=totaluploads/totalcount

mode:

... | sort - frequency | head 1 | eval mode=uploads

median:

... | sort uploads 
    | eventstats sum(frequency) as totalcount | eval medianrank=floor((totalcount/2))
    | streamstats global=f current=t window=0
        sum(frequency) as maxrank
    | where (maxrank-frequency <= medianrank) AND (medianrank < frequency)
    | eval median = uploads

trickery to combine them:

... | sort - frequency 
    | eventstats first(uploads) as mode 
    | sort uploads 
    | eventstats 
        sum(frequency) as totalcount 
        sum(uploads) as totaluploads
    | eval mean=totaluploads/totalcount
    | eval medianrank=floor((totalcount/2))
    | streamstats current=t window=0
        sum(frequency) as maxrank
    | eval median=if((maxrank-frequency <= medianrank) AND (medianrank < frequency), uploads, null())
    | where isnotnull(median)
link

answered 28 Jun '11, 16:24

gkanapathy's gravatar image

gkanapathy ♦
36.8k81228
accept rate: 41%

If that's the case, I'm thinking of writing my own custom command to expand the frequency table... this way I can pipe the results to stats and do mean(..), median(..), p90(..), etc. Unless something like this already exists...

(28 Jun '11, 17:43) jyzhang
Post your answer
toggle preview

Follow this question

Log In to enable email subscriptions

RSS:

Answers

Answers + Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "Title")
  • image?![alt text](/path/img.jpg "Title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Tags:

×423
×408
×285
×5
×1

Asked: 28 Jun '11, 14:55

Seen: 3,032 times

Last updated: 28 Jun '11, 18:47

Copyright © 2005-2014 Splunk Inc. All rights reserved.