Splunk Search

Splunk Search - Select Top 10 results by non-unique field

omend
Path Finder

Hi all,

I have gathered into Splunk sales information of store branches across the US.
The data is in the format:

    <City Name> <Branch ID> <Number of Workers> <Sales>

I would like to create a search that select the top 10 cities, sorted by their number of branches, with all of their branches sales statistics.
If possible, the results should not be grouped by the city name but just all records that match.

For example, this could be the data of the first city (that was selected first because it has 4 branches, more than the rest of the cities) :

    <New York> <1> <10> <6000>
    <New York> <2> <10> <5000>
    <New York> <3> <5>  <5000>
    <New York> <4> <5>  <5000>

How could I do that?
If I use stats to group all the records by city, I can use "dedup 10 City_Name" but then all statistics are gathered if they are equals. For example, I would get:

City_Name Branch_ID Number_Of_Workers Sales
          1         10                6000
New York  2         5                 5000
          3
          4

If I don't use stats, I don't know how I can select only 10 cities, since the number of records of those 10 cities is unknown and can only be resolved inside the search.

Thanks,
Ori.

Tags (3)
1 Solution

sideview
SplunkTrust
SplunkTrust

Whenever you want to run stats, but you don't want to actually transform the set, you should look to eventstats and streamstats

for example, foo | eventstats dc(branchId) as branchCount by cityName will actually look at every row, calculate the number of branches for each city, and then write back onto the original incoming rows, one new field called branchCount, without actually transforming the rows.

If we then sort the set by that number, we can use streamstats to work through row by row and count the number of distinct values of cityName we see. (Streamstats works as it goes, as opposed to eventstats which calculates the statistics for the entire search result set. )

foo | eventstats dc(branchId) as branchCount by cityName | sort - branchCount | streamstats dc(cityName) as rank

This will give us a field called rank that is "1" for the city that has the most branches, "2" for the city with the secondmost branches, and so on and so forth.

Then we just throw on a | where rank<5 | sort - _time, to filter to just the cities that are in the top 4, and we sort by time descending again, to basically restore our original sort order.

foo | eventstats dc(branchId) as branchCount by cityName | sort - branchCount | streamstats dc(cityName) as rank | where rank<5 | sort - _time

(Note that streamstats dc(branchCount) would have been problematic, for two cities in the top 4 might have the same number of branches, and that would lead to us displaying the top 5 or 6 cities by mistake. However assuming your cityName values are unique, dc(cityName) wont suffer that problem. )

http://docs.splunk.com/Documentation/Splunk/5.0.3/SearchReference/Streamstats

http://docs.splunk.com/Documentation/Splunk/5.0.3/SearchReference/EventStats

View solution in original post

sideview
SplunkTrust
SplunkTrust

Whenever you want to run stats, but you don't want to actually transform the set, you should look to eventstats and streamstats

for example, foo | eventstats dc(branchId) as branchCount by cityName will actually look at every row, calculate the number of branches for each city, and then write back onto the original incoming rows, one new field called branchCount, without actually transforming the rows.

If we then sort the set by that number, we can use streamstats to work through row by row and count the number of distinct values of cityName we see. (Streamstats works as it goes, as opposed to eventstats which calculates the statistics for the entire search result set. )

foo | eventstats dc(branchId) as branchCount by cityName | sort - branchCount | streamstats dc(cityName) as rank

This will give us a field called rank that is "1" for the city that has the most branches, "2" for the city with the secondmost branches, and so on and so forth.

Then we just throw on a | where rank<5 | sort - _time, to filter to just the cities that are in the top 4, and we sort by time descending again, to basically restore our original sort order.

foo | eventstats dc(branchId) as branchCount by cityName | sort - branchCount | streamstats dc(cityName) as rank | where rank<5 | sort - _time

(Note that streamstats dc(branchCount) would have been problematic, for two cities in the top 4 might have the same number of branches, and that would lead to us displaying the top 5 or 6 cities by mistake. However assuming your cityName values are unique, dc(cityName) wont suffer that problem. )

http://docs.splunk.com/Documentation/Splunk/5.0.3/SearchReference/Streamstats

http://docs.splunk.com/Documentation/Splunk/5.0.3/SearchReference/EventStats

omend
Path Finder

Amazing.
Thank you very much for the answer and the excellent explanation.

0 Karma

kishorksudha
Explorer

Good Explanation

0 Karma
Get Updates on the Splunk Community!

What's new in Splunk Cloud Platform 9.1.2312?

Hi Splunky people! We are excited to share the newest updates in Splunk Cloud Platform 9.1.2312! Analysts can ...

What’s New in Splunk Security Essentials 3.8.0?

Splunk Security Essentials (SSE) is an app that can amplify the power of your existing Splunk Cloud Platform, ...

Let’s Get You Certified – Vegas-Style at .conf24

Are you ready to level up your Splunk game? Then, let’s get you certified live at .conf24 – our annual user ...