Splunk Search

List common substrings of at least 5 stations in a subway travel history of users

hylam
Contributor

List common substrings of at least 5 stations. Also list the users who followed each substring. Is this a Splunk problem or algorithm-class homework? A similar problem would be listing the top click streams of at least 5 pages in a web analytics Sankey diagram.

https://en.wikipedia.org/wiki/Longest_common_substring_problem

subwayHistory.csv
_time,user,station
...

EDIT1
Changed "subsequence" to "substring", since users cannot skip stops.

EDIT2
Added the web analytics click stream comparison.

1 Solution

Richfez
SplunkTrust

I had been asking about a use case because I don't really want to be doing people's homework for them, and this sounds amazingly like homework. Regardless of that, I have an answer, at least as far as I see it.

So, let's start with my csv. I generated it in Excel with three columns. Column A is initially "=now()" with subsequent fields equaling the previous field+(1/86000). I know that's not quite adding one second per event, but I like the occasional missed second - it makes me feel alive. Column B (user) is "=CEILING.MATH(RAND()*12)" (I should have used more users, but it works). Column C (station) is "=CHAR((RAND()*26)+97)"

Here's the first few lines of that.

time    user        station
11/14/15 20:10:42   3   i
11/14/15 20:10:43   9   h
11/14/15 20:10:44   2   k
11/14/15 20:10:45   10  p

I repeat that, oh, a couple thousand times. I think 2200 in this example.
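
If you'd rather skip Excel entirely, a similar test set can be generated inside Splunk itself. This is just a minimal sketch, not part of my original setup: makeresults and streamstats build the rows, and the eval expressions mirror the Excel formulas above (12 users, stations a through z, one event per second).

| makeresults count=2200 | streamstats count as n | eval _time=now()+n 
| eval user=(random() % 12)+1 
| eval station=substr("abcdefghijklmnopqrstuvwxyz", (random() % 26)+1, 1) 
| fields _time, user, station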

So here's a terrible search, full of repeated things that I'm sure someone smarter can fix for me. I break it down somewhat later, but here it is whole.

index="temp_answers" sourcetype="csv" | transaction user maxpause=30s 
| search eventcount>=5 | fields user, station | mvcombine station 
| rex max_match=0 field=station "(?<path>(\w\s){5})" | rex field=station mode=sed "s/^\w\s(.*)/\1/"
| rex max_match=0 field=station "(?<path>(\w\s){5})" | rex field=station mode=sed "s/^\w\s(.*)/\1/"
| rex max_match=0 field=station "(?<path>(\w\s){5})" | rex field=station mode=sed "s/^\w\s(.*)/\1/"
| rex max_match=0 field=station "(?<path>(\w\s){5})" | rex field=station mode=sed "s/^\w\s(.*)/\1/"
| rex max_match=0 field=station "(?<path>(\w\s){5})" | rex field=station mode=sed "s/^\w\s(.*)/\1/"
| fields - station | stats count by path | sort - count

The top 5 or so results give (since I sorted to put the most common paths at the top):

path    count
p q r s t   20
n o p q r   17
m n o p q   16
o p q r s   15
q r s t u   14 

Which seems fine to me (there are a lot more results; I just didn't paste them all in here).

So, taking the search apart...

... | transaction user maxpause=30s 

The transaction combines all the events with the same user, assuming those events aren't separated by more than 30 seconds. If they are, it starts a new transaction/group. The transactions occur in lengths of 1 to about 18 stations, so that seems to work OK.
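
As an aside, if transaction ever gets too heavy on a large dataset, the same 30-second grouping can be approximated with streamstats. This is a sketch I haven't run against this data: it flags a new session whenever a user's gap exceeds 30 seconds, numbers the sessions, and rebuilds the multivalue station field (and an eventcount) that transaction would have given us.

| sort 0 user _time 
| streamstats current=f window=1 last(_time) as prev_time by user 
| eval new_session=if(isnull(prev_time) OR _time-prev_time>30, 1, 0) 
| streamstats sum(new_session) as session by user 
| stats list(station) as station, count as eventcount by user, session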

| search eventcount>=5 | fields user, station | mvcombine station 

We then only include transactions that have at least 5 items (because our minimum path is 5 stations). We strip out all the other fields, leaving just "user" and "station". Then we do some magic with mvcombine: it takes the set of events in each transaction and combines the individual stations into a single string of stations, giving me things like "a b d e f g h j m n p r s u v w y ".

| rex max_match=0 field=station "(?<path>(\w\s){5})" | rex field=station mode=sed "s/^\w\s(.*)/\1/"

The first part uses rex to make as many matches of 5 characters separated by spaces as it can. This takes "a b d e f g h j m n p r s u v w y " and creates "a b d e f ", "g h j m n " and so on, but the second and subsequent extractions each start after the previous run of 5, not shifted by one character. Since I couldn't find a direct way to start a match at each character, I rex it again in sed mode, shifting the string by one station (removing the first character and its space). So now we have several of the paths, and the station string is shifted to look like "b d e f g h j m n p r s u v w y ".

This is repeated 5 times, each time getting the shifted strings. In the example string I've been using, "b d e f g ", etc... then "d e f g h" and so on.
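
If you want to avoid the five-fold repetition, one possibility (assuming Splunk's PCRE engine advances past zero-width matches the way I'd expect, which I haven't verified) is to wrap the capture in a lookahead so matches can overlap, getting every 5-station window in a single pass:

| rex max_match=0 field=station "(?=(?<path>(\w\s){5}))"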

| fields - station | stats count by path | sort - count

Then I remove the station field because it's no longer needed, then do a stats count by path. The final bit is just to sort it so the bigger counts are on top - not important for this, I just wanted to.
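
Since the original question also asked which users followed each substring, swapping the final stats for something like this should cover that too (values(user) is my reading of "list the users"; it returns each distinct user once per path):

| fields - station | stats count, values(user) as users by path | sort - count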

Richfez
SplunkTrust

It occurred to me that if you only needed any paths 5 stops or longer (instead of all precisely 5-stop paths and sub-paths), this search is quite a bit shorter.

index="temp_answers" sourcetype="csv" | transaction user maxpause=30s 
| search eventcount>=5 | fields user, station | mvcombine station

Obviously, fix up the index and base search as required, and adjust the maxpause on the transaction if necessary.

Then the rest is just as it was above. In my data this doesn't show any duplicated paths when run against all the paths of 5 stops or more, but that's to be expected because I chose my data randomly, whereas subway stops probably show definite patterns.
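
For completeness, one way to finish that shorter version (a sketch; the rename to "path" is just my naming choice) is to treat each whole combined string as the path and count it directly:

| rename station as path | stats count, values(user) as users by path | sort - count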

Did that answer your question? Have you tried applying this to your Sankey visualization?

hylam
Contributor

Sankey doesn't show the subway map. I switched to a choropleth map with arrow-shaped islands overlaid on the geographical map. The islands are named station1_arrow_station2. The result set contains from, to, count. I used paths of exactly 2 steps. This is an unorthodox application of the choropleth map visualization.
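
A minimal sketch of a search that yields such a from,to,count result set (the field names are placeholders, and it ignores the 30-second session breaks for brevity, so a long gap would add one spurious pair per user):

| sort 0 user _time 
| streamstats current=f window=1 last(station) as from_station by user 
| where isnotnull(from_station) 
| stats count by from_station, station 
| rename from_station as from, station as to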

Richfez
SplunkTrust

So, what is the problem you are trying to solve?

hylam
Contributor

Given the csv file of the above schema, have you got any Splunk solution?
