I have multiple users making a request to a web server each time they type a character into a search box. User 1 is typing 'please' and user 22 is typing 'cat'. Simplified log entries:
ip=10.10.0.1&q=p
ip=10.10.1.22&q=c
ip=10.10.0.1&q=pl
ip=10.10.0.1&q=ple
ip=10.10.1.22&q=ca
ip=10.10.0.1&q=plea
ip=10.10.1.22&q=cat
ip=10.10.0.1&q=pleas
ip=10.10.0.1&q=please
I would like to:
Count the number of requests that are all part of the same 'typing' action. So for 'please' it would be 6 and for 'cat' it would be 3. It can be assumed that the user types a different word each time.
Count the number of distinct typing actions - so 1 for each user.
Any ideas?
Thanks
Robert
This is an interesting question. Assuming that the data is broken down in such a way that each occurrence is one event, then consider the following:
You can group the queries by ip very easily, and that would give you a deterministic list of queries sent by each user. The problem is that listing the matches in this manner does not tell you whether the queries actually belong to the same typing action. For instance, let's add “car” to your data. Then the grouping above no longer reflects which queries go together. That is: the queries share the same source IP, but the queries themselves are not necessarily related.
sourcetype="answers-1372961586" | stats list(q) AS q by ip
Clearly, “cat” and “car” do not belong together in this type of analysis.
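For illustration, assuming “car” is also typed by user 22, the grouped search would return something along these lines (derived from the sample log above, not actual output):

ip=10.10.0.1   q = p, pl, ple, plea, pleas, please
ip=10.10.1.22  q = c, ca, cat, car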
Perhaps you might want to calculate the likeness of a match based on whether the characters of one query are a prefix of another. In this manner you have an empirical match. Think of this like comparing two identical arrays on a two-dimensional plane: you are checking each value against the corresponding cell in that matrix. For instance:
sourcetype="answers-1372961586"
| stats list(q) AS q
| eval match=q
| mvexpand match
| eval match=match."%"
| mvexpand q
| where like(q,match)
| eval match=rtrim(match,"%")
| stats list(match) AS match count by q
| sort - count
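As a side note, the search above compares queries across all users, because the first stats drops the ip field. If you want to keep each user's queries separate, a sketch of the same idea carried through by ip (untested, same field names as above) might look like:

sourcetype="answers-1372961586"
| stats list(q) AS q by ip
| eval match=q
| mvexpand match
| eval match=match."%"
| mvexpand q
| where like(q,match)
| eval match=rtrim(match,"%")
| stats list(match) AS match count by ip, q
| sort - count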
In addition, you can also observe a probabilistic measurement of a query as the user types. For instance, if you isolate the query “please”, you'll see the likeness of each potential match grow as the word is typed.
sourcetype="answers-1372961586"
| stats list(q) AS q
| eval match=q
| mvexpand match
| eval match=match."%"
| mvexpand q
| where like(q,match)
| eval match=rtrim(match,"%")
| eval q_length=len(q)
| eval match_length=len(match)
| eval likeness=(round(match_length/q_length,2) * 100)." %"
| stats list(match) AS match list(likeness) AS likeness by q
| search q="please"
Thus, when the end user has typed at least the four characters p, l, e, a, there is an approximate 67% likeness that the query will match the word “please” (4 characters out of 6, 4/6 ≈ 0.67).
Of course, this is based on this minimal data set and a cursory evaluation of character matches… I am sure there are more intelligent ways to predict this in real life.
You can try starting with ... | stats values(q) as vq by ip and looking at | eval c=mvcount(vq). Though if you need to deal with multiple typing actions per user, you'll probably need the transaction command.
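To get directly at the two counts asked for, one sketch using transaction could be the following (the maxpause value is an assumption; tune it to however long a pause should end a typing action):

sourcetype="answers-1372961586"
| transaction ip maxpause=30s
| stats count AS typing_actions sum(eventcount) AS total_requests by ip

Here transaction groups each user's keystroke events into one typing action, eventcount (a field transaction produces) gives the number of requests in each action, and the final stats yields both the per-user count of typing actions and the total requests across them.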