I have multiple users making a request to a web server each time they type a character into a search box. User 1 is typing 'please' and user 22 is typing 'cat'. Simplified log entries:
ip=10.10.0.1&q=p
ip=10.10.1.22&q=c
ip=10.10.0.1&q=pl
ip=10.10.0.1&q=ple
ip=10.10.1.22&q=ca
ip=10.10.0.1&q=plea
ip=10.10.1.22&q=cat
ip=10.10.0.1&q=pleas
ip=10.10.0.1&q=please
I would like to:
Count the number of requests that are all part of the same 'typing' action. So for 'please' it would be 6 and for 'cat' it would be 3. It can be assumed that the user types a different word each time.
Count the number of distinct typing actions - so 1 for each user.
Any ideas?
Thanks
Robert
This is an interesting question. Assuming that the data is broken down in such a way that each occurrence is one event, then consider the following:
You can group the queries by ip very easily, and that would give you a deterministic list of queries sent by each user. The problem is that listing the matches in this manner does not tell you whether the queries actually belong to the same typing action. For instance, let's add “car” to your data. Then the grouping above no longer reflects which queries go together. That is: the queries share the same source IP, but the queries themselves are not necessarily related.
sourcetype="answers-1372961586" | stats list(q) AS q by ip
Clearly, “cat” and “car” do not belong together in this type of analysis.
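For illustration, assuming “car” is also typed by user 22, the grouped search would return something along these lines (derived from the sample log above, not actual output):

ip=10.10.0.1   q = p, pl, ple, plea, pleas, please
ip=10.10.1.22  q = c, ca, cat, car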
Perhaps you might want to calculate the likeness of a match based on whether the characters of one query are a prefix of another. In this manner you have an empirical match. Think of this like comparing two identical arrays on a two-dimensional plane: you are checking each value against the corresponding cell in that matrix. For instance:
sourcetype="answers-1372961586"
| stats list(q) AS q
| eval match=q
| mvexpand match
| eval match=match."%"
| mvexpand q
| where like(q,match)
| eval match=rtrim(match,"%")
| stats list(match) AS match count by q
| sort - count
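As a side note, the search above compares queries across all users, because the first stats drops the ip field. If you want to keep each user's queries separate, a sketch of the same idea carried through by ip (untested, same field names as above) might look like:

sourcetype="answers-1372961586"
| stats list(q) AS q by ip
| eval match=q
| mvexpand match
| eval match=match."%"
| mvexpand q
| where like(q,match)
| eval match=rtrim(match,"%")
| stats list(match) AS match count by ip, q
| sort - count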
In addition, you can also observe a probabilistic measurement of a query as the user types. For instance, if you isolate the query “please”, you'll see the likeness of each potential match grow as the word is typed.
sourcetype="answers-1372961586"
| stats list(q) AS q
| eval match=q
| mvexpand match
| eval match=match."%"
| mvexpand q
| where like(q,match)
| eval match=rtrim(match,"%")
| eval q_length=len(q)
| eval match_length=len(match)
| eval likeness=(round(match_length/q_length,2) * 100)." %"
| stats list(match) AS match list(likeness) AS likeness by q
| search q="please"
Thus, when the end user has typed at least the four characters p, l, e, a, there is an approximate 67% likeness that the query will match the word “please” (4 characters out of 6, 4/6 ≈ 0.67).
Of course, this is based on this minimal data set and a cursory evaluation of character matches… I am sure there are more intelligent ways to predict this in real life.
You can try starting with ... | stats values(q) as vq by ip and looking at | eval c=mvcount(vq). Though if you need to deal with multiple typing actions per user, you'll probably need the transaction command.
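To get directly at the two counts asked for, one sketch using transaction could be the following (the maxpause value is an assumption; tune it to however long a pause should end a typing action):

sourcetype="answers-1372961586"
| transaction ip maxpause=30s
| stats count AS typing_actions sum(eventcount) AS total_requests by ip

Here transaction groups each user's keystroke events into one typing action, eventcount (a field transaction produces) gives the number of requests in each action, and the final stats yields both the per-user count of typing actions and the total requests across them.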