I have a situation where a server is crashing as the result of a specific user accessing some specific web site. Don't know the user and don't know the web site. But the pattern is this:
When some user (call them userID) accesses some site (call it siteID), it causes the server to crash. All connected users drop, the server is automatically restarted, and within two minutes users are reconnecting. This results in a load pattern (number of connected users) that looks something like this (each number is the event count per sample interval):
55 57 47 65 52 58 1 2 6 11 23 34 45 56 58 62...
^
crash happens here
Here the event count drops by a quantifiable percentage, in this case going from 58/interval to something close to 1/interval. This pattern repeats over 24 hours: crashes occur at random times, but at least once or twice per hour.
So, the query I want would be something like:
Find every minute X where the event count is at least 10 times lower than the count in minute X-1. For the one-minute period X-1 to X, give me the list of all distinct userIDs and the siteID each one accessed. Then scan 24 hours of data looking for this pattern and give me the union of the data sets from each minute prior to a crash.
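The drop condition itself (minute X's count at most one tenth of minute X-1's) can be expressed by bucketing events into one-minute spans and comparing each bucket to the previous one with streamstats. A minimal sketch, assuming a single proxy host named webproxy01:

host="webproxy01" earliest=-24h
| timechart span=1m count AS events
| streamstats current=false window=1 last(events) AS prev_events
| where prev_events > 0 AND events <= prev_events / 10

Each surviving row's _time marks a suspected crash minute, with prev_events and events showing the drop (e.g. 58 down to 1).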
Finally, and here's what I'm looking for: the offending user. It is highly probable that just one user (or a very small set of users) is accessing a specific web site that trips up our server and crashes it while it is serving that site.
So, I want to find the userID that appears in (is common to) many of these X-1 intervals. Sort of like taking a bunch of X-1 data sets and performing an intersection of all userIDs across each data set.
I tried doing something like this to build one X-1 data set:
host="webproxy01" earliest=5/3/2010:18:48:0 latest=5/3/2010:18:49:0 | fields userID, siteID| dedup userID
Did it this way because I know the time of each crash (to the nearest minute).
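For a single known minute, stats can return each distinct userID together with every siteID that user hit, which keeps more information than dedup (which keeps only the first event per userID). A sketch, assuming the events carry userID and siteID fields:

host="webproxy01" earliest=5/3/2010:18:48:0 latest=5/3/2010:18:49:0
| stats values(siteID) AS sites by userID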
But then I have to create a bunch of saved queries (one for each X-1 interval) and run another query like this to get all the data:
| savedsearch A | append [| savedsearch B] | append [| savedsearch C] | stats count by userID | sort - count
I just started using Splunk yesterday to analyze this problem, so my approach probably reflects a newbie's. I'm sure there must be a better, more generic query that eliminates the need to explicitly specify the earliest/latest intervals and lets Splunk find those intervals itself by recognizing the drastic drop in events between minutes X-1 and X.
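One way to avoid hand-building a saved search per crash is to detect the drops and collect the pre-crash userIDs in a single pipeline: bucket events per minute, collect that minute's distinct userIDs, carry the previous minute's values forward with streamstats, keep only minutes where the count collapses, then count how many crash intervals each userID shows up in. A sketch, assuming a userID field and one host; a frequency count is used instead of a strict intersection so a user who misses one interval isn't discarded:

host="webproxy01" earliest=-24h
| bin _time span=1m
| stats count AS events, values(userID) AS users by _time
| streamstats current=false window=1 last(events) AS prev_events, last(users) AS prev_users
| where prev_events > 0 AND events <= prev_events / 10
| mvexpand prev_users
| rename prev_users AS userID
| stats count AS crash_intervals by userID
| sort - crash_intervals

Once a suspect userID bubbles to the top, a follow-up search filtered on that userID in the pre-crash minutes would reveal which siteID they were hitting.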
As a final point, there is more than one server I would want to look at: due to load balancing, the offending user will almost certainly connect to a different, random server in the pool after a crash, and eventually cause that server to crash too. 😞
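The multi-server case can be handled in one search by widening the host filter and grouping the per-minute comparison by host, so each server's drop is detected against its own history. A sketch, assuming the pool's hosts share a webproxy naming prefix and events carry a userID field:

host="webproxy*" earliest=-24h
| bin _time span=1m
| stats count AS events, values(userID) AS users by _time, host
| streamstats current=false window=1 last(events) AS prev_events, last(users) AS prev_users by host
| where prev_events > 0 AND events <= prev_events / 10
| mvexpand prev_users
| rename prev_users AS userID
| stats count AS crash_intervals by userID
| sort - crash_intervals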
Thanks in advance for any suggestions,
Mark