Getting Data In

How do you handle traffic spikes from servers?

twinspop
Influencer

Our developers tend to use syslog, um, carelessly. For example, one server yesterday decided to send out 1000 identical msgs per second to let us know its DB instance was down. By the time it was taken care of, our license was busted on that indexer, again. Too many violations this month, so we're down hard.

I'm thinking of crafting a scheduled search like so:

earliest=-60m * | eval size=len(_raw)/(1024*1024)
  | stats sum(size) as MB by host | where MB>50

Based on that search, I'd like to set up an alert script that would grab the offending servers' IPs, run "iptables -I INPUT 1 -s $IP -j DROP", and send out an email/SNMP trap that this has occurred.
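
Something like the following is what I have in mind for the script. It's only a rough sketch: it assumes the legacy scripted-alert interface, where Splunk hands the script the path to a gzipped CSV of the search results as the eighth command-line argument, and the addresses and SMTP host are placeholders.

#!/usr/bin/env python
# Sketch of an alert script, not production code. Assumes the legacy Splunk
# scripted-alert interface: sys.argv[8] is the path to a gzipped CSV of the
# triggering search's results. The "host" column matches the scheduled search
# above; email addresses and the SMTP host are placeholders.
import csv
import gzip
import smtplib
import socket
import subprocess
import sys
from email.mime.text import MIMEText

def block(ip):
    # Insert a DROP rule at the top of the INPUT chain for the noisy sender.
    subprocess.call(["iptables", "-I", "INPUT", "1", "-s", ip, "-j", "DROP"])

def notify(blocked):
    # Plain SMTP notification; an snmptrap call could go here instead.
    msg = MIMEText("Blocked noisy syslog senders: %s" % ", ".join(blocked))
    msg["Subject"] = "Splunk alert: syslog flood blocked"
    msg["From"] = "splunk@example.com"       # placeholder addresses
    msg["To"] = "oncall@example.com"
    s = smtplib.SMTP("localhost")            # placeholder SMTP relay
    s.sendmail(msg["From"], [msg["To"]], msg.as_string())
    s.quit()

def main():
    results_file = sys.argv[8]               # gzipped CSV of the alert's results
    blocked = []
    with gzip.open(results_file, "rt") as f:
        for row in csv.DictReader(f):
            host = row.get("host", "")
            if not host:
                continue
            ip = socket.gethostbyname(host)  # resolve hostname to an IP for iptables
            block(ip)
            blocked.append("%s (%s)" % (host, ip))
    if blocked:
        notify(blocked)

if __name__ == "__main__":
    main()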

However, in a distributed environment this task grows a little in complexity. Schedule the search on every indexer? Or only on the search head, and then make the script capable of sending the iptables commands to the indexers (see the sketch below)? Neither solution seems ideal.
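
If I went the search-head route, the script would have to push the rule out to each indexer itself, probably over SSH. A rough sketch of that part, where the indexer names and SSH account are placeholders and key-based access is assumed:

# Sketch of the "run it on the search head" variant: push the DROP rule to
# each indexer over SSH instead of calling iptables locally. Assumes key-based
# SSH for a dedicated account; the indexer list is illustrative.
import subprocess

INDEXERS = ["idx1.example.com", "idx2.example.com"]  # placeholder hostnames

def block_on_indexers(ip):
    for idx in INDEXERS:
        subprocess.call(
            ["ssh", "splunk-alert@%s" % idx,
             "sudo iptables -I INPUT 1 -s %s -j DROP" % ip]
        )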

So how do you deal with the occasional big spike in traffic? I'm trying to avoid manual intervention because we often get these spikes in the dead of night and I like to sleep.

0 Karma
1 Solution

hazekamp
Builder

I would recommend using Splunk's internal metrics for this:

index=_internal source=*metrics* group=per_host_thruput | rename series as host | eval MB=kb/1024 | stats sum(MB) as MB by host

Then save and schedule this search to run over your desired time window. Set the alert condition to fire when MB>50 and have it run a script. The script will be responsible for taking the hosts identified by the search, running iptables, and sending an email/trap.

You could have the Splunk alert handle the email part as well, depending on how you want to notify.


JSapienza
Contributor

Before 4.2 it was messy.
But now that I have been using the Deployment app in 4.2, it's been a breeze. I specifically use the "Forwarders Sending More Data Than Expected" search with an alert set to fire when any forwarder hits 20% over its "Average Daily KBps". This search uses the forwarder_metrics, which seem to be pretty reliable. I also have an alert set to fire if we hit 80% of our daily indexing license volume. This way I have the option to stop a forwarder or an indexer to prevent the license bust. At this time I am handling it manually.
However, I could use the "Run Script" action on the alerts to kick off a script to remotely stop the forwarder, stop the indexer, or take any other appropriate action.
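
For example, a minimal sketch of what such a "Run Script" action might do is below. The hostname, install path, and SSH account are placeholders; it assumes key-based SSH and that stopping the forwarder is an acceptable response.

# Sketch only: stop the Splunk instance on a runaway forwarder over SSH.
# Hostname, path, and account are placeholders.
import subprocess

def stop_forwarder(host):
    subprocess.call(
        ["ssh", "splunk@%s" % host,
         "/opt/splunkforwarder/bin/splunk stop"]
    )

stop_forwarder("noisy-forwarder.example.com")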

twinspop
Influencer

I need to look into this. That sounds interesting. Unfortunately, a lot of my data comes from syslog direct to Splunk.

0 Karma

netwrkr
Communicator

"I'm trying to avoid manual intervention because we often get these spikes in the dead of night and I like to sleep."

Amen to that. Even better is when a disk error occurs and spews 10 million lines of logs in ~5 minutes.

0 Karma