Solved: Splunk Add-on for Amazon Web Services: Why do I stop receiving events from some of my Cloudwatch log log-groups?

nickpayze · ‎12-08-2015

I am pulling data from 30-40 log groups across 3 regions using the Splunk Add-on for AWS. After about 10-15 minutes, I stop receiving the most recent events from half of my log groups. Data arrives fine from all log groups initially, but once the add-on has pulled the most recent data available at that time, it doesn't seem to check again for more. The delay and interval settings are at their defaults, and I've confirmed that the CloudWatch Logs service is receiving the most current events. My only clue is this event in the Splunk internal logs, which occurs for the affected log groups:

2015-12-08 17:52:22,328 INFO pid=7026 tid=Thread-298 file=aws_cloudwatch_logs.py:_do_was_job_func:130 | Previous job of the same task still running. Exit current job. region=us-west-2, log_group=syslog

This event seems to occur indefinitely every 10 minutes and Splunk never pulls more data from the log group again.

Any ideas?

nickpayze · ‎01-29-2016

Updating to the latest version of the AWS add-on (3.0.0) fixed the issue.

briancronrath · ‎10-02-2017

I was able to get around this issue by limiting the time range of the data being polled. The setting is in the Splunk Add-on for AWS console under Inputs -> Actions -> Edit -> Templates, specifically the "Only After" value.

henrikhuitti · ‎10-24-2016

We resolved this issue by switching from polling CloudWatch Logs directly to a Kinesis subscription. See http://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Subscriptions.html

We also got this answer from AWS:

Instead you should use the Kinesis subscription integration that Splunk apparently provides, but does not use by default. The default Splunk integration only works for very small customers. You should reach out to Splunk for support if needed on how to use Splunk with CloudWatch Logs.

amiller100 · ‎10-24-2016

I am also seeing the same throttling alerts in 4.1.1

henrikhuitti · ‎10-04-2016

Can confirm: I get throttling errors with version 4.1.0 and only 11 CloudWatch Logs log streams.

Failure in describing cloudwatch logs streams due to throttling exception for log_group=, sleep=5.98632069244, reason=Traceback (most recent call last):
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/cloudwatch_logs_mod/aws_cloudwatch_logs_data_loader.py", line 64, in describe_cloudwatch_log_streams
    group_name, next_token=buf["nextToken"])
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/boto/logs/layer1.py", line 308, in describe_log_streams
    body=json.dumps(params))
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/boto/logs/layer1.py", line 576, in make_request
    body=json_body)
JSONResponseError: JSONResponseError: 400 Bad Request
{u'__type': u'ThrottlingException', u'message': u'Rate exceeded'}

wsh · ‎04-11-2016

For what it's worth, @nickpayze, I'm seeing this on 3.0.0. 😞 Same throttling exception that you saw

lcasey001 · ‎10-04-2016

We have the same issue running the latest version, 4.1.0. It seems to run describe_log_streams against all log groups at the same time, which is probably what causes the throttling. This is especially a problem when you have a large number of log groups.
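To illustrate the idea of not hitting every log group at once, here is a minimal Python sketch that walks the log groups one at a time with a minimum gap between calls. It is not the add-on's code: `describe_groups_paced`, its parameters, and the 0.25 second default are hypothetical, and `describe` stands in for whatever function actually lists the streams of one log group.

```python
import time
from typing import Callable, Dict, Iterable, List


def describe_groups_paced(
    log_groups: Iterable[str],
    describe: Callable[[str], List[str]],
    min_interval: float = 0.25,  # assumed value, tune to your API rate limit
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, List[str]]:
    """Call describe(group) for one group at a time.

    Consecutive calls are started at least min_interval seconds apart,
    so the requests are spread out instead of being fired together.
    """
    results: Dict[str, List[str]] = {}
    last_call = None
    for group in log_groups:
        if last_call is not None:
            # Wait only for the part of the interval that has not elapsed yet.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results[group] = describe(group)
    return results
```

The `sleep` and `clock` parameters exist only so the pacing can be exercised without real waiting. Pacing like this reduces the request rate but does not replace retrying on a ThrottlingException.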

gsumner · ‎08-02-2016

Also seeing this issue on 4.0.0

nickpayze · ‎01-12-2016

I found a throttling exception ERROR in the internal logs that may be another clue. Could this be the culprit?

2015-12-10 16:21:51,357 ERROR pid=24928 tid=Thread-23 file=util.py:describe_cloudwatch_log_streams:118 | Failure in describing cloudwatch logs streams due to throttling exception for log_group=kern.log, sleep=5.96629281236, reason=Traceback (most recent call last):
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/aws_cloudwatch_logs_resources/util.py", line 108, in describe_cloudwatch_log_streams
    group_name, next_token=buf["nextToken"])
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/boto/logs/layer1.py", line 308, in describe_log_streams
    body=json.dumps(params))
  File "/opt/splunk/etc/apps/Splunk_TA_aws/bin/boto/logs/layer1.py", line 576, in make_request
    body=json_body)
JSONResponseError: JSONResponseError: 400 Bad Request
{u'message': u'Rate exceeded', u'__type': u'ThrottlingException'}

kyleguillot · ‎12-28-2015

I'm seeing the same behavior with Splunk running on Windows 7

bwooden · ‎12-09-2015

What OS is being used to host Splunk?

nickpayze · ‎12-09-2015

Ubuntu 14.04

bwooden · ‎12-09-2015

Ubuntu's dash shell returns a different SIGTERM than bash, resulting in orphaned input processes. This was meant to have been resolved in TA version 2.0.1 (which is why rpille asked which version). At first glance, it appears this condition is being detected and partially handled (additional processes aren't spawned when orphaned processes exist, yet the orphaned process is not terminated). I'll file a new bug for this and explore workarounds.

bwooden · ‎12-09-2015

Hi @nickpayze, can you try adding start_by_shell=false to the [aws_cloudwatch_logs] configuration in inputs.conf and restarting Splunk?

nickpayze · ‎12-10-2015

Will I have to wait until this issue is resolved in the next version of the aws add-on?

azhang_splunk · ‎12-11-2015

Could you turn on the debug log and double-check whether you can find the log messages "Start to describe streams **" and "Job ended. region **" for each interval? The log group name should be printed in those messages.

nickpayze · ‎12-14-2015

I do not see any "Job ended" messages for any of my log groups.

I see many "Start to describe streams" messages (every few seconds) for the log groups I am still receiving events from, and the "Previous job of the same task still running" message every 10 minutes for the log groups I stopped receiving events from.
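For anyone who wants to run the same check across many log groups, here is a rough Python sketch that counts the three messages per log group and lists the groups that were skipped but never finished. It assumes each message carries a `log_group=<name>` field like the "Previous job" event quoted at the top of the thread. The exact format of the "Start to describe streams" and "Job ended" debug lines is an assumption, and `summarize_jobs` and `stuck_groups` are hypothetical helper names.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumption: every relevant line contains "log_group=<name>".
LOG_GROUP_RE = re.compile(r"log_group=([^,\s]+)")

# Substrings taken from the messages quoted in this thread.
MARKERS = {
    "started": "Start to describe streams",
    "ended": "Job ended",
    "skipped": "Previous job of the same task still running",
}


def summarize_jobs(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Count started/ended/skipped messages per log group."""
    counts = defaultdict(lambda: {key: 0 for key in MARKERS})
    for line in lines:
        match = LOG_GROUP_RE.search(line)
        if match is None:
            continue
        for key, text in MARKERS.items():
            if text in line:
                counts[match.group(1)][key] += 1
    return dict(counts)


def stuck_groups(counts: Dict[str, Dict[str, int]]) -> List[str]:
    """Log groups with skipped jobs and no "Job ended" message."""
    return sorted(
        group
        for group, c in counts.items()
        if c["skipped"] > 0 and c["ended"] == 0
    )
```

Feeding it the lines of the add-on's log file (for example `summarize_jobs(open(path))`, with `path` pointing at wherever your add-on writes its log) should give a quick list of the log groups whose jobs never complete.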

nickpayze · ‎12-09-2015

I've added the setting, and it does get rid of the bash process that runs alongside the Python process for aws_cloudwatch_logs.py. I am still getting the same behavior as before though. 😞

rpille_splunk · ‎12-08-2015

What version of the add-on are you running?

nickpayze · ‎12-09-2015

version 2.0.1

One thing I forgot to mention: when I restart the Splunk server, the same behavior repeats. It pulls all data from all log groups again up to the most recent events, then stops and shows that message.
