All Apps and Add-ons

Amazon Kinesis Modular Input: How to troubleshoot why Kinesis streams are lagging behind real time?

ejharts2015
Communicator

We've recently started using the Kinesis Splunk Add-on for our clustered Splunk environment. We have it installed on our heavy forwarder, which then forwards the events on to the Splunk indexer cluster.

We've been having issues where the Kinesis streams "lag" or fall behind real time when large batches of logs come in. Other readers from these streams (like ELK) aren't having this issue and seem to handle the increased load just fine; only Splunk is lagging behind. Kinesis shows we are below the threshold for reads/writes.

Do you have any pointers/tips for where to start diagnosing this issue?
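For context, one way we've been watching the lag is via the stream's iterator-age metric in CloudWatch, with something like the following (stream name, region, and time window are ours; `GetRecords.IteratorAgeMilliseconds` climbs when a consumer falls behind the tip of the stream):

```
aws cloudwatch get-metric-statistics \
  --namespace AWS/Kinesis \
  --metric-name GetRecords.IteratorAgeMilliseconds \
  --dimensions Name=StreamName,Value=host-logging \
  --start-time 2016-07-28T14:30:00Z --end-time 2016-07-28T14:45:00Z \
  --period 60 --statistics Maximum \
  --region us-east-1
```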

0 Karma

Damien_Dallimor
Ultra Champion

What is your architecture? One instance of the Kinesis Modular Input running on a single forwarder?

Can you share your inputs.conf file? I can show you how to enable multithreading, i.e. multiple worker instances consuming from the Kinesis stream in parallel. You can see in the code that each thread spawned (driven by your inputs.conf file) gets a unique workerID.

Along with this, you can also boost your JVM memory, which is simple to do in kinesis_ta/bin/kinesis.py.
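As a sketch of the kind of change (the exact way kinesis.py assembles the launch command may differ; the flag values below are illustrative, not a recommendation for any particular box):

```
# in kinesis_ta/bin/kinesis.py -- raise the JVM heap from the 256 MB default,
# e.g. change the -Xms/-Xmx flags on the java launch command to something like:
java -classpath /opt/splunk/etc/apps/kinesis_ta/bin/lib/* -Xms512m -Xmx1024m ...
```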

0 Karma

ejharts2015
Communicator

Our architecture:
- 3 clustered indexers (c4.2xlarge)
- 3 clustered search heads (c4.2xlarge)
- 1 heavy forwarder (which forwards to the indexers) - Kinesis TA is installed here. (c4.xlarge)
- 1 master/deployer combo box (c3.large)

I'd be interested in learning how to enable multithreading to see if that makes a difference. We do have a separate testing environment, so we're happy to try suggestions out!

According to a "ps aux", kinesis.py is running with these args:
java -classpath /opt/splunk/etc/apps/kinesis_ta/bin/lib/* -Xms256m -Xmx256m

Our inputs.conf is filled with stanzas such as the one below:
[kinesis://host-east]
app_name = host-east
aws_access_key_id = [the_key]
aws_secret_access_key = [the_secret]
hec_batch_mode = 0
hec_https = 0
host = host-east
index = kinesis-east
initial_stream_position = TRIM_HORIZON
kinesis_endpoint = https://kinesis.us-east-1.amazonaws.com
message_handler_impl = com.splunk.modinput.kinesis.JSONOnlyMessageHandler
output_type = stdout
sourcetype = kinesis
stream_name = host-logging

0 Karma

Damien_Dallimor
Ultra Champion

This will run 2 worker threads, with the common settings inherited from the parent [kinesis] stanza.

inputs.conf

[kinesis]
app_name = host-east
aws_access_key_id = [the_key]
aws_secret_access_key = [the_secret]
hec_batch_mode = 0
hec_https = 0
host = host-east
index = kinesis-east
initial_stream_position = TRIM_HORIZON
kinesis_endpoint = https://kinesis.us-east-1.amazonaws.com
message_handler_impl = com.splunk.modinput.kinesis.JSONOnlyMessageHandler
output_type = stdout
sourcetype = kinesis
stream_name = host-logging

[kinesis://host-east_thread1]
disabled=0

[kinesis://host-east_thread2]
disabled=0
0 Karma

ejharts2015
Communicator

We have an internal tool that can generate logs at a higher rate than our normal average, which allows us to test this "lagging" behind real time. There were no read/write ProvisionedThroughputExceeded issues during the test. Here are the results with the above config:

Before MultiThreading:
| Test #5 - 7/28 (Started 14:35:19 & Stopped: 14:40:27 = 5.13 minutes)
| Shards - 4 shards
| Settings: TRIM_HORIZON
| Total Time Lagging: 14:34 to 14:41 (7 minutes)
| Max lag: 23 seconds (3 seconds = norm)
| Total Logs = 52,517

With MultiThreading:
| Test #8 - 8/9 (Started 10:24 & Stopped 10:31 = 7 minutes)
| Shards - 4 shards
| Settings: TRIM_HORIZON
| Total Time Lagging: 10:25 - 10:43 (18 minutes)
| Max lag: 480 seconds (3 seconds = norm)
| Total Logs = 52,185
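For reference, a quick back-of-envelope check of the test volume (assuming the standard Kinesis per-shard write limit of 1,000 records/s) confirms we're nowhere near shard capacity, consistent with seeing no throughput-exceeded errors:

```python
# Back-of-envelope throughput check for Test #5.
duration_s = 5 * 60 + 8          # 14:35:19 -> 14:40:27 = 308 s (~5.13 min)
total_records = 52_517
shards = 4

records_per_sec = total_records / duration_s
per_shard = records_per_sec / shards  # vs. ~1,000 records/s/shard write limit

print(f"{records_per_sec:.0f} records/s total, {per_shard:.0f}/s per shard")
# prints: 171 records/s total, 43/s per shard
```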

0 Karma

ejharts2015
Communicator

Any other thoughts? We've tried decreasing the backoff time to 30 and increasing the retries to 100, but we're still experiencing lag.

Also, whenever we try to change those default values, I see this in the splunkd logs:

08-25-2016 17:44:55.734 +0000 ERROR ModularInputs - <stderr> Argument validation for scheme=kinesis:  Can't determine checkpoint interval value, will revert to default value.

08-25-2016 17:44:55.733 +0000 ERROR ModularInputs - <stderr> Argument validation for scheme=kinesis:  Can't determine hec port value, will revert to default value.

08-25-2016 17:44:55.732 +0000 ERROR ModularInputs - <stderr> Argument validation for scheme=kinesis:  Can't determine hec poolsize value, will revert to default value.

08-25-2016 17:44:55.732 +0000 ERROR ModularInputs - <stderr> Argument validation for scheme=kinesis:  Can't determine max_inactive_time_before_batch_flush value, will revert to default value.

08-25-2016 17:44:55.732 +0000 ERROR ModularInputs - <stderr> Argument validation for scheme=kinesis:  Can't determine max_inactive_time_before_batch_flush value, will revert to default value.

08-25-2016 17:44:55.732 +0000 ERROR ModularInputs - <stderr> Argument validation for scheme=kinesis:  Can't determine max_batch_size_bytes value, will revert to default value.
0 Karma