All Apps and Add-ons

Amazon Kinesis Modular Input: How to troubleshoot why Kinesis streams are lagging behind real time?

ejharts2015
Communicator

We've recently started using the Kinesis Splunk Add-on for our clustered Splunk environment. We have it installed on our heavy forwarder, which then forwards the events on to the Splunk indexer cluster.

We've been having issues where the Kinesis streams "lag" or fall behind real time when large batches of logs come in. Other readers from these streams (like ELK) aren't having this issue and seem to handle the increased load just fine; only Splunk is lagging behind. Kinesis shows we are below the threshold for reads/writes.

Do you have any pointers/tips for where to start diagnosing this issue?
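For context, one way we've been watching the lag is via the stream's iterator-age metric in CloudWatch, with something like the following (stream name, region, and time window are ours; `GetRecords.IteratorAgeMilliseconds` climbs when a consumer falls behind the tip of the stream):

```
aws cloudwatch get-metric-statistics \
  --namespace AWS/Kinesis \
  --metric-name GetRecords.IteratorAgeMilliseconds \
  --dimensions Name=StreamName,Value=host-logging \
  --start-time 2016-07-28T14:30:00Z --end-time 2016-07-28T14:45:00Z \
  --period 60 --statistics Maximum \
  --region us-east-1
```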

0 Karma

Damien_Dallimor
Ultra Champion

What is your architecture? One instance of the Kinesis Modular Input running on a single forwarder?

Can you share your inputs.conf file? I can show you how to enable multithreading, i.e. multiple worker instances consuming from the Kinesis stream in parallel. You can see in the code that each thread spawned (driven by your inputs.conf file) gets a unique workerID.

Along with this, you can also boost your JVM memory, which is simple to do in kinesis_ta/bin/kinesis.py.
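As a sketch of the kind of change (the exact way kinesis.py assembles the launch command may differ; the flag values below are illustrative, not a recommendation for any particular box):

```
# in kinesis_ta/bin/kinesis.py -- raise the JVM heap from the 256 MB default,
# e.g. change the -Xms/-Xmx flags on the java launch command to something like:
java -classpath /opt/splunk/etc/apps/kinesis_ta/bin/lib/* -Xms512m -Xmx1024m ...
```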

0 Karma

ejharts2015
Communicator

Our architecture:
- 3 clustered indexers (c4.2xlarge)
- 3 clustered search heads (c4.2xlarge)
- 1 heavy forwarder (which forwards to the indexers) - Kinesis TA is installed here. (c4.xlarge)
- 1 master/deployer combo box (c3.large)

I'd be interested in learning how to enable multithreading to see if that makes a difference. We do have a separate testing environment, so we're happy to try suggestions out!

According to a "ps aux", kinesis.py is running with these args:
java -classpath /opt/splunk/etc/apps/kinesis_ta/bin/lib/* -Xms256m -Xmx256m

Our inputs.conf is filled with stanzas such as the one below:
[kinesis://host-east]
app_name = host-east
aws_access_key_id = [the_key]
aws_secret_access_key = [the_secret]
hec_batch_mode = 0
hec_https = 0
host = host-east
index = kinesis-east
initial_stream_position = TRIM_HORIZON
kinesis_endpoint = https://kinesis.us-east-1.amazonaws.com
message_handler_impl = com.splunk.modinput.kinesis.JSONOnlyMessageHandler
output_type = stdout
sourcetype = kinesis
stream_name = host-logging

0 Karma

Damien_Dallimor
Ultra Champion

This will run 2 worker threads, with the common settings inherited from the parent [kinesis] stanza.

inputs.conf

[kinesis]
app_name = host-east
aws_access_key_id = [the_key]
aws_secret_access_key = [the_secret]
hec_batch_mode = 0
hec_https = 0
host = host-east
index = kinesis-east
initial_stream_position = TRIM_HORIZON
kinesis_endpoint = https://kinesis.us-east-1.amazonaws.com
message_handler_impl = com.splunk.modinput.kinesis.JSONOnlyMessageHandler
output_type = stdout
sourcetype = kinesis
stream_name = host-logging

[kinesis://host-east_thread1]
disabled=0

[kinesis://host-east_thread2]
disabled=0
0 Karma

ejharts2015
Communicator

We have an internal tool that can generate logs at a higher rate than our normal average, which allows us to test this "lagging" behind real time. There were no read/write ProvisionedThroughputExceeded issues during the test. Here are the results with the above config:

Before MultiThreading:
| Test #5 - 7/28 (Started 14:35:19 & Stopped: 14:40:27 = 5.13 minutes)
| Shards - 4 shards
| Settings: TRIM_HORIZON
| Total Time Lagging: 14:34 to 14:41 (7 minutes)
| Max lag: 23 seconds (3 seconds = norm)
| Total Logs = 52,517

With MultiThreading:
| Test #8 - 8/9 (Started 10:24 & Stopped 10:31 = 7 minutes)
| Shards - 4 shards
| Settings: TRIM_HORIZON
| Total Time Lagging: 10:25 - 10:43 (18 minutes)
| Max lag: 480 seconds (3 seconds = norm)
| Total Logs = 52,185
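For reference, a quick back-of-envelope check of the test volume (assuming the standard Kinesis per-shard write limit of 1,000 records/s) confirms we're nowhere near shard capacity, consistent with seeing no throughput-exceeded errors:

```python
# Back-of-envelope throughput check for Test #5.
duration_s = 5 * 60 + 8          # 14:35:19 -> 14:40:27 = 308 s (~5.13 min)
total_records = 52_517
shards = 4

records_per_sec = total_records / duration_s
per_shard = records_per_sec / shards  # vs. ~1,000 records/s/shard write limit

print(f"{records_per_sec:.0f} records/s total, {per_shard:.0f}/s per shard")
# prints: 171 records/s total, 43/s per shard
```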

0 Karma

ejharts2015
Communicator

Any other thoughts? We've tried decreasing the backoff time to 30 and increasing the retries to 100, but we're still experiencing lag.

Also, whenever we try to change those default values, I see this in the splunkd logs:

08-25-2016 17:44:55.734 +0000 ERROR ModularInputs - <stderr> Argument validation for scheme=kinesis:  Can't determine checkpoint interval value, will revert to default value.

08-25-2016 17:44:55.733 +0000 ERROR ModularInputs - <stderr> Argument validation for scheme=kinesis:  Can't determine hec port value, will revert to default value.

08-25-2016 17:44:55.732 +0000 ERROR ModularInputs - <stderr> Argument validation for scheme=kinesis:  Can't determine hec poolsize value, will revert to default value.

08-25-2016 17:44:55.732 +0000 ERROR ModularInputs - <stderr> Argument validation for scheme=kinesis:  Can't determine max_inactive_time_before_batch_flush value, will revert to default value.

08-25-2016 17:44:55.732 +0000 ERROR ModularInputs - <stderr> Argument validation for scheme=kinesis:  Can't determine max_inactive_time_before_batch_flush value, will revert to default value.

08-25-2016 17:44:55.732 +0000 ERROR ModularInputs - <stderr> Argument validation for scheme=kinesis:  Can't determine max_batch_size_bytes value, will revert to default value.
0 Karma