Splunk DB Connect 3: Why am I getting HEC 503 Errors?

Muryoutaisuu
Communicator

Hello guys

We have a dedicated heavy forwarder instance that is used as a database connector with the app splunk_app_db_connect. We have about 100 enabled inputs, and most of them run every 60 seconds on a follow-tail basis. On most inputs max_row is configured to 10000000, but the returned number of events is usually nowhere near that limit.

We are getting these irregularly recurring error messages in /opt/splunk/var/log/splunk/splunk_app_db_connect_server.log from multiple (3-10) inputs at the same time:

2018-01-05 10:15:17.509 +0100  [QuartzScheduler_Worker-30] ERROR org.easybatch.core.job.BatchJob - Unable to write records
java.io.IOException: HTTP Error 503: Service Unavailable
    at com.splunk.dbx.server.dbinput.recordwriter.HttpEventCollector.uploadEventBatch(HttpEventCollector.java:112)
    at com.splunk.dbx.server.dbinput.recordwriter.HttpEventCollector.uploadEvents(HttpEventCollector.java:89)
    at com.splunk.dbx.server.dbinput.task.processors.HecEventWriter.writeRecords(HecEventWriter.java:48)
    at org.easybatch.core.job.BatchJob.writeBatch(BatchJob.java:203)
    at org.easybatch.core.job.BatchJob.call(BatchJob.java:79)
    at org.easybatch.extensions.quartz.Job.execute(Job.java:59)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)

The stack trace indicates that the issue may lie within HEC (HttpEventCollector). Is it possible that HEC is overloaded? HEC is currently configured with the default settings.

In splunk_httpinput/default/inputs.conf we saw the following options:

[http]
dedicatedIoThreads=2
maxThreads = 0
maxSockets = 0
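
For reference, if a change were ever advised, it would belong in a local override rather than an edit to default/. A sketch, assuming a default $SPLUNK_HOME install (the value 4 is purely illustrative):

```ini
# $SPLUNK_HOME/etc/apps/splunk_httpinput/local/inputs.conf
# Hypothetical override -- per the docs, only apply this if Splunk Support
# advises it, and keep the value at or below the number of physical CPU cores.
[http]
dedicatedIoThreads = 4
```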

But in the documentation on http://dev.splunk.com/view/event-collector/SP-CAAAE6Q#httpstanza I found the following for dedicatedIoThreads:

The number of dispatcher threads on the HTTP Event Collector server. The default value is 2. This setting should not be altered unless you have been requested to do so by Splunk Support. The value of this parameter should never be more than the number of physical CPU cores on your Splunk Enterprise server.

Can somebody tell me whether setting dedicatedIoThreads higher may indeed resolve my problem? Sadly, we get these errors only on our production platform, and we'd rather not tinker with options that shouldn't be changed without instructions.

Version notes:
Splunk version: 6.6.3 (build e21ee54bc796)
splunk_app_db_connect version: 3.1.0

1 Solution

Muryoutaisuu
Communicator

Upgrading to version 3.1.3 seems to have resolved this issue in our case.

Tbmaness
New Member

Not sure if your issue has been resolved, but this is what worked for me. I was having this exact issue with my implementation, and the problem turned out to be file ownership. I noticed all my applications were running as the "splunk" user, while "./splunk_app_db_connect" had root as the owner. Change the ownership of the entire app directory to match all of your other apps.

For example, I ran 'chown -R splunk:splunk ./splunk_app_db_connect' on the directory to assign splunk as the owner of all the files. Obviously, assign whichever user you use in your implementation.

dvergnes_splunk
Splunk Employee

Hi,

A 503 means that the HEC queue used to send data to the indexers is full. The problem is identifying where the bottleneck is:
1) is it the heavy forwarder?
2) is it the network between the heavy forwarder and the indexers?
3) is it the indexers?

To help you diagnose, you can check the following things:
- CPU usage on heavy forwarder and indexers
- queue size on indexers

If the CPU is high (more than 90%) on one of the components, that's where you should focus your troubleshooting. If everything seems normal, it might be a network issue.
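
For the queue-size check, one way is to search the indexers' internal metrics log; this is a sketch, assuming the standard metrics.log queue fields (adjust the host filter to match your indexers):

```
index=_internal source=*metrics.log* group=queue name=indexqueue
| timechart span=1m avg(eval(current_size_kb / max_size_kb * 100)) AS fill_pct BY host
```

A fill percentage that sits near 100 during the error windows would point at the indexers rather than the forwarder.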

Finally, another test you can run is to index the data locally on the heavy forwarder. If, after doing so, there are no more 503 errors, the bottleneck is clearly downstream (network or indexers).
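
Assuming the heavy forwarder forwards via outputs.conf, local indexing for this test can be enabled with the indexAndForward stanza. A sketch (revert it after testing, since it stores a local copy of the data on the forwarder):

```ini
# $SPLUNK_HOME/etc/system/local/outputs.conf on the heavy forwarder
# Keep a local indexed copy in addition to forwarding, for troubleshooting only.
[indexAndForward]
index = true
```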

harsmarvania57
SplunkTrust

I'd suggest opening a support case with Splunk, as this error is related to the HEC instance embedded in the DB Connect app, not a standalone HEC.

p_gurav
Champion

Muryoutaisuu
Communicator

Hi p_gurav,
That's not the issue. We haven't configured HEC to run as if on a deployment server.

$ grep useDeploymentServer /opt/splunk/etc/apps/splunk_httpinput/*/inputs.conf
/opt/splunk/etc/apps/splunk_httpinput/default/inputs.conf:useDeploymentServer=0