All Apps and Add-ons

What is causing Authentication Failures between Scheduler and DCN for Splunk Add-on for NetApp Data ONTAP?

timothywatson
Path Finder
  • Search Head running Splunk 6.5.2
  • DCN is running Splunk 6.6.1
  • Splunk Add-on for NetApp Data ONTAP (Splunk_TA_ontap) is version 2.1.5, with the Splunk Add-on for VMware version 3.3.2 installed "over the top" as advised, to get the newest helpers
  • Removed unneeded SAs and TAs as directed
  • Search Head requires https for Splunk Web (using default splunk certs)
  • Scheduler runs on Search Head (inputs)
  • DCN is Heavy Forwarder and also requires https for Splunk Web (using default splunk certs)
  • DCN has allowRemoteLogin=always in "local" server.conf
  • all Workers are enabled on DCN (inputs)
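For reference, the allowRemoteLogin setting mentioned above lives in server.conf on the DCN. A minimal sketch of the local file ($SPLUNK_HOME/etc/system/local/server.conf):

```ini
[general]
# Permit the Scheduler to log in to this splunkd remotely,
# even though the admin password management rules would
# otherwise block remote login.
allowRemoteLogin = always
```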

Lots of seemingly random "broken pipe" and other communication errors with the NetApp appliances, even though the Scheduler setup page confirms that connectivity and authentication are good. Likewise, the setup page confirms connectivity to the DCN with the correct password (which is not "changeme" ;).

Another symptom is that splunkd eventually stops on the DCN, though I don't see anything fatal in splunkd.log before it simply stops logging!

I definitely understand the layout of the Scheduler, DCN, Hydra workers, etc. There is just a wrinkle somewhere that is keeping Splunk from settling into a stable pattern.

Here's an error message from _internal with DEBUG level information included:

2017-06-30 16:16:33,062 ERROR [ta_ontap_collection_scheduler://nidhogg] Problem with hydra scheduler ta_ontap_collection_scheduler://nidhogg:
 [HTTP 401] Client is not authenticated
Traceback (most recent call last):
  File "/opt/splunk/etc/apps/SA-Hydra/bin/hydra/hydra_scheduler.py", line 2102, in run
    self.node_manifest = self.establishNodeManifest()
  File "/opt/splunk/etc/apps/SA-Hydra/bin/hydra/hydra_scheduler.py", line 2019, in establishNodeManifest
    for node_stanza in node_stanzas:
  File "/opt/splunk/lib/python2.7/site-packages/splunk/models/base.py", line 133, in _result_iter
    self._fill_cache()
  File "/opt/splunk/lib/python2.7/site-packages/splunk/models/base.py", line 140, in _fill_cache
    self._results_cache.append(self._iter.next())
  File "/opt/splunk/etc/apps/SA-Hydra/bin/hydra/models.py", line 256, in iterator
    sessionKey=self._sessionKey)
  File "/opt/splunk/lib/python2.7/site-packages/splunk/models/base.py", line 297, in get_entities
    return splunk.entity.getEntities(self.manager.model.resource, unique_key='id', uri=self._uri, namespace=self._namespace, owner=self._owner, **kwargs)
  File "/opt/splunk/lib/python2.7/site-packages/splunk/entity.py", line 129, in getEntities
    atomFeed = _getEntitiesAtomFeed(entityPath, namespace, owner, search, count, offset, sort_key, sort_dir, sessionKey, uri, hostPath, **kwargs)
  File "/opt/splunk/lib/python2.7/site-packages/splunk/entity.py", line 222, in _getEntitiesAtomFeed
    serverResponse, serverContent = rest.simpleRequest(uri, getargs=kwargs, sessionKey=sessionKey, raiseAllErrors=True)
  File "/opt/splunk/lib/python2.7/site-packages/splunk/rest/__init__.py", line 530, in simpleRequest
    raise splunk.AuthenticationFailed
AuthenticationFailed: [HTTP 401] Client is not authenticated
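For context on where this 401 comes from: the scheduler logs in to the DCN's splunkd management port (8089 by default) via the REST endpoint /services/auth/login, then passes the returned sessionKey on every subsequent REST call as an `Authorization: Splunk <sessionKey>` header. A 401 at this point means the key the scheduler is holding is missing, expired, or was minted by a different splunkd instance. A minimal sketch of the handshake's response handling (the sample XML and function name are mine, not from SA-Hydra):

```python
# Sketch: parse the sessionKey out of a splunkd /services/auth/login
# response. If this key later goes stale, REST calls made with it
# fail with "[HTTP 401] Client is not authenticated", as seen above.
import xml.etree.ElementTree as ET

def parse_session_key(response_xml):
    """Extract the sessionKey from a login response body."""
    root = ET.fromstring(response_xml)
    node = root.find("sessionKey")
    if node is None or not node.text:
        raise ValueError("no sessionKey in login response")
    return node.text

# Hypothetical response body, shaped like splunkd's actual reply:
sample = "<response><sessionKey>abc123deadbeef</sessionKey></response>"
key = parse_session_key(sample)
# The key is then sent on every REST call as:
#   Authorization: Splunk <sessionKey>
```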

Any ideas from the intrepid initiates of ONTAP?


timothywatson
Path Finder

To whom it may concern. I continue to work through stability issues with both the ONTAP and the VMWare Data Collection Nodes. Here is what I have done so far - with limited results:
- Ensured that the DCNs and the Search Heads (where the Schedulers run) are on the same version of Splunk
- Ensured that the DCN was a full Heavy Forwarder
- METICULOUSLY followed the Setup Instructions in the TA Installation Guide, followed by METICULOUSLY following the Setup Instructions in the App Installation Guide

The ONTAP Data Collection Node seems to be reasonably well-behaved, often running for more than a week without failure. I noted "authentication failures" in the logs, but have found that it collects data just fine even when those errors are streaming every few seconds. As such, I simply monitor for the condition where no ONTAP data has been collected in over an hour. When it does fail, restarting Splunk is not sufficient to recover Data Collection. In every case, I must reboot the whole machine to recover. I find that HIGHLY UNUSUAL for a Linux Server!

The VMware Data Collection Node seems much more fussy - RARELY running for a whole week without a failure. In this case, the logs are full of (paraphrasing) "unrecoverable socket communication errors". As above, restarting Splunk is not sufficient to recover - Linux must be fully rebooted to reestablish communications and data collection.

Helpful Search Terms:
- index=ontap earliest=-59m | head 1 (run this hourly to check for a stalled DCN. Alert Condition = "results < 1")
- index=vmware* earliest=-59m | head 1 (run this hourly to check for a stalled DCN. Alert Condition = "results < 1")
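If you want these checks as scheduled alerts rather than manual searches, here is a sketch of an equivalent savedsearches.conf stanza (the stanza name and cron schedule are my own choices; adjust to taste):

```ini
[ONTAP DCN stalled]
search = index=ontap earliest=-59m | head 1
enableSched = 1
cron_schedule = 0 * * * *
# Fire when the search returns fewer than 1 event,
# i.e. no ONTAP data indexed in the last 59 minutes.
alert_type = number of events
alert_comparator = less than
alert_threshold = 1
```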

Lessons Learned:
- Follow Instructions Carefully
- Don't Freak Out (that's a tough one!)
- Does anyone know whether the VM Images for the DCNs work flawlessly???
- Has anyone gotten their own hand-crafted DCNs to work flawlessly???

0 Karma

timothywatson
Path Finder

Come on Splunk! This is bogus!

I create the link to the DCN in the Scheduler and there is no difficulty validating the link/port/user/password/etc. But the RUNTIME cannot Authenticate??? What could possibly introduce an Authentication Error in the Runtime when the Scheduler connects EVERY time???

Please respond. Is this a known problem or am I just REALLY unlucky?
