Is there any way to speed up eStreamer eNcore data collection?

att35
Builder

Hi,

We are running the eStreamer eNcore app with default settings, but it looks like it's not able to catch up with the FMC. Currently we are receiving data with a timestamp of Aug 7th 7:53 PM. We do see the timestamp moving forward, but it's extremely slow.

I tried changing the monitor.period setting in estreamer.conf to 30 seconds, and even to 1 second just to see what happens, but eNcore log files are still generated at 120-second intervals, which is the default setting the app came with. Splunk has been restarted after the change, but the new setting doesn't seem to take effect.

Period changed to 10 seconds:
[screenshot]

Log monitoring frequency still at 2 minutes:
[screenshot]
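
For reference, this is roughly how I have been double-checking things from the command line; the paths below are from our install, and I'm assuming the conf is the JSON file shipped under bin/encore, so adjust as needed:

# Show the monitor block of the config on disk (path and JSON layout assumed)
grep -A 3 '"monitor"' /opt/splunk/etc/apps/TA-eStreamer/bin/encore/estreamer.conf

# Watch how often new output files appear; we still see a new encore.*.log
# roughly every 120 seconds regardless of the period value
watch -n 10 'ls -lt /opt/splunk/etc/apps/TA-eStreamer/bin/encore/data | head'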

Has anyone been able to successfully modify the default settings for this app?

Thanks,

~ Abhi

1 Solution

sastrach
Path Finder

There is no way to speed up eStreamer eNcore through configuration today.

Your options are:

  1. Limit the events being sent by the server (e.g. turn off connections)
  2. Increase CPU power (if you're running in a VM then give at least 2 cores)
  3. Ensure a good network connection to the FMC
  4. Ensure you have good output disk performance

A few things worth noting:

  • monitor.period refers to the frequency with which the monitor thread asks encore how it's doing. You will see the results of this in sourcetype=cisco:estreamer:log. It has nothing to do with performance; in fact, the more frequently you ask, the slower encore will be. I would recommend you keep this at 120 seconds.
  • The difference in file timestamps has nothing to do with monitor.period. Each file is written to until it has 10,000 lines in it (configurable, although not in the Splunk UI). What we're seeing in your screenshot is that on average you're writing about 10,000 events per 2 minutes, which is roughly 83 events per second (see the rough check after this list).
  • I have managed to get 350 events/sec using just one core of a Celeron CPU on a VM (and around 1200 using an i7 with SSD). At 83 ev/s I don't think encore will break a sweat.
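
If you want to sanity-check the throughput yourself, something like the following rough shell check works; the data path below is an assumption based on a default TA-eStreamer install, so adjust it to wherever your output files land:

# Move to the eNcore output directory (path assumed for a default install)
cd /opt/splunk/etc/apps/TA-eStreamer/bin/encore/data
# Count lines (events) per output file; with default settings each file covers
# roughly 120 seconds, so lines / 120 approximates events per second
for f in encore.*.log; do
    lines=$(wc -l < "$f")
    echo "$f: $lines lines (~$((lines / 120)) events/sec)"
done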

The question is: if encore is not catching up... is it falling behind? If it's not falling behind then I am willing to bet that the FMC is just delayed in its reporting from sensors. If it is falling behind then it's definitely worth further investigation to find out where the bottleneck is.

EDIT:
After running the Splunk query: sourcetype="cisco:estreamer:log" (ERROR OR WARNING) there was an error like:

2017-08-10 11:51:30,783 estreamer.subscriber ERROR EncoreException: The server returned an invalid version (0) - expected 1. This is likely to be caused by CSCve44987 which is present on both FMC 6.1.0.3 and 6.2.0.1. For more information see: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCve44987. Message={'messageType': 0, 'length': 0, 'version': 0}
Traceback (most recent call last):
  File "/opt/splunk/etc/apps/TA-eStreamer/bin/encore/estreamer/subscriber.py", line 219, in start
    self.__tryHandleNextResponse()
  File "/opt/splunk/etc/apps/TA-eStreamer/bin/encore/estreamer/subscriber.py", line 158, in __tryHandleNextResponse
    message = self.connection.response()
  File "/opt/splunk/etc/apps/TA-eStreamer/bin/encore/estreamer/connection.py", line 203, in response
    raise estreamer.EncoreException( errMsg.format( version, str(message) ))
EncoreException: The server returned an invalid version (0) - expected 1. This is likely to be caused by CSCve44987 which is present on both FMC 6.1.0.3 and 6.2.0.1. For more information see: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCve44987. Message={'messageType': 0, 'length': 0, 'version': 0}

The important bit: EncoreException: The server returned an invalid version (0) - expected 1. This is likely to be caused by CSCve44987 which is present on both FMC 6.1.0.3 and 6.2.0.1. For more information see: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCve44987

This is a known FMC bug affecting all eStreamer clients: eStreamer keeps crashing and encore keeps reconnecting and stumbling on as best it can. It's probably running only a few seconds out of every 2 minutes and is therefore very slow on average. TAC should be able to sort you out with the correct patch if you pass them the quoted exception.

0 Karma

Sp1unkin_it
Explorer

Hi there,

We are on FMC 6.1.0.4

Currently we are seeing a large delay in events being indexed into Splunk (about 30-45 minutes).

I have velocity enabled and it is wavering between -0.7x and 5.0.

Aside from increasing "MaxQueueSize", are there any other parameters we should look at modifying to balance out performance and eliminate this lag?

Or are there things we can do to see if the issue is on the FMC side?
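
For what it's worth, this is how I've been checking which tuning knobs the config actually exposes; the file path is an assumption based on a default TA-eStreamer install, and I'm pattern-matching on likely key names rather than quoting the exact ones:

# List queue/batch/period style settings in the eNcore config (path assumed)
grep -iE 'queue|batch|period' /opt/splunk/etc/apps/TA-eStreamer/bin/encore/estreamer.conf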

0 Karma

smitra_splunk
Splunk Employee

I'm having this issue too. FMC is patched, though, and there are no errors in the eStreamer client logs. Have your logs caught up in real time?

Sp1unkin_it
Explorer

Hi there smitra,

We tried a number of things and were in contact with Cisco TAC. They had not really made any progress toward a resolution when our infrastructure team migrated the servers we were using from AMD-based hardware to Intel hardware. After that, the problem went away for us.

If you have access to Cisco TAC you can get them involved. I'm sorry I don't have more information on how to resolve it. We were kind of shocked that the issue just went away as well.

0 Karma

smitra_splunk
Splunk Employee

Thanks for your reply. We had Cisco TAC patch bug CSCve44987, which fixed the issue of logs actually being sent to the eStreamer client. We still see high log volumes, but they lag in time, so clearly either the eStreamer client can't keep up with the FMC eStreamer server or the server can't provide logs to the client in real time. Based on your experience, it looks like we have to engage Cisco TAC support again.

0 Karma

Sp1unkin_it
Explorer

Just curious, are you on AMD hardware? Your FMC or your eStreamer server?

0 Karma

att35
Builder

Issue got resolved after deploying the patch for CSCve44987.

Thanks,

~ Abhi

0 Karma

sastrach
Path Finder

Great news! Glad to hear it's sorted!

0 Karma

att35
Builder

Thanks Sam.

I was interpreting monitor.period completely wrong. It has been changed back to 120.

The server receiving these events from the FMC has plenty of resources to spare (physical server with 32 cores and 500 GB RAM). Logs are being stored on a SAN drive, but the connection and speed are good; we do not see any issues there. The network connection is good as well (both the FMC and the indexer are in the same subnet).

While using the older version of this app, we had all events enabled and were able to index them in real time without any delay.

But in our case, it looks like eNcore is not even writing 10,000 lines.

[root@ data]# ls -lh
total 25M
-rw-------. 1 root root 1.7M Aug 10 09:45 encore.1502372743.log
-rw-------. 1 root root 1.4M Aug 10 09:47 encore.1502372866.log
-rw-------. 1 root root 2.9M Aug 10 09:49 encore.1502372990.log
-rw-------. 1 root root 1.1M Aug 10 09:51 encore.1502373113.log
-rw-------. 1 root root 328K Aug 10 09:54 encore.1502373237.log
-rw-------. 1 root root 252K Aug 10 09:56 encore.1502373362.log
-rw-------. 1 root root 1.1M Aug 10 09:58 encore.1502373486.log
-rw-------. 1 root root 2.4M Aug 10 10:00 encore.1502373610.log
-rw-------. 1 root root 319K Aug 10 10:02 encore.1502373733.log
-rw-------. 1 root root 319K Aug 10 10:04 encore.1502373857.log
-rw-------. 1 root root 3.6M Aug 10 10:06 encore.1502373981.log
-rw-------. 1 root root 2.2M Aug 10 10:08 encore.1502374105.log
-rw-------. 1 root root 1.3M Aug 10 10:10 encore.1502374228.log
-rw-------. 1 root root 1.3M Aug 10 10:12 encore.1502374353.log
-rw-------. 1 root root 1.2M Aug 10 10:14 encore.1502374476.log
-rw-------. 1 root root 1.2M Aug 10 10:16 encore.1502374600.log
-rw-------. 1 root root 1.3M Aug 10 10:18 encore.1502374723.log
-rw-------. 1 root root 1.2M Aug 10 10:20 encore.1502374847.log
[root@ data]#
[root@ data]# cat encore.1502374847.log | wc -l
720
[root@ data]# cat encore.1502373981.log | wc -l
1985

I'll try disabling "Connection events" and enabling less verbose logging, but I think the issue is something else.
We are getting data with a day-old timestamp: https://answers.splunk.com/answers/560550/unable-to-see-all-events-indexed-by-splunk-via-est.html

I have opened a case with Cisco TAC and provided all the logs to them. Waiting for a response.

Thanks,

~ Abhi

0 Karma

sastrach
Path Finder

Hi Abhi!

So a few questions / suggestions for you:
1. Is the data falling behind? Or remaining stable - i.e. is it always about 24 hours behind?
2. Can you run the following Splunk query and paste the results please? sourcetype="cisco:estreamer:log" (ERROR OR WARNING) - it's very odd that the files are not filling up. My expectation is that this would only happen if encore shuts down and starts up again. Now that I think about it, the input script is configured to restart every 2 minutes if it shuts down... so that may be it.
3. Which version of FMC are you running?

Sam

0 Karma

att35
Builder

It looks like the data is moving forward, but extremely slowly. For example, since restarting eNcore this morning at 9:45 Eastern, it successfully logged files until 11:53 (it stopped creating files after that for some reason; we also see this behavior every morning when all old files have been purged and the data directory is blank, but that seems to be another issue). Between 9:45 and 11:53, the timestamp only moved from August 9, 2017 10:00:10 AM to August 9, 2017 10:00:21 AM. That's just 11 seconds. At this rate, even if it's only lagging behind, eNcore will never catch up.
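
To put a number on the lag, I've also been running a quick check from the command line; the data sourcetype below is an assumption for our setup, so substitute whatever your eNcore events are actually indexed as:

# Compare event time with index time over the last 15 minutes to measure the lag
/opt/splunk/bin/splunk search 'sourcetype="cisco:estreamer:data" earliest=-15m
    | eval lag_sec = _indextime - _time
    | stats min(lag_sec) avg(lag_sec) max(lag_sec)'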

The only two errors we see repeated with each execution of the monitor (every 120 seconds) are the following:

2017-08-10 11:53:26,291 estreamer.handler INFO Error state. Clearing queue

2017-08-10 11:51:30,783 estreamer.subscriber ERROR EncoreException: The server returned an invalid version (0) - expected 1. This is likely to be caused by CSCve44987 which is present on both FMC 6.1.0.3 and 6.2.0.1. For more information see: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCve44987. Message={'messageType': 0, 'length': 0, 'version': 0}
Traceback (most recent call last):
  File "/opt/splunk/etc/apps/TA-eStreamer/bin/encore/estreamer/subscriber.py", line 219, in start
    self.__tryHandleNextResponse()
  File "/opt/splunk/etc/apps/TA-eStreamer/bin/encore/estreamer/subscriber.py", line 158, in __tryHandleNextResponse
    message = self.connection.response()
  File "/opt/splunk/etc/apps/TA-eStreamer/bin/encore/estreamer/connection.py", line 203, in response
    raise estreamer.EncoreException( errMsg.format( version, str(message) ))
EncoreException: The server returned an invalid version (0) - expected 1. This is likely to be caused by CSCve44987 which is present on both FMC 6.1.0.3 and 6.2.0.1. For more information see: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCve44987. Message={'messageType': 0, 'length': 0, 'version': 0}

We are running Software Version 6.2.0.1

Thanks,

~Abhi

0 Karma

sastrach
Path Finder

Hi Abhi,

That's the problem!

From above: EncoreException: The server returned an invalid version (0) - expected 1. This is likely to be caused by CSCve44987 which is present on both FMC 6.1.0.3 and 6.2.0.1. For more information see: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCve44987

This is a known FMC bug affecting all eStreamer clients: eStreamer keeps crashing and encore keeps reconnecting and stumbling on as best it can. It's probably running only a few seconds out of every 2 minutes. TAC should be able to sort you out with the correct patch if you pass them the quoted exception. This will resolve your other issue too with your out of date data.

Sam

0 Karma

att35
Builder

Thanks Sam.

We are working with TAC on deploying this patch.

0 Karma

att35
Builder

Just for the sake of testing, I disabled all log items on eStreamer (FMC side). We are still getting logs; the timestamp has now moved to August 10, 2017 12:25:09 PM.

The three items under Data (on the eNcore setup page) for Packets/Connection/Metadata are still checked, but I would assume that once we uncheck all log options on the FMC side, data should stop immediately.

0 Karma