Splunk Search

Reached end-of-stream error troubleshooting

RicoSuave
Builder

I sometimes receive the following error message in my search head pooling (SHP) environment (4.3.5) when executing a search:

ERROR: Reached end-of-stream while waiting for more data from peer . Search results might be incomplete.

I would like to know a few things about this message and how to find the root cause:

  • What does this message mean?
  • What are the possible reasons for this message appearing?
  • Would streaming vs. non-streaming commands have an effect here?
  • What steps should I take to troubleshoot this, and what logs would give me more insight into the error (besides splunkd.log)?
  • If I were to open a case with Support, besides a diag, what else should I provide?
1 Solution

hexx
Splunk Employee

This message is raised by the distributed search framework whenever a search peer ceases to respond and/or send data mid-stream.

The most common reason for this is that the remote search process on the search peer reported in the error has crashed.

The next steps to take in this investigation are as follows:

  • Connect to the server hosting the search peer instance reported in the error.
  • Look in $SPLUNK_HOME/var/log/splunk for any crash log files corresponding to the time at which you saw the error.
  • Look in $SPLUNK_HOME/var/log/splunk/splunkd.log around the time at which the error was observed in the UI for anything relevant recorded by the main splunkd process (an example search for doing this remotely is sketched after this list).
  • Get the search ID (SID) of the search using the search job inspector and use it to find the artifact of the remote search, which lives at $SPLUNK_HOME/var/run/splunk/dispatch/remote_$SID. In that artifact, look at the end of the search.log file for any indicators of the root cause of the crash.
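
If the peer's internal logs are searchable from your search head (splunkd.log is indexed into the _internal index by default), the splunkd.log check above can also be approximated remotely. This is only a sketch; <peer_host> stands in for the peer named in the error, and the time range should be narrowed to the window around the error:

    index=_internal source=*splunkd.log* host=<peer_host> (log_level=ERROR OR log_level=WARN)

Crash logs written to $SPLUNK_HOME/var/log/splunk are usually picked up by the same default monitor input, so a search such as index=_internal host=<peer_host> source=*crash* may surface them as well, but checking the directory on the peer itself remains the more reliable test.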


Rob
Splunk Employee

Unfortunately, this error message is very ambiguous, because it really just says that the socket Splunk was reading from was closed by something other than Splunk. How or what closed that socket, even if it was closed correctly, is something Splunk has no information about.

There are a variety of possible reasons this message can appear, and in a distributed search on older versions of Splunk it can definitely be a red herring. You may wish to double-check the results of the search against the raw events. Because of the wide variety of reasons the socket may close and spawn this warning, it's hard to say what the root cause might be: it ranges from the OS doing some clean-up, to timeout settings, to network latency, to cluster timeouts, to performance limitations, and so on.
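
One rough way to do that double-check, assuming the search returns raw events, is to break the count down by the peer that served them and compare across runs; a peer that dropped out mid-stream will usually show a lower (or zero) count on the affected run. The index and sourcetype below are placeholders:

    index=your_index sourcetype=your_sourcetype
    | stats count by splunk_server

splunk_server is a default field naming the indexer that returned each event, so a sudden drop for one peer lines up with that peer being the one named in the error message.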

Streaming and non-streaming search commands would have pretty much no effect here, since that is a different kind of "streaming" from the network stream in the error. The only exception would be a search performance issue, where making the search more performant may help avoid the error.
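
Where performance is the culprit, the usual tuning applies: push filters into the base search and drop unneeded fields before heavy transforming commands, so the peers ship back less data and finish sooner. A contrived sketch (index and field names are made up), first the slower form:

    index=web | search status=404 | stats count by clientip

and the same search with the filter moved into the base search and the field list trimmed:

    index=web status=404 | fields clientip | stats count by clientip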

Troubleshooting this error should start with checking the metrics for the search that generated the message (audit.log, metrics.log, and search.log from the dispatch directory). In other words, find out whether there is a performance issue with the search, as this is one of the most common causes of the socket being closed prematurely and this error appearing. Alongside that, splunkd.log will give an indication of what splunkd was doing at the time of the error, but look at events both before and at the time the error occurs, as there may be an underlying issue: for example, a timeout reaching a search peer, or splunkd spending a lot of time in one of its processing queues. If the error happens with every search and is a global problem, focus on finding errors in splunkd.log; if it occurs only with certain searches, investigate those search jobs.
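
Two starting-point searches, assuming the relevant internal logs are searchable from the search head and the time range is narrowed to the window around the error. The first pulls run times for completed searches from the audit trail (field names can vary slightly between versions); the second charts indexer queue fill from metrics.log, where persistently full queues point at a performance bottleneck:

    index=_audit action=search info=completed
    | table _time user total_run_time search_id

    index=_internal source=*metrics.log* group=queue
    | timechart perc95(current_size) by name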

When opening a case with Support for this error message, include a diag file and the dispatch folder from the search run that caused the warning message to be displayed. Depending on the issue, more information may be required, and the support team will request it when needed.
