Monitoring Splunk

Orphaned Processes blocking Splunkd port

bruceclarke
Contributor

All,

We had an issue recently, and I was hoping to get some clarification on exactly what happened. I have an educated guess, but I wanted to see if someone could confirm my suspicions. The details are below. Thanks!

Recently, one of my colleagues wanted to add a new app to forward Windows process information to Splunk. Within about an hour of the app being added to one of our search peers, it had forwarded almost 600 MB of data. Shortly thereafter, the search peer became unresponsive (not surprising, since it was probably indexing a ton of data). At that point I was notified and began looking into what was happening.

At first, I thought simply restarting the search peer would fix the issue. Unfortunately, I couldn't complete the restart because Splunk claimed that port 8089 was still in use. I killed all the processes listening on 8089, then tried again, but ran into the same error. A while later, I noticed a bunch of python processes being run by user "splunkut." Once I killed those, the restart went through fine and the search peer became responsive again.
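For reference, here is one quick way to see which processes are still holding the management port (a minimal sketch using the third-party psutil library; lsof -i :8089 on the box gives the same information, and 8089 is just splunkd's default management port):

    import psutil  # third-party: pip install psutil

    MGMT_PORT = 8089  # splunkd's default management port

    # Print every process that still owns a socket on the management port.
    # On Linux you may need root to see sockets owned by other users.
    for conn in psutil.net_connections(kind="inet"):
        if conn.laddr and conn.laddr.port == MGMT_PORT and conn.pid:
            try:
                proc = psutil.Process(conn.pid)
                print(conn.pid, proc.username(), proc.name(), conn.status)
            except psutil.NoSuchProcess:
                pass  # the process exited between the snapshot and the lookup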

My guess is that the following happened:

1) When my colleague installed the new application, the search peer was inundated with events that it needed to index.
2) Splunk began spawning helper processes to handle the indexing.
3) Restarting Splunk didn't stop those indexing processes (can someone shed light on why?).
4) These orphaned processes (probably still busy indexing) remained bound to port 8089, so splunkd couldn't restart and bind to the port.
5) Killing the processes manually stopped Splunk from trying to index the data and freed up port 8089 again, so I was able to restart the search peer and the daemon no longer timed out.

Sorry for the long post. Could anyone let me know if this sounds like a legitimate explanation of what happened? If not, could you explain what might have actually been happening? Thanks!

1 Solution

sciurus
Path Finder

Sounds logical. The listening socket wasn't closed after the process fork'd, so it hung around in the child processes.

I would guess that the child processes didn't stop because they're expected to be short-lived, so the "stop" procedure simply waits for them to finish their work and exit. That assumption breaks down when they hit an unexpected burst of work.
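To make that concrete, here is a minimal, Unix-only sketch in plain Python (an illustration of the general fork/socket-inheritance behavior, not Splunk's actual code; port 18089 is a harmless stand-in for 8089). The forked "helper" never closes the listening socket it inherited, so the port stays bound until the helper exits or is killed, and only then can a new process bind it again:

    import os
    import signal
    import socket
    import sys
    import time

    PORT = 18089  # stand-in for splunkd's 8089 so the demo is safe to run

    # "Daemon" opens its management listener.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", PORT))
    listener.listen(5)

    pid = os.fork()
    if pid == 0:
        # Child "helper": it inherited the listening descriptor above.
        # Because it never closes it, the kernel keeps the port bound for
        # as long as the child lives, no matter what the parent does.
        # (The fix is for the child to close() the inherited socket right away.)
        time.sleep(60)  # simulate a long burst of work
        sys.exit(0)

    # Parent "daemon" shuts down: it closes its own copy of the socket.
    listener.close()
    time.sleep(1)

    def try_bind():
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", PORT))
            print("bind succeeded")
        except OSError as err:
            print("bind failed:", err)
        finally:
            s.close()

    try_bind()                    # fails: the child still holds the socket
    os.kill(pid, signal.SIGTERM)  # "kill the orphaned process by hand"
    os.waitpid(pid, 0)
    try_bind()                    # succeeds: the last descriptor is gone

The same mechanism would explain why killing the leftover python processes freed port 8089 on your search peer.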

bruceclarke
Contributor

Thanks! This provides more context, which is what I wanted.
