Under 6.2.6 in my Search Head Cluster (SHC) environment, I am seeing the number of files in dispatch grow beyond their TTL, forcing me to constantly monitor disk usage.
The dispatch reaper does not seem to be working.
I have a cron job running clean-dispatch to try to stay ahead of this, but is this a known issue?
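To quantify the buildup before (and after) any fix, dispatch disk usage and the artifact count can be checked directly. A minimal sketch, assuming a default /opt/splunk install path (adjust SPLUNK_HOME for your environment):

```shell
# Report dispatch disk usage and count the artifact directories.
# The /opt/splunk default below is an assumption; override via SPLUNK_HOME.
DISPATCH="${SPLUNK_HOME:-/opt/splunk}/var/run/splunk/dispatch"
ARTIFACTS=$(find "$DISPATCH" -mindepth 1 -maxdepth 1 -type d 2>/dev/null | wc -l)
du -sh "$DISPATCH" 2>/dev/null
echo "dispatch artifact directories: $ARTIFACTS"
```

Running this on a schedule gives a baseline to compare against once a workaround is in place.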
This is a known issue, SPL-107610/SPL-108806, where an SHC control file called shpoolManaged prevents the dispatch reaper from performing its usual removal of search artifacts.
This issue does not affect 6.3.
Note: clean-dispatch is also affected by this issue, so some artifacts may still remain after it runs.
While a fix is anticipated in the next maintenance release, the following workaround can be implemented on your SHC members to mitigate the issue.
1) Create a script in $SPLUNK_HOME/bin/scripts called remove_SHC_control_files.sh with the content below (update the SPLUNK_HOME variable based on its location on the SH member).
2) Ensure the script belongs to the user Splunk runs as and is executable.
#!/bin/bash
# PLEASE SET THIS BEFORE RUNNING THE SCRIPT
SPLUNK_HOME=/opt/splunk
VERBOSE=0
FILE_COUNT=0
for i in $(find "$SPLUNK_HOME/var/run/splunk/dispatch" -type f -name 'shpoolManaged' | grep -v 'scheduler_')
do
    if [ "$VERBOSE" -eq 1 ]
    then
        echo "$(date) - Deleting unneeded control file $i"
    fi
    rm "$i"
    FILE_COUNT=$((FILE_COUNT + 1))
done
echo "$(date) - $FILE_COUNT control file(s) deleted in non-scheduler artifacts"
3) Perform the following steps on each SHC member:
4) Edit the script and set the SPLUNK_HOME variable to match where Splunk is installed on that member.
5) Copy the script to a location where the user Splunk runs as can run it. Recommended: $SPLUNK_HOME/bin/scripts
6) Make sure the script belongs to the user Splunk runs as and is executable.
7) As the user Splunk runs as, use crontab -e to set up a cron job that runs the script every minute.
Optionally, you can have the script log what it's doing to a file.
For Example:
[root@sup-centos3-cu splunk62_SHC3]# crontab -e
* * * * * /opt/splunk/bin/scripts/remove_SHC_control_files.sh >> /opt/splunk/var/log/splunk/SHC_control_file_removal.log
8) Verify that the script is doing its job by checking that no SHC control file named shpoolManaged remains in any non-scheduler artifact directory:
find $SPLUNK_HOME/var/run/splunk/dispatch -type f -name 'shpoolManaged' | grep -v 'scheduler_'
9) After a few minutes, the dispatch reaper should start catching up and the artifact count should return to a reasonable number.
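To confirm the reaper is catching up, the artifact count can be compared between runs. A sketch under assumptions (the SPLUNK_HOME default and the state-file location are both hypothetical; adjust for your environment):

```shell
# Compare the current dispatch artifact count against the last recorded
# value. Run this every minute or so; a falling count means the reaper
# is reclaiming artifacts. The state file path is an assumption.
DISPATCH="${SPLUNK_HOME:-/opt/splunk}/var/run/splunk/dispatch"
STATE="${TMPDIR:-/tmp}/dispatch_artifact_count.prev"
CURRENT=$(find "$DISPATCH" -mindepth 1 -maxdepth 1 -type d 2>/dev/null | wc -l)
PREVIOUS=$(cat "$STATE" 2>/dev/null || echo "$CURRENT")
echo "$CURRENT" > "$STATE"
echo "artifact count: $CURRENT (was $PREVIOUS)"
```
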
What's in index=_internal source="*splunkd.log" log_level=ERROR?
I've seen this before when a data model was accelerated via the UI but the underlying data model was later deleted manually, or something went wrong during its deletion. The only error in the logs was that it couldn't find a data model. Once we found the search that was calling the data model and removed it, everything else started working: dispatch cleared as expected, etc.
Ellen, would running step 8 validate that I have this issue in the first place? I've been battling an inflated dispatch directory after upgrading to 6.2.6 as well.
Yes, that should be a quick check in your 6.2.6 SHC environment. If shpoolManaged does not exist in the non-scheduler artifact directories, the growing dispatch should be investigated separately from this known issue.
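That check can be reduced to a single count. A sketch assuming the default install path (adjust SPLUNK_HOME as needed); a result of 0 suggests the shpoolManaged control files are absent and the dispatch growth likely has another cause:

```shell
# Count shpoolManaged control files in non-scheduler dispatch artifacts.
# 0 => this known issue is probably not the culprit.
DISPATCH="${SPLUNK_HOME:-/opt/splunk}/var/run/splunk/dispatch"
STALE=$(find "$DISPATCH" -type f -name 'shpoolManaged' 2>/dev/null | grep -v 'scheduler_' | wc -l)
echo "shpoolManaged files in non-scheduler artifacts: $STALE"
```
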
Thanks Ellen!!