Hello,
Has anyone seen events like this before? (looking at index=_internal):
WARN CMSlave - event=populateSummaryInfo got unknown_state summary=/splunk-data/hot-warm/sep/datamodel_summary/152_4A0854A5-EE41-403E-8D6C-C8A7ED6FE0BA/0D23E24D-41E9-464F-9AC4-72E9B27741C3/DM_Splunk_SA_CIM_Network_Traffic
WARN CMSlave - event=populateSummaryInfo got unknown_state summary=/splunk-data/hot-warm/paloalto/datamodel_summary/763_5E0FBB73-C7A1-4769-960C-315973743080/0D23E24D-41E9-464F-9AC4-72E9B27741C3/DM_Splunk_SA_CIM_Network_Traffic
Would anyone know what they mean? I have looked everywhere for an answer, but I have not found one yet.
Not sure what relevant info to include, so here is my stab at it:
Thanks,
Daniel Forester
This problem appears to be connected to out-of-date and/or incorrect manifest.csv files that live under the datamodel_summary directories in every index directory.
Remove those files with a find/exec rm. Splunk will rebuild them within a few minutes, and this does not require a restart of the indexer or any cluster maintenance mode. The error message should then cease. If it doesn't, try removing them a second time.
In our case, it also led to incomplete data models, because Splunk thought the datamodel volume sizes were, in some cases, 10x their actual size.
Here's SPL to find this error. Note the two spaces between "WARN" and "CMSlave":
index=_internal sourcetype=splunkd "WARN  CMSlave" *datamodel_summary*
It seems that it can take multiple removals of the manifest.csv files to persuade some indexers to rewrite them correctly.
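To check directly on an indexer whether the warnings persist after a removal, the same message can be counted per index directory from the log file. This is only a sketch: `count_unknown_state` is a hypothetical helper name, and it assumes the log lines have the shape quoted in the original question (a `summary=` path containing `/datamodel_summary/`).

```shell
# Hypothetical helper (not part of Splunk): count "got unknown_state"
# warnings per index directory in the splunkd log file given as $1.
count_unknown_state() {
    # Keep only the warning in question, cut the path down to the part
    # before /datamodel_summary/, then count occurrences per index path.
    grep 'populateSummaryInfo got unknown_state' "$1" \
        | sed -n 's|.*summary=\(.*\)/datamodel_summary/.*|\1|p' \
        | sort | uniq -c | sort -rn
}
```

On a default install it would probably be run as `count_unknown_state "$SPLUNK_HOME/var/log/splunk/splunkd.log"`. Indexes that keep showing up after the manifest.csv files were removed are the ones that may need a second removal.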
This rest call shows _splunk_summaries volume sizes:
| rest splunk_server=<something to identify your indexers> /services/data/index-volumes
| fields splunk_server, title, total_size, max_size, volume_path
| search title=_splunk_summaries
| eval total_size_gb = if(isnull(total_size), "-", round(total_size / 1024, 2))
| eval max_size_gb = if(isnull(max_size) OR max_size = "infinite", "unlimited", round(max_size / 1024, 2))
| eval disk_usage_gb = total_size_gb
| fields splunk_server, title, disk_usage_gb, max_size_gb, volume_path
| sort - disk_usage_gb
| rename title as Volume, disk_usage_gb as "Volume Usage (GB)", max_size_gb as "Volume Capacity (GB)", volume_path as "Volume Path"
Based on what you last wrote, I compared the manifest.csv against the datamodel_summary folder on one of my indexers and found that they did not completely match: some entries were missing and some were extra. However, I have yet to remove the manifest.csv from any indexer because there have not been any issues related to this in the past 2 days. Not that it won't happen again - I trust it will, since it has been a sporadic issue thus far.
From this point on, I will be monitoring the system very closely so when it does happen again I will take the action you prescribed. I will post an update with the outcome thereafter. By the way, I really appreciate the SPL you included in your post. It helped immensely.
So my approach to correcting this behavior would be to remove the offending manifest.csv, within the datamodel_summary folder, only on the indexer(s) that exhibit the issue for a particular data source (index). Would this have been your approach?
My approach would be to blow them all away. They are regenerated nearly immediately, there is no side effect or problem with blowing them away, and it's easy. Note that there are some manifest.csv files that live outside the datamodel_summary directories, so it's important not to remove those. If you have concerns, you can move the files aside rather than delete them, but I'm not sure that works as well: if Splunk has a filehandle open on a file, it may keep updating the moved copy.
find "$SPLUNK_HOME"/var/lib/splunk/*/datamodel_summary -type f -name manifest.csv -exec rm {} \;
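A more cautious variant of the command above, as a sketch only: list the matching files first, and delete them only when asked to. The function name `clean_dm_manifests` and the `--delete` flag are made up for this example. The root directory is passed in because index paths differ between installs (in the original question the indexes live under /splunk-data/hot-warm rather than $SPLUNK_HOME/var/lib/splunk).

```shell
# Hypothetical helper (not part of Splunk): list, or with --delete remove,
# manifest.csv files that sit under datamodel_summary directories only.
# Usage: clean_dm_manifests <index_root> [--delete]
clean_dm_manifests() {
    root="$1"
    mode="${2:-}"
    if [ ! -d "$root" ]; then
        echo "not a directory: $root" >&2
        return 1
    fi
    if [ "$mode" = "--delete" ]; then
        # The -path filter leaves manifest.csv files outside
        # datamodel_summary (for example in bucket directories) untouched.
        find "$root" -type f -name manifest.csv \
            -path '*/datamodel_summary/*' -exec rm -f {} +
    else
        # Dry run: only print what would be removed.
        find "$root" -type f -name manifest.csv \
            -path '*/datamodel_summary/*' -print
    fi
}
```

Running `clean_dm_manifests /splunk-data/hot-warm` first and reviewing the list, then repeating with `--delete`, gives a chance to confirm that no manifest.csv outside datamodel_summary is touched.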
Thanks for your reply andygerberkp,
Your response has helped me get closer to another underlying issue that seems to be related to the datamodel_summary directory storage space.
See: (https://community.splunk.com/t5/Getting-Data-In/maxVolumeDataSizeMB-Setting-Precedence/m-p/504219#M8...)
I am observing that indexer storage volume for hot-warm (not cold) is not fully utilized. It's as if there was a setting that limits the data from using the prescribed space. Because of this I sometimes see some data roll to frozen from hot-warm. So in a nutshell, the hot-warm volume is only using ~950 GB of 5664 GB at all times.
Anyway, not to be long-winded, but my assumption is that the datamodel_summary data may be getting purged prematurely, so the manifest file cannot keep up and still lists summary data that is no longer on disk. In that condition, I suppose, the warning (WARN CMSlave - event=populateSummaryInfo got unknown_state) is logged.
I am curious - Once you got rid of the manifest file(s), was the problem gone for good or have you had to repeat the process periodically?
As of now we have not had to repeat the process; I think the problem existed for a very long time before we detected it. The latest from Splunk Support is that a patch for this issue will be included in a 7.3.x version.
Update: The issue appears to have resolved itself overnight. Not sure what caused it or how it resolved itself. If I knew, I would have shared it with everyone. Thanks.
No updates to the issue thus far. I still have the problem. I re-checked the file and folder permissions for one of the errors within the "hot-warm" volume on one of my indexers and everything seems ok. The full path is also there for the datamodel_summary folder. The issue still seems to be sporadic - as if it skips writing to disk at times.
Now that you mention volume size... I checked that as well. I have a global parameter that limits the total datamodel_summary size to 100 GB on the indexers (indexes.conf).
My thoughts are that you are on to something. Daily data ingestion rate for the system is ~ 180GB so I am thinking that the summary folder size should be bigger as well? It appears as if the "skip" may be related to the summary folder size as you mention - to where it skips writing the summary data when there is not enough disk space at that particular time. I now need to research this to see if there is a possible allowance or restriction related to disk sizes for the datamodel_summary.
By the way, I am now on Splunk v7.3.1.
See my answer below. I think this is the root cause. Would be curious to see if you have similar experience.
My apologies for prematurely posting an answer. I still have the issue. I happened to check my logs for this problem today and found 81 events in the last 24 hours. All events appear at random time intervals with an average of 15 events for each occurrence.
By the way, to provide a minor update to the first post, the network traffic data model build has completed and is at 100%.
Any updates to this issue? I'm seeing this - I think it relates to incorrect volume sizes reported for _splunk_summaries and missing data model information.
Were you able to find the root cause of it?
I have not. However, it seems to be less since we upgraded from 7.1.3 to 7.3.1. When I last checked, I only had two warnings in a 24hr period.
Have you found any errors at that time related to bundle replication, or errors like these:
WARN S2SFileReceiver - error alerting slave about summary
WARN S2SFileReceiver - event=onFlushReceived
No, I have not seen that error in the logs within the same time frame (or further).