Knowledge Management

Does anyone know what the following events mean? "WARN CMSlave - event=populateSummaryInfo got unknown_state"

foresterd
Loves-to-Learn

Hello,

Has anyone seen events like this before? (looking at index=_internal):

WARN CMSlave - event=populateSummaryInfo got unknown_state summary=/splunk-data/hot-warm/sep/datamodel_summary/152_4A0854A5-EE41-403E-8D6C-C8A7ED6FE0BA/0D23E24D-41E9-464F-9AC4-72E9B27741C3/DM_Splunk_SA_CIM_Network_Traffic

WARN CMSlave - event=populateSummaryInfo got unknown_state summary=/splunk-data/hot-warm/paloalto/datamodel_summary/763_5E0FBB73-C7A1-4769-960C-315973743080/0D23E24D-41E9-464F-9AC4-72E9B27741C3/DM_Splunk_SA_CIM_Network_Traffic

Would anyone know what they mean? I have looked everywhere for an answer, but I have not found it yet.

Not sure what relevant info to include, so here is my stab at it:

  • There are 10 indexers in this setup (clustered), and every indexer is exhibiting similar events.
  • I do not know when this first happened; I just noticed it (see the search sketched after this list).
  • I have a search head (for ES) that has the data models restricted to specific indexes and tags. All accelerated models are at 100% complete with the exception of Network Traffic, which is at 87% and building very slowly. Not sure if I should rebuild or wait.
  • I doubt it is a permissions issue, since Splunk is able to write indexed data to the directory (/splunk-data/hot-warm).
  • Splunk version is 7.1.3 on Red Hat Linux 7.3 & 7.4, 64-bit.
  • From my management console I see the directory (/splunk-data/hot-warm) is not full (only using ~625 GB of ~6 TB).
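
For reference, a search along these lines should show when this started and which indexers are affected (just a sketch; adjust the time range as needed):

index=_internal sourcetype=splunkd "CMSlave" "populateSummaryInfo" "unknown_state" earliest=-30d
| timechart span=1d count by host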

Thanks,

Daniel Forester

0 Karma

andygerberkp
Explorer

This problem appears to be connected to out-of-date and/or incorrect manifest.csv files that live under the datamodel_summary directories in every index directory.

Remove those files with a find/exec rm. Splunk rebuilds them within a few minutes, and it does not require a restart of the indexers or any cluster maintenance mode. The warnings should then cease. If they don't, try removing the files a second time.

In our case it also led to incomplete data models, because Splunk thought the data model volume sizes were, in some cases, 10x their actual size.

Here's SPL to find this error. Note the two spaces between "WARN" and "CMSlave":

index=_internal  sourcetype=splunkd "WARN  CMSlave" *datamodel_summary*

It seems that it can take multiple removals of the manifest.csv files to persuade some indexers to rewrite them correctly.

This REST call shows the _splunk_summaries volume sizes:

| rest splunk_server=<something to identify your indexers> /services/data/index-volumes
| fields splunk_server, title, total_size, max_size, volume_path
| search title=_splunk_summaries
| eval total_size_gb = if(isnull(total_size), "-", round(total_size / 1024, 2))
| eval max_size_gb = if(isnull(max_size) OR max_size = "infinite", "unlimited", round(max_size / 1024, 2))
| eval disk_usage_gb = total_size_gb
| fields splunk_server, title, disk_usage_gb, max_size_gb, volume_path
| sort - disk_usage_gb
| rename title as Volume, disk_usage_gb as "Volume Usage (GB)", max_size_gb as "Volume Capacity (GB)", volume_path as "Volume Path"
0 Karma

foresterd
Loves-to-Learn

Based on what you last wrote, I compared the manifest.csv against the contents of the datamodel_summary folder on one of my indexers and found it did not completely match: some info was missing and some was extra. However, I have yet to remove the manifest.csv from any indexer(s), because there have not been any issues related to this in the past 2 days. Not that it won't happen again; I expect it will, since it has been a sporadic issue thus far.

From this point on, I will be monitoring the system very closely so when it does happen again I will take the action you prescribed. I will post an update with the outcome thereafter. By the way, I really appreciate the SPL you included in your post. It helped immensely.

So my thought process on correcting this behavior would be to remove the offending csv within the datamodel_summary folder, only on the indexer(s) that exhibit the issue for a particular data source (index). Would this have been your approach?
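
To identify the offenders first, I would run something along these lines (just a sketch; summary_path is simply my name for the extracted field):

index=_internal sourcetype=splunkd "CMSlave" "populateSummaryInfo" "unknown_state"
| rex "summary=(?<summary_path>\S+)"
| stats count by host, summary_path
| sort - count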

0 Karma

andygerberkp
Explorer

My approach would be to blow them all away. They are regenerated nearly immediately, there is no side effect or problem with blowing them away, and it's easy. Note that some manifest.csv files live outside the datamodel_summary directories, so it's important not to remove those. If you have concerns you can move them sideways rather than delete them, but I'm not sure that works as well: if Splunk has a file handle open on the file, it may keep updating the moved file.

find $SPLUNK_HOME/var/lib/splunk/*/datamodel_summary -type f -name manifest.csv -exec rm {} \;
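
If you do want to try the move-sideways option instead of deleting, something like this should work with GNU find (a sketch; the .bak suffix is arbitrary, and adjust the base path if your indexes live elsewhere, e.g. /splunk-data/hot-warm):

# move each manifest.csv aside instead of removing it (GNU find also substitutes {} inside {}.bak)
find $SPLUNK_HOME/var/lib/splunk/*/datamodel_summary -type f -name manifest.csv -exec mv {} {}.bak \;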
0 Karma

foresterd
Loves-to-Learn

Thanks for your reply andygerberkp,

Your response has helped me get closer to another underlying issue that seems to be related to the datamodel_summary directory storage space.
See: (https://community.splunk.com/t5/Getting-Data-In/maxVolumeDataSizeMB-Setting-Precedence/m-p/504219#M8...)

I am observing that the indexer storage volume for hot-warm (not cold) is not fully utilized. It's as if there were a setting that prevents the data from using the prescribed space. Because of this I sometimes see data roll from hot-warm to frozen. In a nutshell, the hot-warm volume only ever uses ~950 GB of 5664 GB.

Anyway, not to be long-winded, but my assumption is that the datamodel_summary data may be getting purged prematurely, so the manifest file cannot keep up and still thinks the data it seeks is available when it really is not. I suppose that condition is what generates the warning (WARN CMSlave - event=populateSummaryInfo got unknown_state).
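
One way to sanity-check that theory would be to dump the effective index settings on an indexer and look for a cap on the summaries volume (a sketch; the grep pattern is only illustrative):

# list effective indexes.conf settings and filter for volume definitions and volume caps
$SPLUNK_HOME/bin/splunk btool indexes list --debug | grep -iE "volume:|maxVolumeDataSizeMB"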

I am curious: once you got rid of the manifest file(s), was the problem gone for good, or have you had to repeat the process periodically?

0 Karma

andygerberkp
Explorer

As of now we have not had to repeat the process; I think the problem existed for a very long time before we detected it. The latest from Splunk Support is that a patch for this issue will be in a 7.3.x version.

0 Karma

foresterd
Loves-to-Learn

Update: The issue appears to have resolved itself overnight. Not sure what caused it or how it resolved itself. If I knew, I would have shared it with everyone. Thanks.

0 Karma

foresterd
Loves-to-Learn

No updates to the issue thus far. I still have the problem. I re-checked the file and folder permissions for one of the errors within the "hot-warm" volume on one of my indexers and everything seems OK. The full path for the datamodel_summary folder is also there. The issue still seems to be sporadic, as if it skips writing to disk at times.

Now that you mention volume size... I checked that as well. I have a global parameter in indexes.conf that limits the total datamodel_summary size to 100 GB on the indexers.
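
For reference, the kind of stanza I mean looks roughly like this (a sketch only; the volume name and path may differ in your environment, and 102400 MB = 100 GB):

# indexes.conf on the indexers (illustrative sketch, not my exact config)
[volume:_splunk_summaries]
path = /splunk-data/hot-warm
maxVolumeDataSizeMB = 102400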

My thought is that you are on to something. The daily data ingestion rate for the system is ~180 GB, so I am thinking the summary size limit should be bigger as well. It appears the "skip" may be related to the summary folder size, as you mention: it skips writing the summary data when there is not enough space available at that particular time. I now need to research whether there is an allowance or restriction related to disk sizes for the datamodel_summary.

By the way, I am now on Splunk v7.3.1.

0 Karma

andygerberkp
Explorer

See my answer below. I think this is the root cause. I would be curious to see if you have a similar experience.

0 Karma

foresterd
Loves-to-Learn

My apologies for prematurely posting an answer. I still have the issue. I happened to check my logs for this problem today and found 81 events in the last 24 hours. All events appear at random time intervals with an average of 15 events for each occurrence.

By the way, to provide a minor update to the first post, the network traffic data model build has completed and is at 100%.

0 Karma

andygerberkp
Explorer

Any updates to this issue? I'm seeing this as well; I think it relates to incorrect volume sizes reported for _splunk_summaries and missing data model information.

0 Karma

nagendra1111
New Member

Were you able to find the root cause of it?

0 Karma

foresterd
Loves-to-Learn

I have not. However, it seems to happen less often since we upgraded from 7.1.3 to 7.3.1. When I last checked, I only had two warnings in a 24-hour period.

0 Karma

nagendra1111
New Member

Have you found any errors around that time related to bundle replication, or errors like these?

WARN S2SFileReceiver - error alerting slave about summary
WARN S2SFileReceiver - event=onFlushReceived
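
Something like this should surface them if they are there (just a sketch):

index=_internal sourcetype=splunkd "S2SFileReceiver" ("error alerting slave about summary" OR "event=onFlushReceived")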

0 Karma

foresterd
Loves-to-Learn

No, I have not seen those errors in the logs within the same time frame (or beyond).

0 Karma