After running a stop-all and start-all to restart, various topics/models have gone into a Pending state and are not receiving data. The data source connectors also appear to have stopped ingesting: no EPS is displayed on the home page, and the number of processed events is not growing.
A stop-all/start-all has been run several times since the issue started in an attempt to rectify the problem.
What should I look into further?
If you see the following errors in the output of the health check scripts:
Errors:
kubelet journalctl:
kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
kubelet_node_status.go:82] Attempting to register node YOUR_FQDN_SERVERNAME
kubelet_node_status.go:106] Unable to register node "YOUR_FQDN_SERVERNAME" with API server: nodes "YOUR_FQDN_SERVERNAME" is forbidden: node "short_hostname" cannot modify node "YOUR_FQDN_SERVERNAME"
kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
kubelet_node_status.go:82] Attempting to register node YOUR_FQDN
Output of health check:
concern summary: Checking YOUR_FQDN_SERVERNAME first ...
McAfee_ePO | Splunk | Processing | | SPLUNK/DIRECT | 2019-01-15 05:18:10 | 0 | 0 | 0 | 949874 | 0 | 485746 | 49 | <== significant failed/skipped events; review datasource SPL
eps: 0 <== no response from redis 'YOUR_FQDN'
status not OK ... <== check UI system health monitor for errors
splunkuba analyticsaggregator-rc-v6csh 0/1 Pending 0 14m <none> <none> <== pod 'analyticsaggregator-rc-v6csh' is 'Pending'; not 'Running'
splunkuba analyticsviewsbuilder-rc-nwffg 0/1 Pending 0 14m <none> <none> <== pod 'analyticsviewsbuilder-rc-nwffg' is 'Pending'; not 'Running'
splunkuba analyticswriter-rc-jscjf 0/1 Pending 0 14m <none> <none> <== pod 'analyticswriter-rc-jscjf' is 'Pending'; not 'Running'
splunkuba anomalyaggregationmodel-rc-j968d 0/1 Pending 0 14m <none> <none> <== pod 'anomalyaggregationmodel-rc-j968d' is 'Pending'; not 'Running'
splunkuba devicetopic-modelgroup01-rc-d5ljq 0/1 Pending 0 14m <none> <none> <== pod 'devicetopic-modelgroup01-rc-d5ljq' is 'Pending'; not 'Running'
splunkuba devicetopic-modelgroup01-rc-z5mzc 0/1 Pending 0 14m <none> <none> <== pod 'devicetopic-modelgroup01-rc-z5mzc' is 'Pending'; not 'Running'
splunkuba domaintopic-modelgroup01-rc-8pvcx 0/1 Pending 0 14m <none> <none> <== pod 'domaintopic-modelgroup01-rc-8pvcx' is 'Pending'; not 'Running'
splunkuba domaintopic-modelgroup01-rc-mmzhw 0/1 Pending 0 14m <none> <none> <== pod 'domaintopic-modelgroup01-rc-mmzhw' is 'Pending'; not 'Running'
...
The problem is the way the hostnames are set, using FQDNs: the kubelet registers under the FQDN, but the Kubernetes Node authorizer only allows a node to modify the Node object matching its own name, so registration is forbidden (as shown by the "cannot modify node" error above).
For example:
hostnamectl status: Static hostname: YOUR_FQDN_SERVERNAME
/etc/hosts (on the master node):
127.0.0.1 localhost
IP_address YOUR_FQDN_SERVERNAME (e.g. uba1.splunk.com)
IP_address YOUR_FQDN_SERVERNAME (e.g. uba2.splunk.com)
IP_address YOUR_FQDN_SERVERNAME (e.g. uba3.splunk.com)
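To confirm the mismatch on a node before changing anything, you can compare the static hostname against the short hostname. This is a rough illustrative check, not a UBA tool; the is_fqdn helper is an assumption (it simply treats any name containing a dot as an FQDN):

```shell
# Illustrative helper: a hostname containing a dot is treated as an FQDN here.
is_fqdn() {
  case "$1" in
    *.*) return 0 ;;
    *)   return 1 ;;
  esac
}

static=$(hostnamectl --static)
short=$(hostname -s)
if is_fqdn "$static"; then
  echo "static hostname '$static' is an FQDN; kubelet may register under it"
  echo "short hostname is '$short'"
fi
```

If the static hostname is an FQDN while kubelet's credentials carry the short name, the registration failure in the journalctl output above is expected.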
To resolve the issue:
1. Check the current status of the nodes and pods
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -o wide --all-namespaces
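When many pods are affected, it can be quicker to count just the Pending ones instead of scanning the full listing. A small filter over the same get pods command (in --all-namespaces output, STATUS is the fourth column):

```shell
# Count pods stuck in 'Pending'; KUBECTL mirrors the command used above.
KUBECTL="sudo kubectl --kubeconfig /etc/kubernetes/admin.conf"
$KUBECTL get pods --all-namespaces --no-headers | awk '$4 == "Pending"' | wc -l
```

A non-zero count after several minutes suggests the pods are not merely slow to schedule.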
2. Stop containers and services
/opt/caspida/bin/Caspida stop-containers
/opt/caspida/bin/Caspida stop-container-services  # not available in 4.1; added in 4.2
3. Check the current status of the nodes and pods
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -o wide --all-namespaces
4. Stop kubelet and docker on all nodes (the first command runs locally on the master node)
sudo service kubelet stop && sudo service docker stop
ssh uba2 "sudo service kubelet stop && sudo service docker stop"
ssh uba3 "sudo service kubelet stop && sudo service docker stop"
5. Check the current status of the nodes and pods
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -o wide --all-namespaces
6. Update the hostnames to shortnames (the first command runs locally on the master node)
sudo hostnamectl set-hostname uba1
ssh uba2 "sudo hostnamectl set-hostname uba2"
ssh uba3 "sudo hostnamectl set-hostname uba3"
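Before restarting services, it may be worth confirming that every node now reports a shortname. A minimal loop over the example node names used above (uba1-uba3 are placeholders for your own shortnames):

```shell
# Report whether each node's static hostname is now a shortname (no dots).
for host in uba1 uba2 uba3; do
  name=$(ssh "$host" hostnamectl --static)
  case "$name" in
    *.*) echo "$host: still an FQDN: $name" ;;
    *)   echo "$host: OK ($name)" ;;
  esac
done
```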
7. Restart docker and kubelet on all nodes
sudo service docker start && sudo service kubelet start
ssh uba2 "sudo service docker start && sudo service kubelet start"
ssh uba3 "sudo service docker start && sudo service kubelet start"
8. Start containers and services
/opt/caspida/bin/Caspida start-container-services  # not available in 4.1; added in 4.2
/opt/caspida/bin/Caspida start-containers
9. Check the current status of the nodes and pods
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -o wide --all-namespaces
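After the restart, pods usually take a few minutes to leave Pending. A simple polling sketch that waits until no pod reports a non-Running status (the 30 tries and 10-second interval are arbitrary values, not UBA defaults):

```shell
KUBECTL="sudo kubectl --kubeconfig /etc/kubernetes/admin.conf"
for i in $(seq 1 30); do
  # Count pods whose STATUS (column 4 in --all-namespaces output) is not 'Running'.
  left=$($KUBECTL get pods --all-namespaces --no-headers | awk '$4 != "Running"' | wc -l | tr -d ' ')
  if [ "$left" -eq 0 ]; then
    echo "all pods are Running"
    break
  fi
  echo "still waiting: $left pod(s) not Running"
  sleep 10
done
```

Once all pods are Running, recheck EPS on the home page and confirm the processed event count is growing again.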