RUCKUS SmartZone controller cluster is out of service / services are offline / cannot access the web UI
Summary
A SmartZone node can go out of service for several reasons, such as one or more applications not being online, loss of network reachability, or abnormally high CPU, memory, or disk usage. Configurer.log can be used to investigate which service caused the node to go out of service. When more than one node goes out of service, the cluster goes out of service as well. The remedial step depends on the cause; however, restarting services is the general resolution when a node is out of service because one or more applications are offline. If the cause is network reachability, check that all interfaces are up and can communicate with the other nodes' interfaces.
Question
How to troubleshoot a SmartZone cluster or node that is out of service?
Customer Environment
Any version of SmartZone or Virtual SmartZone
Root Cause
N/A
Symptoms
Symptoms vary from case to case; the following are some common ones (but not limited to):
- Alarm shows "Not all management services on node are ready"
- Alarm shows "Node [{nodeName}] in cluster [{clusterName}] is out of service. Reason:[{reason}]."
- Alarm shows "clusterInMaintenanceState"
- One of the control plane nodes shows offline in the Cluster menu on the GUI
- CLI is accessible but the web UI does not come up.
Troubleshooting Steps
A cluster is made up of one or more SmartZone nodes; a minimum of 1 and a maximum of 4 nodes can be grouped into a cluster. It follows an N+1 redundancy mechanism, meaning it can tolerate the failure of only one node. For example, a 3-node cluster stays in service if one node fails, but goes out of service if two fail. A node that fails is called an out-of-service node, and if more than one node fails, the cluster state changes to Out of Service. Let's look at the different scenarios you may face and the required troubleshooting:
Case 1 – Node out of service in multi-node cluster
- First, check whether the node flagged as Out of Service is reachable from the other nodes. If there is no reachability between the nodes, check the following; otherwise, proceed to the next step.
- Whether the affected node is powered up
- Whether all interfaces (Control, Cluster, and Management) are up
- If console access is available, try pinging the gateway and running a traceroute to spot network connectivity issues, as in the sketch below.
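A minimal reachability check from the node console, assuming your release exposes ping and traceroute in the CLI; 192.168.10.1 (gateway) and 192.168.10.54 (peer node interface) are placeholder addresses:
vSZ-E-45# ping 192.168.10.1        (replies confirm the gateway is reachable)
vSZ-E-45# traceroute 192.168.10.54 (the hop where replies stop shows where the path to the peer breaks)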
- If reachability is fine, log in to the affected node using SSH or the console (SSH preferred) and check the current node status with the command "show cluster-state":
vSZ-E-45# show cluster-state
Current Management Service Status : Out of Service
Current Node Status : Out of Service
Cluster Status : In service
Cluster Operation : None
System Mode : None
Cluster Node Information
----------------------------------------------------------------
No.   Name                     Role
----- ------------------------ -----------
1     vSZ-E-54-C               FOLLOWER
2     vSZ-E-45-C               LEADER
- If the Current Node Status is "In Service", the node is healthy. If it shows "Out of Service", check the status of all services on the node by running the command "show service". In a healthy state, all services show "online". If one or more services show "offline", try restarting services by issuing the command "service restart"; this triggers a restart of all services, and it takes around 10-20 minutes for everything to come back online. An illustrative session follows below.
Note - Please ensure you are running this command on the affected SmartZone node after verifying its IP address, MAC address, or serial number. For a single-node cluster, use this command cautiously, as it restarts all services on the SmartZone.
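The session below is a hypothetical sketch; the actual service names and output layout vary by SmartZone release:
vSZ-E-45# show service
Web              Online
Configurer       Online
Communicator     Offline    (an offline service like this can take the node out of service)
vSZ-E-45# service restart
(all services restart; wait 10-20 minutes, then re-run "show service" until every service shows Online)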
- If one or more services still do not come online, check whether the system mode is "Maintenance" or "Crash" by running the "show cluster-state" command and checking the "System Mode" field. If the system mode shows either "Maintenance" or "Crash", collect the following logs/data and open a support case with RUCKUS Support:
- Go to Monitor > Troubleshooting & Diagnostics > Application logs > select the control plane > change the logging level of Configurer and Core to Debug and save. Wait a few minutes and then download the snapshot logs. Repeat this for all nodes/control planes in the cluster. After the logs have been collected, change the logging level back to Warning for Configurer and Core.
- "show service", "show cluster-state", "show disk", and "show mem" CLI output from the affected node
- A record of any recent network changes, power-outage events, etc.
- If the System Mode is neither "Maintenance" nor "Crash" and services are still not coming up, verify that there is enough free memory and disk space on the node by running the "show disk" and "show mem" CLI commands (illustrative output below). If you see very high memory or disk usage on a Virtual SmartZone, shut down the node and increase its resources as per the Virtual SmartZone Getting Started Guide available on the support.ruckuswireless.com portal.
Note - Please ensure that you are doing this on the correct node. Match the serial number, IP address, etc. before proceeding.
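A hypothetical excerpt of what to look for; the exact output format differs by release, but the figures that matter are the used/free values:
vSZ-E-45# show disk
/dev/sda3   250G total, 230G used, 20G free      (a nearly full disk can keep services from starting)
vSZ-E-45# show mem
Total: 16384 MB   Used: 15900 MB   Free: 484 MB  (sustained near-full memory on a vSZ calls for more vRAM)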
- Verify whether a power outage preceded this event. In such cases, the VM or its database sometimes becomes corrupted.
- If none of the above brings the node back in service, please open a support case with the following logs/information:
- Snapshot logs with Configurer and Core set to Debug from all nodes, collected as described in the earlier step
- "show service", "show cluster-state", "show disk", and "show mem" CLI output from the affected node
- A record of any recent network changes, power-outage events, etc.
Case 2 – Cluster state is out of service in multi-node cluster
If more than one node in a cluster goes out of service, the whole cluster becomes Out of Service, because the cluster works on N+1 redundancy. In this case, troubleshoot in the same manner as in the previous case for a node out of service, except that it must be done individually for each out-of-service node.
Case 3 – Cluster state is out of service in Single node cluster
A single-node cluster has no redundancy, so if the node goes out of service, the cluster state also goes out of service. It needs the same troubleshooting as a node out of service.
Case 4 – Cluster is in NetworkPartitionSuspected mode
This is seen only in a 2-node cluster. The follower node's cluster state shows as "NetworkPartitionSuspected". In general, it can be caused by the following:
- The cluster network connection has been broken; the two nodes cannot communicate with each other via the cluster interface.
- The management or control network might be down as well, but not all of the network interfaces are down.
APs remain connected to their original node and do not fail over in this state.
The cluster state of the leader node is In Service, while that of the follower node is NetworkPartitionSuspected.
In this scenario, the first thing to do is restore network connectivity between the nodes, as in the sketch below. Please do not change any configuration on any node, and do not perform any backup or snapshot restore while in this state.
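A minimal connectivity check, assuming 10.10.10.45 and 10.10.10.54 are the (placeholder) cluster-interface IPs of the two nodes:
vSZ-E-45# ping 10.10.10.54    (from the leader toward the follower's cluster interface)
vSZ-E-54# ping 10.10.10.45    (from the follower toward the leader's cluster interface)
Once both pings succeed, the NetworkPartitionSuspected state should clear as the nodes resynchronize over the cluster network.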
Resolution
The resolution will vary based on the troubleshooting explained above. However, there are some common resolutions to recover from the issue:
- The "service restart" CLI command to restart all the services on the affected SmartZone node.
Note - Please ensure you are running this command on the affected SmartZone node after verifying its IP address, MAC address, or serial number.
- Increasing disk or memory on virtual machines found running at maximum capacity
- Opening a support case with RUCKUS after collecting all the required data as explained in the Troubleshooting section
Article Number:
000014380
Updated:
October 09, 2024 01:25 PM
Tags:
Troubleshooting, SZ144, SZ300, SCG200, virtual SmartCell Gateway, SZ100