If the ESXi hosts in your cluster keep going into a Not Responding or Disconnected state, check the following immediately: the DNS addresses on the ESXi hosts, the hosts files on the ESXi hosts, the Managed IP Address setting in vCenter, and the hosts file on the vCenter Server.
Check that the DNS addresses on each ESXi host's management interface point to the DNS servers holding the A records for the ESXi hosts and for the vCenter Server itself. To confirm they're not pointing at the wrong DNS servers, log on to the unsupported console and perform an nslookup of the vCenter Server.
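For example, from the unsupported console (the hostnames and address below are placeholders; substitute your own):

```shell
# From the ESXi unsupported console (Tech Support Mode).
nslookup vcenter01.example.local   # forward lookup of the vCenter Server
nslookup esx02.example.local       # forward lookup of a neighbouring host
nslookup 192.168.10.5              # reverse lookup of vCenter's IP
cat /etc/resolv.conf               # confirm which DNS servers are in use
```

If any of these fail or return stale addresses, fix DNS before chasing anything else.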
Name resolution that depends on an external service is a single point of failure, so bolster its resilience with hosts files at either end: on each ESXi host and on the vCenter Server.
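As a sketch (every name and address below is a made-up example), the hosts file on each ESXi host (/etc/hosts) and on the vCenter Server (C:\Windows\System32\drivers\etc\hosts) would carry entries for every party, so each host can resolve the others and vCenter with DNS down:

```text
# Example hosts file entries - substitute your own names and addresses
192.168.10.5    vcenter01.example.local   vcenter01
192.168.10.11   esx01.example.local       esx01
192.168.10.12   esx02.example.local       esx02
192.168.10.13   esx03.example.local       esx03
```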
From the vSphere Client, log in to vCenter Server.
Navigate to Administration > vCenter Server Settings > Runtime Settings and review the Managed IP Address setting.
Verify that the address is correct (use ipconfig on the vCenter Server to discover the IP address of its management (v)LAN NIC). Check every octet for correctness; be aware that the octets may not match those of the ESXi hosts, so check the design document or consult the infrastructure architect.
Correct the entry and click OK to save your changes and close the dialog.
Restart the ESXi host(s) if they have locked up; if not, reconnect them to vCenter.
If you need to restart a host and the vCenter Server is running as a VM on that host, first try to connect to the vCenter Server over RDP and shut it down cleanly.
Note: once the ESXi host has restarted, you can power on the in-band vCenter VM using the instructions here: http://www.cyberfella.co.uk/2012/05/01/booting-vm-from-console/
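In short (VM IDs will differ per host; 42 below is just an example), the console method boils down to:

```shell
# From the ESXi console: find the VM's ID, then power it on.
vim-cmd vmsvc/getallvms            # lists VM IDs and names
vim-cmd vmsvc/power.getstate 42    # 42 = example VM ID from the list above
vim-cmd vmsvc/power.on 42
```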
Disable HA on the Cluster.
Put the ESXi host into Maintenance Mode, moving off any powered-on or powered-off VMs.
Remove the host from all dvSwitches (Home > Inventory > Networking).
Remove the host from the Cluster.
Add the host back into the Cluster to push out fresh management agents and a config containing the corrected IP address of vCenter.
Take the host out of Maintenance Mode.
Re-add the host to the dvSwitches.
Repeat for all ESXi hosts that were unstable.
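Most of the steps above are vSphere Client operations, but if a host's client connection is flaky, the maintenance-mode steps can also be driven from the local console (a sketch, per ESXi 4.x):

```shell
# Console equivalents for the enter/exit Maintenance Mode steps
vim-cmd hostsvc/maintenance_mode_enter
vim-cmd hostsvc/maintenance_mode_exit
vim-cmd hostsvc/hostsummary | grep inMaintenanceMode   # check current state
```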
Everything should now stabilise (allow around 90 seconds). If so, re-enable HA on the cluster. If not, use the following VMware knowledge base article to troubleshoot other potential areas, such as a firewall between the ESXi hosts and the vCenter Server (if present).
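If a firewall does sit between the hosts and vCenter, the standard vSphere 4.x ports most relevant to this symptom are worth confirming first:

```text
TCP 443  - vSphere Client / vCenter to ESXi management
TCP 902  - vCenter to ESXi host management traffic
UDP 902  - ESXi host-to-vCenter heartbeats (blocked heartbeats alone
           are enough to show hosts as Not Responding)
```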
A few days after this was written, I noticed hosts were still occasionally disconnecting. Adding a hosts file to each ESXi host so that they can all resolve each other's names, and that of the vCenter Server, with or without DNS services being available, and clicking "Reconfigure for VMware HA" on each host from within vCenter, seems to have regained some stability.
The most immediate place to look for problems is the Summary tab for each host in vCenter. The trouble is that this usually gives very little away, describing a symptom rather than its possible causes. The best place to look is in the logs; not the messages log from the black and yellow console, but the vCenter agent and HA logs. Log on to the unsupported console on the ESXi hosts and tail the logs below.
/var/log/vmware/vpx/vpxa.log shows "Agent can't send heartbeat. No route to host." errors.
/var/log/vmware/aam/vmware_hostname.log shows date- and time-stamped "Node hostname has started receiving heartbeats from node hostname" informational events for intra-ESXi-host communications.
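A quick way to fish those signatures out of the logs (a hypothetical little helper; the paths shown are the ESXi 4.x defaults):

```shell
#!/bin/sh
# scan_logs DIR: print log lines matching the heartbeat/route error
# signatures described above. DIR is a directory of *.log files.
scan_logs() {
    grep -hE "No route to host|send heartbeat|receiving heartbeats" \
        "$1"/*.log 2>/dev/null
}

# On an ESXi host you would point it at the default log directories, e.g.:
#   scan_logs /var/log/vmware/vpx
#   scan_logs /var/log/vmware/aam
```
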
It’s worth noting that AAM (Legato’s Automated Availability Manager, the heartbeat technology underpinning HA) is massively dependent on DNS being right.
Four days after writing this, the hosts once again began to enter a Not Responding state, followed by a Disconnected state.
I have always suspected that cluster heartbeats were falling foul of log files being shipped to remote syslog servers. In ESXi there is a lot of logging going on; some of it, such as the entries in /var/log/vmware/hostd.log, is also replicated to /var/log/messages, effectively doubling the amount of logging, all of which then has to be replicated to (in my case) two remote syslog servers. This amounts to a pretty continuous stream of data travelling over the bonded physical NICs that ultimately carry not only the management network traffic but also vMotion. What alerted me to the suspicion that this could be the cause of my problems was slow vMotion when migrating guests between hosts. Also, when running ESXi on Cisco UCS with SNMP monitoring enabled, there is a lot of informational logging activity for hardware that is healthy (status Green).
Whilst my preference would be to split the bonded NICs (no loss of redundancy on Cisco UCS, provided the vNICs are set to failover at the UCS level), separating management and vMotion traffic, I have massively reduced the amount of logging being generated by making the following edit:
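The snippet itself hasn't survived in this copy, so treat the following as an assumption to verify against your own build: on ESXi 4.x the duplication is typically controlled by the log section of /etc/vmware/hostd/config.xml:

```xml
<!-- /etc/vmware/hostd/config.xml (assumed location and element names;
     verify against your build before editing) -->
<config>
  <log>
    <!-- stop hostd echoing its log entries into syslog,
         i.e. into /var/log/messages -->
    <outputToSyslog>false</outputToSyslog>
  </log>
</config>
```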
This stops the duplication of hostd log entries into /var/log/messages. You may be able to make similar changes to other agents for further reductions; I don't know. It's worth noting that if you make this change, you'll need to issue the following command to restart hostd.
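That command is also missing from this copy; on ESXi the usual ways to restart hostd from the console are (check which your build supports):

```shell
# Either restart hostd alone...
/etc/init.d/hostd restart
# ...or restart all management agents (more disruptive):
services.sh restart
```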
Another change I made was to create a new HA-enabled cluster in VirtualCenter and, after migrating all guests off each ESXi host, place each host into Maintenance Mode and move it to the new cluster. Upon taking the ESXi hosts out of Maintenance Mode, some re-enabled/re-deployed their HA agents successfully; some did not. For those that didn't, a restart of the management agents from the local console was sufficient to make "Reconfigure for HA" work. The problem with ESXi is that the logs are cleared after a reboot, so if a host has lost its management network connection and its local console has seized up, you can't read the logs (unless you're using a remote syslog server). These hosts' management agents were evidently dying, which will ultimately take down the management network too if left long enough, yet there's no warning that this is going on until the host goes into a Not Responding state, visible in VirtualCenter.
Since making this change, the ESXi hosts have not lost contact with the vCenter server at all (i.e. their management agent daemons have not terminated), nor have their management networks ground to a halt, in over a week. Based on my observations to date, I'm claiming this as a success and am very relieved to have got to the bottom of it.