Juju, Hadoop and OpenStack. Amazing.

Wow.  This is the coolest thing I’ve seen since vMotion.  Watch from 15:00 for 40 seconds as Mark Shuttleworth migrates an entire infrastructure-as-a-service stack from one public cloud (Amazon EC2) to another (HP) with Juju.  Just amazing.



Teaming Management NICs

The VMware ESXi hypervisor with multiple NICs can be configured in a multitude of ways, depending on the number of NICs on board.

My lab hypervisors only have two, but that is enough to present a choice in itself: split management, vMotion and iSCSI traffic across them, or alternatively team the two NICs and put all vmkernel ports, storage adapters and management traffic over a common active-active bonded link.

The lab environment has been running flawlessly for months with a physical split configured between management and vmotion/iscsi networks so I thought I’d configure up the “alternative” scenario and let that run to see how things go.

One thing to look out for when reconfiguring the networking on the ESXi hosts (apart from making sure all the vmkernel port names match perfectly, as before) is that both physical NICs are active afterwards.


Note one NIC is in standby.

One of mine did it automatically, the other didn’t.  This left me in an unforeseen situation whereby one of my hosts was attempting to run everything over a single NIC, without the full bandwidth benefit of both.  This is definitely not recommended, although in testing, vMotion was still rapid – most likely because very little else was going on.

This would not be the case in a production environment, and I’d certainly recommend migrating all your guests off any host being reconfigured and putting it into maintenance mode.  I didn’t do either of these things but that said, the whole point is to push my lab to breaking point and document the experience – which is what happened.  More on that later.


Click Move Up to make the second vmnic active.

With both NICs active, you should see the following…

…both NICs become active.

This change may also require you to connect to the local console of each ESXi host and manually restart the management network.  This is certainly the case for earlier releases such as ESXi 4.0.0.

Upon removing the second vSwitch, my ESXi host lost its connection to the iSCSI datastore, and with it the VirtualCenter VM’s hard disk.  Ordinarily this would not be a problem, since in a clustered environment the other ESXi host would restart the guest; however, the network configuration was mid-change and so did not match on both hosts in the cluster.  This is called “properly breaking it” where I’m from, but it is where the real learning happens.  Let it be in your back bedroom though, and not on the datacentre floor.  To recover from the situation I first attempted to shut down the VM using the unsupported console (covered in an earlier post), since the ESXi host said it was still powered on.  It did not want to power off, or power on, or reset.  In fact the ESXi host didn’t want to reboot either, so it got a hard reset in the form of me pushing in and holding the power button, while wondering if I’d have to build an out-of-band VM with vSphere Client installed so that I could complete the network configuration of the ESXi host.

After the reboot, I noticed that the cluster had restarted the guests, including the VirtualCenter server, on the other, healthy host – which I thought was pretty impressive, since it saved me a bunch of hassle.  This enabled me to continue reconfiguring the new vMotion vmkernel port on the bonded NICs.  A quick check over suggested everything was consistent, except I’d lost visibility of the iSCSI target.  A quick rescan and it re-appeared, and a successful vMotion of my DLNA server in mid-flight proved it was all healthy again.  I’ll see how well it behaves unattended over the next few weeks and months.  I’d like to know how I might force a rescan of the iSCSI storage adapter for the target where the datastore resides from the unsupported console, though.  I wouldn’t be surprised to find out it can’t be done, in which case I’d find myself installing vSphere Client on another machine and doing it through the GUI.
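For what it’s worth, a rescan from the unsupported console does appear to be possible – a sketch, hedged since I’ve not verified it on every ESXi release, and the vmhba number below is only an example (list your adapters with esxcfg-scsidevs -a to find yours):

```shell
# Rescan a specific storage adapter for new targets/LUNs
# (vmhba33 is a placeholder - substitute your iSCSI software adapter)
esxcfg-rescan vmhba33
# Refresh the host's view of its storage so datastores re-appear
vim-cmd hostsvc/storage/refresh
```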


Setting a Round-Robin Fibre Channel Path Policy on ESXi

If your datastores are using Fibre Channel storage and you have multiple FC HBAs connected through to the SAN via an FC switch or two, then it is prudent to optimise the I/O potential of all the redundant hardware by changing the FC path policy to “Round Robin”.

Using the vSphere Client, connect to the ESXi host / vCenter Server

Inventory, Hosts and Clusters, select ESXi host

Configuration tab, Storage, highlight the datastore, click Properties

Click Manage Paths button on the DataStore properties dialog

Change Path Selection to Round Robin (VMware) – the default is Most Recently Used (VMware)

Wait for the screen to reload, Click Close

Repeat for each DataStore, and then Repeat for each ESXi host.
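The same change can also be scripted per device from the console.  A sketch using the ESXi 4.x-era esxcli namespace (the naa. identifier below is a placeholder – list your own devices first; note that on ESXi 5.x and later the namespace became esxcli storage nmp):

```shell
# List devices and their current path selection policies
esxcli nmp device list
# Set Round Robin on a specific device (substitute your own naa. identifier)
esxcli nmp device setpolicy --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR
```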



Automatic startup of ESXi Guests

It’s not immediately obvious where you configure VMs to start up automatically when the ESXi host starts.

In the Hosts and Clusters view (in VirtualCenter), click on the ESXi host – that’s the HOST, not a GUEST, i.e. the machine running VMware ESXi, not a virtual machine.

On the Configuration tab, in Software settings, select Virtual Machine Startup/Shutdown.

By default, automatic startup of virtual machines is disabled, so you need to enable it before you can move the VMs up into the Automatic Startup section.  Click Properties in the top left-hand corner and tick

Allow Virtual Machines to start and stop automatically with the system.

Select your DC and/or VC VM and move it all the way up so it sits in the Automatic Startup section of the Startup Order dialog box.

Apply a delay if you want to, but don’t choose a value less than 90 seconds.
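The same settings can reportedly be driven from the unsupported console with vim-cmd – a sketch only, hedged because the exact argument order of update_autostartentry varies between releases, so check the usage output on your build (## is the vmid from vim-cmd vmsvc/getallvms):

```shell
# Equivalent of ticking "Allow Virtual Machines to start and stop
# automatically with the system"
vim-cmd hostsvc/autostartmanager/enable_autostart true
# Add VM ## to the autostart sequence with a 90 second delay
# (arguments are roughly: vmid, start action, start delay, start order,
#  stop action, stop delay, wait-for-heartbeat - verify on your release)
vim-cmd hostsvc/autostartmanager/update_autostartentry ## powerOn 90 1 guestShutdown 90 systemDefault
# Review the resulting startup sequence
vim-cmd hostsvc/autostartmanager/get_autostartseq
```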


Resetting Cisco UCS KVM

From experience, it’s not uncommon to be unable to connect to the KVM of a Cisco UCS blade.   Instead of seeing a remote console screen, you’ll receive a “Connect failed” or “Request Shared Session” message, with no means of getting to the console.

Within the Service Profile, click on the Server Details tab. From there, click Recover Server and select “Reset CIMC (Server Controller)”, then choose Reset KVM Controller.  This will kill existing KVM sessions and allow you to start a new session. Resetting the CIMC does not affect data traffic to/from the server NICs (Ethernet and HBAs).

Another thing to check, in the Servers tab, General tab, is the Management IP Address setting.  If it’s configured to take an address from a pool, check the pool in the Admin tab, Management IP Pool, IP Addresses tab to see what IPs exist in the range, and what’s been assigned.

If a reset hasn’t worked, in the Servers tab, General tab, Management IP Address section, change the IP address from Pooled to Static.  Use an IP address from the other end of the range in the pool.  Click Save Changes, and try connecting to the KVM again.



Hardening VMware Guests (VMs)

The guest VM needs to be shut down.

Remove any superfluous hardware such as CD-ROM drives, floppy drives and USB devices.

In the Inventory panel, right-click the virtual machine, Settings, Options, Advanced, General

Click the Configuration Parameters button

Add the following lines to the guest’s .vmx file…
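The exact lines depend on which hardening guide revision you follow; typical guest-isolation parameters look like the following – treat this as an illustrative subset rather than a definitive list:

```
isolation.tools.copy.disable = "TRUE"
isolation.tools.paste.disable = "TRUE"
isolation.tools.dnd.disable = "TRUE"
isolation.tools.setGUIOptions.enable = "FALSE"
isolation.tools.diskShrink.disable = "TRUE"
isolation.tools.diskWiper.disable = "TRUE"
isolation.device.connectable.disable = "TRUE"
isolation.device.edit.disable = "TRUE"
```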










Booting an in-band VirtualCentre Server VM from the ESXi console

If your VirtualCenter server is itself a VM, then it’ll be running on an ESXi host.  In the event that the ESXi host is restarted without vMotioning the VirtualCenter server off first (such as when the management network is irrecoverably unresponsive), then depending on your environment, you may not be able to get a remote connection to the VM after the host has restarted.  In this scenario, you’d need to boot the VM from the unsupported console.  This is how to do it.

Connect to the iLO or equivalent management interface of the ESXi host, send an Alt-F1, and type unsupported followed by the root password to obtain a prompt on the unsupported console.

Identify the VMs resident on the host:

vim-cmd vmsvc/getallvms

Identify the current power state of the VM running VirtualCenter, where ## is the number of the VM identified above:

vim-cmd vmsvc/power.getstate ##

Power on the VM:

vim-cmd vmsvc/power.on ##
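If you know the VM’s display name, the lookup and power-on can be strung together – a sketch, assuming the name contains “vcenter” (substitute your own):

```shell
# Find the vmid of the vCenter VM by display name, then power it on
vmid=$(vim-cmd vmsvc/getallvms | grep -i vcenter | awk '{print $1}')
vim-cmd vmsvc/power.getstate "$vmid"
vim-cmd vmsvc/power.on "$vmid"
```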


ESXi hosts keep dropping out of vCenter cluster

If the ESXi hosts in your cluster keep going into a Not Responding or Disconnected state, then the following things should be checked immediately – the DNS addresses on the ESXi hosts, the hosts files on the ESXi hosts, the Managed IP Address setting in vCenter, and the hosts file on the vCenter Server.

Check that the DNS addresses on each ESXi host’s management interface point to the DNS servers containing the A records of the ESXi hosts and the vCenter Server itself.  Make sure they’re not pointing at the wrong DNS servers – log onto the unsupported console and perform an nslookup of the vCenter Server to check.

Name resolution depends on an external service, so bolster its resilience with hosts files at either end.
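A minimal sketch of what those hosts file entries might look like – the names and addresses below are made-up examples, so use your own:

```
# /etc/hosts on each ESXi host and on the vCenter server
192.168.0.11   esxi01.lab.local    esxi01
192.168.0.12   esxi02.lab.local    esxi02
192.168.0.20   vcenter.lab.local   vcenter
```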

From the vSphere Client, log in to vCenter Server
Navigate to Administration > vCenter Server Settings > Runtime Settings and review the Managed IP Address setting.
Verify that the address is correct (use ipconfig to discover the correct IP address for the vCenter management (v)LAN NIC).  Check all octets for correctness – be aware that they may not match those of the ESXi hosts, so check the design document or consult the infrastructure architect.

Correct the entry and click OK to save your changes and close the dialog.
Restart the ESXi host(s) if locked up.  If not, connect it back into vCenter.

If you need to restart a host, and the vCenter server is running on a VM on it, then you should try to connect to the vCenter server over RDP first and shut it down.

Note: once the ESXi host has restarted, you can power on the in-band VC VM using the instructions here… http://www.cyberfella.co.uk/2012/05/01/booting-vm-from-console/

Disable HA on the Cluster.

Put the ESXi host into Maintenance Mode, moving off any powered-on or powered-off VMs
Remove the host from all DvSwitches in Home, Inventory, Networking
Remove the host from the Cluster
Add the host back into the Cluster to push out new management agents and config containing the corrected IP address of vCenter.
Take the host out of Maintenance Mode.

Re-add the host to the DvSwitches.

Repeat for all ESXi hosts that were unstable.

It should now all stabilise (after a 90-second wait).  If so, re-enable HA on the cluster.   If not, use the relevant VMware knowledge base article to troubleshoot other potential areas, such as the firewall between the ESXi hosts and the vCenter server (if present).


A few days after this was written, I noticed hosts still disconnecting, albeit rarely.  Adding a hosts file to each ESXi host so that they can all resolve each other’s names – as well as that of the vCenter server – with or without DNS services being available, and clicking “Reconfigure for VMware HA” on each host from within vCenter, seems to have regained some stability.

The most immediate place to look for problems is the Summary tab for each host in vCenter.  The trouble is that this usually gives very little away, describing a symptom rather than its possible causes.  The best place to look is in the logs – not the messages log from the black and yellow console, but the vCenter and HA logs.   Log onto the unsupported console on the ESXi hosts and tail the logs below.

/var/log/vmware/vpx/vpxa.log     Shows “Agent can’t send heartbeat. No route to host.” errors.

/var/log/vmware/aam/vmware_hostname.log       Shows date- and time-stamped “Node hostname has started receiving heartbeats from node hostname” informational events for intra-ESXi-host communications.

It’s worth noting that AAM is a Legato heartbeat technology and is massively dependent on DNS being right.

Four days after writing this, the hosts once again began to enter a not responding state, followed by a disconnected state.

I have always suspected that cluster heartbeats are falling foul of log files being shipped to remote syslog servers.  In ESXi, there is a lot of logging going on; some of it, such as the entries in /var/log/vmware/hostd.log, is also replicated to /var/log/messages, effectively doubling the amount of logging, which then all has to be replicated to (in my case) two remote syslog servers.  This amounts to a pretty continuous stream of data travelling over the bonded physical NICs that ultimately handle not only all management network traffic, but vMotion too.  What alerted me to the suspicion that this could be the cause of my problems was slow vMotion when migrating guests between hosts.  Also, when running ESXi on Cisco UCS with SNMP monitoring enabled, there is a lot of informational logging activity for hardware that is healthy (status Green).

Whilst my preference would be to split the bonded NICs (no loss of redundancy on Cisco UCS, provided the vNICs are set to failover at the UCS level), separating management and vMotion traffic, I have massively reduced the amount of logging generated by making an edit to the hostd logging configuration.



This stops hostd log entries being duplicated into /var/log/messages.  You may be able to make similar changes to other agents to reduce logging further – I don’t know.  It’s worth noting that if you make this change, you’ll need to issue the following command to restart hostd.

/etc/init.d/hostd restart
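Assuming the edit in question was to hostd’s own logging configuration in /etc/vmware/hostd/config.xml (an assumption on my part – both the file and the element name should be verified against your own ESXi release), the relevant fragment would look something like this:

```xml
<config>
  <log>
    <!-- assumed setting: stop hostd entries being duplicated
         into /var/log/messages via syslog -->
    <outputToSyslog>false</outputToSyslog>
  </log>
</config>
```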

Another change I made was to create a new HA-enabled cluster in VirtualCenter and, after migrating all guests off each ESXi host, place each host into maintenance mode and move it to the new cluster.  Upon taking the ESXi hosts out of maintenance mode, some re-enabled/re-deployed their HA agents successfully; some did not.  For those that didn’t, a restart of the management agents from the local console was sufficient to make “Reconfigure for HA” work.  The problem with ESXi is that the logs are cleared after a reboot, so if a host has lost its management network connection and its local console has seized up, you can’t read the logs (unless you’re using a remote syslog server).  These hosts’ management agents were obviously dying, which will ultimately take down the management network too if you leave it long enough, yet there’s no warning that this is going on until the host goes into a Not Responding state – visible in VirtualCenter.

Since making this change, the ESXi hosts have not lost contact with the vCenter server at all (i.e. their management agent daemons haven’t terminated), nor have their management networks ground to a halt, in over a week.  Based upon my observations to date, I’m claiming this as a success and am very relieved to have got to the bottom of it.



Enabling SSH on VMWare ESXi hosts

Log on to the local console: press Alt + F1, type unsupported and press ENTER to connect to the console.

Enter the password when prompted.

vi /etc/inetd.conf

Remove the # at the beginning of the #ssh line to uncomment the ssh service.

Type :wq! to write the changes and quit the vi editor.

Identify the inetd process using ps | grep inetd

Restart the inetd service with kill -HUP <pid>

clear, exit, then Alt + F2 to log out of the unsupported console.  Esc to log out of the local management console.
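The vi steps above can also be scripted.  A sketch that demonstrates the uncomment step against a throwaway copy of the file – on a real host you’d back up and edit /etc/inetd.conf itself, then send inetd the HUP as above:

```shell
# Work on a throwaway sample to illustrate; the real file is /etc/inetd.conf
f=/tmp/inetd.conf.sample
cat > "$f" <<'EOF'
#ssh  stream  tcp  nowait  root  /sbin/dropbearmulti  dropbear ++min=0,swap,group=shell -i
EOF
# Uncomment the ssh service line (same effect as removing the # in vi)
sed -i 's/^#ssh/ssh/' "$f"
grep '^ssh' "$f"
```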

You’ll notice a warning appear in the vSphere Client stating that the remote administration console has been enabled.  This is considered a security risk, but it is possible to suppress the warning if you wish to leave SSH enabled (not recommended).

Before quitting the console, type

esxcli system settings advanced set -o /UserVars/SuppressShellWarning -i 1

to disable the warning, or

esxcli system settings advanced set -o /UserVars/SuppressShellWarning -i 0

to re-enable it (recommended if you disable the ssh console again).


Management Logs on Cisco UCS Blades

As of firmware 1.4.1 (old now), the Management Logs tab was renamed SEL Logs.

If you’re running VMware ESXi on a Cisco UCS B200 blade, then you may notice a hardware event trigger in vCenter Server, with a fault of System Board 0 SEL_FULLNESS.

This occurs when the UCS management log for a given blade breaches its own monitoring threshold of 90% full.

To clear it, log into UCS Manager: Equipment tab, Servers, Server n, SEL Logs tab, and Backup or Clear the log.

Don’t forget to at least take a look at the log to make sure it hasn’t filled up due to real, unresolved hardware problems.  The SEL log records absolutely everything that goes on, to the extent of even logging LEDs turning on and off on the equipment, so these logs fill up quite quickly.