This blog posting has been sitting in my drafts for a few weeks now, as I’ve been battling and troubleshooting for ages so that I could give the solution to all the Distributed Virtual Switching problems we’ve been having. Thankfully I believe I finally can, although I’m amazed that this may be the only blog out there with a solution. I put this down mainly to Distributed Virtual Switches only being available in the Enterprise Plus edition of vSphere, so not many people either feel the need to get this edition or can get their companies to see the need to buy it. Thankfully I work for EMC, so I was able to procure myself a licence key for this edition and set myself on the way to many eventual problems.
As I said in my open question on the VMware Communities, the machines always seemed to fall off the network at differing times and showed no kind of pattern. Later on I noticed the machines were for some reason losing their ARP tables. The solution I found is one I am still unable to find a VMware article about.
It all came down to a difference in ESX versions and virtual hardware. Not 3.5 versus vSphere (I’m not that thick… often) but the build versions. It seems that ESX servers on builds pre Update 1 and ESX servers with Update 1 installed lose connectivity between themselves. For instance, when I had five servers on an ESX 4.0.0, 175625 build (pre Update 1) host and five on an ESX 4.0.0, 208167 build (Update 1a) host, all ten servers would initially communicate fine with no problems. Over time, though, all the machines on the pre Update 1 host would lose connectivity to the other machines on the same host, to the machines on the Update 1a host and to the outside world (in other words, they lose all connectivity). The five servers on the Update 1a host, on the other hand, won’t lose connectivity to each other (although if the DNS server they are using is on the pre Update 1 host then obviously DNS will be lost) or to the outside world.
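To make the build mismatch concrete, here is a minimal Python sketch that flags an environment mixing the two builds. The host names and the inventory dictionary are invented for illustration; only the two build numbers come from the example above, and using 208167 as the cutoff is an assumption rather than anything VMware documents. How you collect the build numbers (the vSphere Client, or `vmware -v` on the service console) is up to you.

```python
# Hypothetical inventory check: host name -> ESX 4.0.0 build number.
# Only the two build numbers come from the post; host names are invented.
PRE_UPDATE_1_BUILD = 175625
UPDATE_1A_BUILD = 208167  # assumed cutoff for "updated" hosts


def split_by_build(hosts, cutoff=UPDATE_1A_BUILD):
    """Return (older, current) lists of host names, split at the cutoff build."""
    older = sorted(name for name, build in hosts.items() if build < cutoff)
    current = sorted(name for name, build in hosts.items() if build >= cutoff)
    return older, current


def has_mixed_builds(hosts, cutoff=UPDATE_1A_BUILD):
    """True when hosts below and at/above the cutoff build coexist."""
    older, current = split_by_build(hosts, cutoff)
    return bool(older) and bool(current)


if __name__ == "__main__":
    hosts = {"esx01": PRE_UPDATE_1_BUILD, "esx02": UPDATE_1A_BUILD}
    older, _ = split_by_build(hosts)
    if has_mixed_builds(hosts):
        print("Mixed builds found, upgrade these hosts first:", older)
```

If this reports a mix, the hosts in the older list are the ones whose machines would be expected to drop off over time.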
So the steps I followed to fix the problem were:
- Firstly, upgrade the hosts to the latest version. This can be done with VMware Update Manager if you have it set up in your environment, or the way I did it, with esxupdate. I know loads of you who have been in the virtualisation field for a while will know this tool well, as it was the only tool you could use pre ESX 3.5 to update your machines, and I’m still puzzled why the vSphere Host Update Utility cannot patch or upgrade ESX 4.0 hosts. I was going to write up the steps I use, but David Davis (@davidmdavis) of TrainSignal fame has written up a great step-by-step guide on how to do this if you’re not familiar.
- Once this is done you will then need to upgrade the virtual machine hardware to version 7. Scott Lowe has done a brilliantly detailed posting on how to do this and on the changes you need to make to allow you to use the latest networking capabilities. I know a bunch of you will think that you don’t need to update your ESX hosts to the latest version to be able to upgrade virtual machine hardware, but when I tried (due partly, I believe, to the problems I was experiencing) I got a very vague error. Only once I had migrated the machine to the latest host would it let me upgrade the virtual hardware. My colleague Simon Phillips noticed that the virtual hardware version was a difference between the machines that worked and the ones that didn’t, so credit is due to him for spotting this and for finding Scott’s posting on how to upgrade the virtual hardware.
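The order of the two steps above can be sketched in a few lines of Python. This is only an illustration of the decision described in the list: the VM records, host names and the `next_step` helper are hypothetical, and the build cutoff is simply the Update 1a build number mentioned earlier in the post, not an official threshold.

```python
# Hypothetical planner for the two steps: host first, then virtual hardware.
TARGET_HW_VERSION = 7
UPDATED_BUILD = 208167  # Update 1a build from the post, used as an assumed cutoff


def next_step(vm, host_builds):
    """Return the next action for one VM as a short string.

    vm is a dict with "host" and "hw_version" keys;
    host_builds maps host name -> ESX build number.
    """
    if vm["hw_version"] >= TARGET_HW_VERSION:
        return "nothing to do"
    if host_builds[vm["host"]] < UPDATED_BUILD:
        # The hardware upgrade only worked once the VM was on an updated host.
        return "upgrade host or migrate VM first"
    return "upgrade virtual hardware"


if __name__ == "__main__":
    host_builds = {"esx01": 175625, "esx02": 208167}
    vms = {
        "web01": {"host": "esx01", "hw_version": 4},
        "web02": {"host": "esx02", "hw_version": 4},
        "dns01": {"host": "esx02", "hw_version": 7},
    }
    for name, vm in sorted(vms.items()):
        print(name, "->", next_step(vm, host_builds))
```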
After these changes the machines all communicated without any problems and, almost a week in, haven’t shown any of the problems we were experiencing.
Funnily enough, while building up this blog posting I came across a load of really interesting articles from fellow virtualisation professionals, and I was going to do a wrap-up of it all with thoughts on whether to put your machine on standard or distributed switches and whether you should make it a virtual machine or not. But as of this morning Richard Brambley (@rbrambley) did a great one himself on the virtual centre side, so definitely have a read. Also worth a read are these articles, all surrounding the same topics and the problems and opinions some of the top people have had or come across.
Sadly, after finding these solutions we’re now having to migrate all our machines back to standard switches, due to our virtual centre server having database problems and needing a rebuild. I still think I would like to try using Distributed Virtual Switches again in the future, but unless you have an enormous environment where you need DVSs, I feel standard switches are more than adequate and at the moment less of a pain.
Also a big thanks to Simon Phillips for all his help in this, Gabrie van Zanten for chatting through loads of it with me on gchat, all the guys on Twitter who replied to me with ideas, the people who replied to my VMware Communities question and the VMware helpdesk guy I caught unawares with all my questions when he called me about my virtual centre problems.
I’m always open for a chat/troubleshoot if you’re having the same problems so either leave a comment below or add me on twitter at @greggrobertson5.