TheSaffaGeek

My ramblings about all things technical



VCDX Troubleshooting Skills

This posting isn’t about my opinion on whether dropping the troubleshooting scenario is good or bad; in all honesty, more time in the design scenario sounds great to me. It is about the resources I used to prepare for my VCDX troubleshooting scenario, covering skills I think an architect, and therefore any good VCDX, should have.

  • The first resources I used were the ones from my preparations for the VCAP5-DCA, as that exam really makes you learn where all the logs are, what methods there are for troubleshooting issues and what you might be looking for. My study resources list for the VCAP5-DCA is a great start, and if you are at the point of defending for VCDX you should have used some of these in your preparations. What I went over again were the troubleshooting videos by David Davis. Even though they are old, the methods in them still apply, especially ESXTOP.
  • The next resources came from my two mentors for my recent VCDX attempt, Larus Hjartarson and Rene van den Bedem. Both of them did brilliant breakdowns of how to prepare and think during the scenario and the methodology you need to keep to. These methods give you a great plan of attack even if it is a real-world customer you are trying to help. Larus’ methodology is described in his VCDX: Troubleshooting Scenario posting and Rene’s in his VCDX – Troubleshooting Scenario Strategy posting.
  • The best real-world applicable resource I used, even though it didn’t map perfectly to the VCDX scenario methodology, was a book recommended to me by Frank Buechsel, who worked for VMware GSS until recently: Debugging: The Nine Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems. It is aimed more at software development, but each of the steps applies perfectly to troubleshooting any issue in a technology environment. Now that the scenario has been stopped, I can put out the loose outline from the book, the kinds of questions I wrote up under each heading that I wanted to ask in the scenario, what I wanted to say to explain why I was asking (originally in red), and how I thought it might fit:
    • UNDERSTAND THE SYSTEM
      • When did the problem start exactly?
      • What is meant to happen? – Why I am asking is because……………and what I’m hoping to achieve…………
      • When did you see the problem start happening? Is it recurring after a certain task or event or has it only happened once? – Why I am asking is because……………and what I’m hoping to achieve…………
      • Have any changes been made recently and are they tracked in a change management system? – Why I am asking is because……………and what I’m hoping to achieve…………
      • Have we collected logs or alerts from the systems, and are we using something like vCOps where we can drill down and see alarms or alerts? – Why I am asking is because these mechanisms can give us ideas of the failures and possibly where they are happening, if not in just one location, and what I’m hoping to achieve is to find the specific places the errors are showing and what the errors have been in the past if possible, but also to prepare for the next step of making it fail again so we can see the error again or collect it for the first time.
    • MAKE IT FAIL
      • If it happens around a certain event, can we try to replicate the error and make it happen as often as possible? – Why I am asking is because I want to confirm the error is in fact happening at the point you mention, and what I’m hoping to achieve is to find the exact step where it is happening and confirm whether our assumptions of when it is happening are true, so we don’t waste time troubleshooting an assumption.
      • When we are replicating the error, can we document each step? – Why I am asking is because I want to look at not just the step where it fails but also the steps leading up to it, in case a step in the sequence is causing the eventual failure, and what I’m hoping to achieve is to find the possible conflict or incorrect setting/step being followed.
    • QUIT THINKING AND LOOK
      • Are there any alarms or alerts on the source or destination system/s? – Why I am asking is because I want to confirm not just the outcome of the failure that you mention but hopefully what is causing the failure, and what I’m hoping to achieve is to find the point/component where we should do the troubleshooting so that we don’t make any unnecessary changes.
      • For the errors, can we search the VMware/vendor KB/forums and see if any matches come up for some/all of the errors? – Why I am asking is because some of the errors might be known or even just give us an idea of where to look, and what I’m hoping to achieve is to isolate the problem even more and not waste time looking at other components when a KB article might give us a good lead and save us precious time getting the issue fixed.
    • DIVIDE AND CONQUER
      • For the machines that are failing, are they the same configuration/going to the same location/coming from the same location/going over the same path? – Why I am asking is because I want to isolate the good parts/side and the bad parts/side, and what I’m hoping to achieve is to focus my attention on the side that is showing the error so we don’t waste time and have fewer things to cover in the hope we can isolate the problem.
      • Can we try the step in the opposite direction? – Why I am asking is because……………and what I’m hoping to achieve…………
    • CHANGE ONE THING AT A TIME
      • Try a migration/alteration/fix and if it doesn’t work then change it back and try something new. “Please can we migrate the failing machines to another host?” “It still fails.” “OK, please move it back.” – Why I am asking is because I don’t want to receive additional/red herring errors due to the change we made and what I’m hoping to achieve is to keep the environment unchanged as much as possible so we don’t cause additional errors/lose methods to troubleshoot.
    • KEEP AN AUDIT TRAIL (these were more writing out my thoughts and what I felt I needed to remember)
      • Write down what you did and the outcome and also WRITE DOWN THEIR RESPONSES as these may have clues!! “there are no errors in vSphere” might mean the error is not reaching vSphere for it to log the error so go “upstream” to find the source.
      • “The error doesn’t sound like it is in vSphere, so can we please look at the HBA on the host and use ESXTOP to check that it is connected correctly and receiving data?”
    • CHECK THE PLUG
      • You state that the network connections are correct but please can we get it checked again? – Why I am asking is because I want to confirm that what we state is correct is in fact correct right now and what I’m hoping to achieve is to clear up any assumptions and have clear and confirmed facts about necessary “upstream” components.
      • Have the steps you are following worked in the past? Are we following the exact steps that worked before? – Why I am asking is because I want to confirm whether it has ever worked and whether we are following a different process, and what I’m hoping to achieve is to find out if a new step is causing the error, so we can troubleshoot what the different step is bringing up.
    • GET A FRESH VIEW
      • Not really applicable to VCDX troubleshooting, but asking for someone who is an SME at the customer might shed some new light/clear up what the exact problem is.
    • IF YOU DIDN’T FIX IT, IT AIN’T FIXED
      • Not really applicable to VCDX troubleshooting.
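
The question pattern in the outline above (the question, why I am asking, what I’m hoping to achieve, and the response I got back) lends itself to a simple audit trail. Here is a minimal sketch in Python of how those notes could be kept. The class and field names (Step, AuditTrail and so on) are my own for illustration and are not from the book or from any VCDX material.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One question or action asked during a troubleshooting session."""
    rule: str           # heading from the outline, e.g. "MAKE IT FAIL"
    question: str       # what was asked or done
    why: str            # "Why I am asking is because ..."
    goal: str           # "what I'm hoping to achieve ..."
    response: str = ""  # what came back; may contain clues


@dataclass
class AuditTrail:
    """Ordered record of everything asked and every response received."""
    steps: List[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        self.steps.append(step)

    def unanswered(self) -> List[Step]:
        """Steps that still have no recorded response."""
        return [s for s in self.steps if not s.response]

    def render(self) -> str:
        """Plain-text view of the trail, numbered in the order asked."""
        lines = []
        for number, s in enumerate(self.steps, start=1):
            lines.append(f"{number}. [{s.rule}] {s.question}")
            lines.append(f"   Why: {s.why}")
            lines.append(f"   Goal: {s.goal}")
            lines.append(f"   Response: {s.response or '(none yet)'}")
        return "\n".join(lines)


trail = AuditTrail()
trail.record(Step(
    rule="CHECK THE PLUG",
    question="Can we get the network connections checked again?",
    why="I want to confirm that what we state is correct is correct right now",
    goal="Clear up assumptions about upstream components",
    response="there are no errors in vSphere",
))
trail.record(Step(
    rule="MAKE IT FAIL",
    question="Can we replicate the error and document each step?",
    why="I want to confirm the exact step where it fails",
    goal="Avoid troubleshooting an assumption",
))
print(trail.render())
```

Writing the response next to the question is the point: a reply like “there are no errors in vSphere” stays visible as a clue to go “upstream”.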

If you want to read about my utter joy about passing the VCDX then have a look at my VCDX #205 posting and also my VCDX Spotlight.

Next I’m hoping to dive deeper into each of the points from my VCDX #205 posting starting with VCDX Resources – Did you use them all??

Gregg



VCAP5-DCA: Objective 6

Objective 6 is what I believe is the core of the DCA exam, as being able to effectively troubleshoot anything in your environment means you know all the varying methods to do things and how things are tied together, and that is real administration. For this section I’ve been doing loads and loads of lab time: just building my lab I came across certain problems or failures along the way, which I’ve been trying to fix via the vMA, the command line and even the DCUI, and I’ve been purposely breaking things just so I can practice fixing them. I think the best way of really learning these skills is putting in a solid amount of time in your lab, as I believe the reason I failed my VCAP4-DCA the first time was down to not enough lab hours. When you’re under the time and nerve constraints that are part of the VCAP-DCA you make mistakes you wouldn’t normally make. I have also re-watched the Trainsignal VMware vSphere Troubleshooting Training videos, as David does a brilliant job covering it all.

There aren’t many differences between the VCAP4-DCA Objective 6 and the VCAP5-DCA Objective 6, but the differences (thanks to Ed Grigson’s breakdown) are:

  • Use esxcli system syslog to configure centralized logging on ESXi hosts

This is different because with vMA 5 the syslog feature has been deprecated, as the new VMware Syslog Collector is now available. The steps to do it now via esxcli are covered perfectly on pages 10 and 11 of this VMware PDF: http://pubs.vmware.com/vsphere-50/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-50-command-line-management-for-service-console-users.pdf
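
As a rough lab aid, here is a small Python sketch that assembles the esxcli command lines for pointing a host at a remote syslog target, in the order I would expect to run them in the ESXi Shell. It only builds strings and does not connect to a host. The function name and the collector address 192.0.2.10 are made up for illustration, so check the commands against the PDF above before relying on them.

```python
from typing import List


def syslog_commands(loghost: str, port: int = 514, protocol: str = "udp") -> List[str]:
    """Return the esxcli commands for centralized logging, in order.

    loghost is a hypothetical collector address supplied by the caller.
    """
    if protocol not in ("udp", "tcp", "ssl"):
        raise ValueError(f"unsupported protocol: {protocol}")
    target = f"{protocol}://{loghost}:{port}"
    return [
        # Point the host at the remote collector.
        f"esxcli system syslog config set --loghost={target}",
        # Make the syslog service pick up the new configuration.
        "esxcli system syslog reload",
        # Allow outbound syslog traffic through the ESXi firewall.
        "esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true",
        "esxcli network firewall refresh",
        # Show the resulting configuration for verification.
        "esxcli system syslog config get",
    ]


for command in syslog_commands("192.0.2.10", protocol="tcp"):
    print(command)
```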

  • Install and configure VMware Syslog Collector and ESXi Dump Collector

This is really simple and is something you can learn to do very quickly in your lab. I tested this part during the building of my Auto Deploy testing. All the steps for the syslog collector are detailed here: http://blogs.vmware.com/esxi/2011/07/setting-up-the-esxi-syslog-collector.html and the steps for the dump collector are here: http://blogs.vmware.com/esxi/2011/07/setting-up-the-esxi-50-dump-collector.html
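
On the host side, the dump collector is configured through the esxcli system coredump network namespace. The Python sketch below just assembles those command lines so the sequence is easy to see. The function name, the server address 192.0.2.20 and the vmk0 default are my own placeholders; 6500 is assumed here as the collector port, so confirm it against your own install.

```python
from typing import List


def dump_collector_commands(server_ipv4: str, vmk: str = "vmk0", port: int = 6500) -> List[str]:
    """Return the esxcli commands that send core dumps to a network collector.

    server_ipv4 and vmk are placeholders for the lab's own values.
    """
    return [
        # Tell the host which VMkernel interface and collector to use.
        "esxcli system coredump network set"
        f" --interface-name {vmk} --server-ipv4 {server_ipv4} --server-port {port}",
        # Enable network core dumps once the target is set.
        "esxcli system coredump network set --enable true",
        # Display the resulting settings for verification.
        "esxcli system coredump network get",
    ]


for command in dump_collector_commands("192.0.2.20"):
    print(command)
```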

  • Configure and administer Port Mirroring

A new distributed vSwitch feature. Eric Sloof has done a brilliant video detailing how to do it here: http://www.ntpro.nl/blog/archives/1825-Video-How-to-setup-a-vSphere-5-Port-Mirror.html . It’s pretty simple to set up.

  • Utilize Direct Console User Interface (DCUI) and ESXi Shell to troubleshoot, configure, and monitor ESXi networking

This is pretty straightforward I think, as you need to know what kinds of networking connectivity tasks you can do via the DCUI (like restoring a standard switch) and how you can use the ESXi Shell to change configurations/fix problems. I think this is all about playing in the lab and learning.

  • Use esxcli to troubleshoot multipathing and PSA-related issues

The main difference here is that it is now just esxcli, so it’s all about being able to do things via esxcli. This part in particular was covered in Objective 1.3.
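
For lab practice I find it useful to have the read-only multipathing commands in one place. The Python sketch below keeps them as a checklist and adds one helper that builds the command for changing a device’s path selection policy. The names PATH_CHECKS and set_psp_command, and the device ID in the example, are hypothetical and only there for illustration.

```python
from typing import List, Tuple

# Read-only esxcli commands and what each one shows.
PATH_CHECKS: List[Tuple[str, str]] = [
    ("esxcli storage core path list", "every path and its state"),
    ("esxcli storage nmp device list", "SATP and PSP in use per device"),
    ("esxcli storage nmp satp list", "loaded SATPs and their default PSP"),
    ("esxcli storage nmp psp list", "available path selection policies"),
    ("esxcli storage core claimrule list", "claim rules that decide which plugin owns a path"),
]

# The three path selection policies that ship with NMP.
KNOWN_PSPS = {"VMW_PSP_FIXED", "VMW_PSP_MRU", "VMW_PSP_RR"}


def set_psp_command(device: str, psp: str = "VMW_PSP_RR") -> str:
    """Build the command that changes the PSP for one device."""
    if psp not in KNOWN_PSPS:
        raise ValueError(f"unknown PSP: {psp}")
    return f"esxcli storage nmp device set --device {device} --psp {psp}"


for command, purpose in PATH_CHECKS:
    print(f"{command:40} # {purpose}")

# "naa.0000000000000001" is a made-up device identifier.
print(set_psp_command("naa.0000000000000001"))
```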

  • Use esxcli to troubleshoot VMkernel storage module configurations

Yet again this is down to your knowledge of how to troubleshoot the storage modules via esxcli. How to do this via esxcli is covered in the VMware documentation library here: http://pubs.vmware.com/vsphere-50/index.jsp?topic=/com.vmware.vcli.getstart.doc_50/cli_about.html
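
As a starting point for practice, the Python sketch below builds the esxcli system module commands I would use to inspect a VMkernel module and its parameters. The function name is my own, and iscsi_vmk is used only as an example module name; take the real name from the output of the first command on your own host.

```python
from typing import List


def module_commands(module: str) -> List[str]:
    """Return read-only esxcli commands for inspecting one VMkernel module."""
    return [
        # List all modules with their loaded and enabled state.
        "esxcli system module list",
        # Show details for the module in question.
        f"esxcli system module get --module={module}",
        # Show the parameters the module accepts and their current values.
        f"esxcli system module parameters list --module={module}",
    ]


# "iscsi_vmk" is only an example; use a name from `esxcli system module list`.
for command in module_commands("iscsi_vmk"):
    print(command)
```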

  • Use esxcli to troubleshoot iSCSI related issues

Another one where you will need to practice and learn how to do it via esxcli. All the commands and some great examples are all listed in the VMware documentation library here: http://pubs.vmware.com/vsphere-50/topic/com.vmware.vcli.examples.doc_50/cli_manage_iscsi_storage.7.5.html
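
To give an idea of the order I would work through things, here is a Python sketch that builds a sequence of esxcli iscsi commands, going from the adapter itself, to its network binding, to discovery and finally sessions. The function name and the adapter name vmhba33 are placeholders I picked for illustration; the real adapter name comes from the first command.

```python
from typing import List


def iscsi_checks(adapter: str) -> List[str]:
    """Return esxcli commands for checking one iSCSI adapter, outermost first."""
    return [
        # Which iSCSI adapters exist and what state they are in.
        "esxcli iscsi adapter list",
        f"esxcli iscsi adapter get --adapter={adapter}",
        # Which VMkernel ports are bound to the adapter.
        f"esxcli iscsi networkportal list --adapter={adapter}",
        # Which send targets are configured for dynamic discovery.
        f"esxcli iscsi adapter discovery sendtarget list --adapter={adapter}",
        # Which sessions are currently established.
        f"esxcli iscsi session list --adapter={adapter}",
        # Trigger rediscovery after fixing a target or binding problem.
        f"esxcli iscsi adapter discovery rediscover --adapter={adapter}",
    ]


# "vmhba33" is a placeholder adapter name.
for command in iscsi_checks("vmhba33"):
    print(command)
```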

  • Utilize Direct Console User Interface (DCUI) and ESXi Shell to troubleshoot, configure, and monitor an environment

This is down to playing around in your lab with it and knowing what kinds of troubleshooting you can do via the DCUI and the ESXi Shell. This is VCP5 stuff, so you should know this already.

I’ve spent 70% of my lab time on this section, as while building and trying out things in my lab I end up breaking things or things don’t work the first time, so I’ve been able to just mess around with all the tools and get it working again. As I said at the beginning, I think spending a large amount of time learning and trying out everything in this section is extremely important for the DCA exam.

Gregg