TheSaffaGeek

My ramblings about all things technical

VCDX Troubleshooting Skills


This posting isn’t about my opinion on whether dropping the scenario is good or bad; in all honesty, more time in the design scenario sounds great to me. It is actually about the resources I used to prepare for my VCDX troubleshooting scenario, which cover what I think an architect should know and therefore any good VCDX should too.

  • The first resources I used were the ones from my preparations for the VCAP5-DCA, as that exam really makes you learn where all the logs are, what troubleshooting methods exist and what you might be looking for. My study resources list for the VCAP5-DCA is a great start, and if you are at the point of defending for VCDX you should already have used some of these in your preparations. What I went over again were the troubleshooting videos by David Davis. Even though they are old, the methods in them still apply, especially ESXTOP.
  • The next resources came from my two mentors for my recent VCDX attempt, Larus Hjartarson and Rene van den Bedem. Both of them did brilliant breakdowns of how to prepare and think during the scenario and the methodology you need to stick to. These methods give you a great plan of attack even if it is a real-world customer you are trying to help. Larus’ methodology is covered in his VCDX: Troubleshooting Scenario posting and Rene’s in his VCDX – Troubleshooting Scenario Strategy posting.
  • The best real-world applicable resource I used, even though it didn’t map perfectly to the VCDX scenario methodology, was a book called Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems, recommended to me by Frank Buechsel, who worked for VMware GSS until recently. It is aimed more at software development, but each of the steps applies perfectly to troubleshooting any issue in a technology environment. Now that the scenario has been stopped, I can put out the loose outline from the book, the kinds of questions I wrote up under each heading to ask in the scenario, and what I wanted to say to explain why I was asking and how I thought it might fit:
    • UNDERSTAND THE SYSTEM
      • When did the problem start exactly?
      • What is meant to happen? – Why I am asking is because……………and what I’m hoping to achieve…………
      • When did you see the problem start happening? Is it recurring after a certain task or event or has it only happened once? – Why I am asking is because……………and what I’m hoping to achieve…………
      • Have any changes been made recently and are they tracked in a change management system? – Why I am asking is because……………and what I’m hoping to achieve…………
      • Have we collected logs or alerts from the systems, and are we using something like vCOps where we can drill down and see alarms or alerts? – Why I am asking is because these mechanisms can give us an idea of the failures and possibly where they are happening, if not in just one location, and what I’m hoping to achieve is to find the specific places the errors are showing and what the errors have been in the past if possible, but also to prepare for the next step of making it fail again so we can see the error again or collect it for the first time.
    • MAKE IT FAIL
      • If it happens around a certain event, can we try to replicate the error and make it happen as often as possible? – Why I am asking is because I want to confirm the error is in fact happening at the point you mention, and what I’m hoping to achieve is to pinpoint the exact step where it is happening and confirm whether our assumptions about when it happens are true, so we don’t waste time troubleshooting an assumption.
      • While we are replicating the error, can we document each step? – Why I am asking is because I want to look not just at the step where it fails but also at the steps leading up to it, in case an earlier step in the sequence is causing the eventual failure, and what I’m hoping to achieve is to find the possible conflict or incorrect setting/step being followed.
    • QUIT THINKING AND LOOK
      • Are there any alarms or alerts on the source or destination system(s)? – Why I am asking is because I want to confirm not just the outcome of the failure that you mention but hopefully what is causing it, and what I’m hoping to achieve is to find the point/component where we should do the troubleshooting so that we don’t make any unnecessary changes.
      • For the errors, can we search the VMware/vendor KB and forums and see if any matches come up for some or all of them? – Why I am asking is because some of the errors might be known or might at least give us an idea of where to look, and what I’m hoping to achieve is to isolate the problem even more and not waste time looking at other components when a KB article might give us a good lead and save us precious time getting the issue fixed.
    • DIVIDE AND CONQUER
      • For the machines that are failing, are they the same configuration/going to the same location/coming from the same location/going over the same path? – Why I am asking is because I want to separate the good parts/side from the bad parts/side, and what I’m hoping to achieve is to focus my attention on the side that is showing the error so we don’t waste time and have fewer things to cover as we try to isolate the problem.
      • Can we try the step in the opposite direction? – Why I am asking is because……………and what I’m hoping to achieve…………
    • CHANGE ONE THING AT A TIME
      • Try a migration/alteration/fix and if it doesn’t work then change it back and try something new. “Please can we migrate the failing machines to another host?” “It still fails.” “OK, please move it back.” – Why I am asking is because I don’t want to receive additional/red herring errors due to the change we made, and what I’m hoping to achieve is to keep the environment unchanged as much as possible so we don’t cause additional errors or lose methods to troubleshoot.
    • KEEP AN AUDIT TRAIL (these were more writing out my thoughts and what I felt I needed to remember)
      • Write down what you did and the outcome and also WRITE DOWN THEIR RESPONSES as these may have clues!! “there are no errors in vSphere” might mean the error is not reaching vSphere for it to log the error so go “upstream” to find the source.
      • The error doesn’t sound like it is in vSphere so can we please look at the HBA on the host and ensure it is connected correctly and receiving data via ESXTOP.
    • CHECK THE PLUG
      • You state that the network connections are correct but please can we get it checked again? – Why I am asking is because I want to confirm that what we state is correct is in fact correct right now and what I’m hoping to achieve is to clear up any assumptions and have clear and confirmed facts about necessary “upstream” components.
      • Have the steps you are following worked in the past? Are we following the exact steps that worked before? – Why I am asking is because I want to confirm whether it has ever worked and whether we are following a different process, and what I’m hoping to achieve is to find out if a new step is causing the error to happen so we can troubleshoot what that different step is bringing up.
    • GET A FRESH VIEW
      • Not really applicable to VCDX troubleshooting, but asking for someone who is an SME at the customer might shed some new light on or clear up what the exact problem is.
    • IF YOU DIDN’T FIX IT, IT AIN’T FIXED
      • Not really applicable to VCDX troubleshooting.
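
To make the “change one thing at a time” and “keep an audit trail” headings a bit more concrete, here is a minimal Python sketch of how I picture the loop: apply one change, check whether it still fails, move it back if it does, and write down every step. This is only my illustration and not something from the book or the VCDX scenario; the names Change, AuditTrail and try_changes, the host names and the MTU values are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Change:
    """One candidate fix (hypothetical): how to apply it and how to undo it."""
    name: str
    apply: Callable[[], None]
    revert: Callable[[], None]


@dataclass
class AuditTrail:
    """Ordered record of what was done and what was observed."""
    entries: List[str] = field(default_factory=list)

    def note(self, text: str) -> None:
        self.entries.append(text)


def try_changes(changes: List[Change],
                still_fails: Callable[[], bool],
                trail: AuditTrail) -> Optional[str]:
    """Apply changes one at a time, reverting each one that does not help.

    Returns the name of the change that stopped the failure, or None.
    """
    for change in changes:
        change.apply()
        trail.note(f"applied: {change.name}")
        if not still_fails():
            trail.note(f"fixed by: {change.name}")
            return change.name
        # Put the environment back before trying anything else.
        change.revert()
        trail.note(f"reverted: {change.name} (still failing)")
    return None


# Made-up environment: the failure is really caused by the MTU setting.
env = {"host": "esx01", "mtu": 1500}

changes = [
    Change("move VM to esx02",
           lambda: env.update(host="esx02"),
           lambda: env.update(host="esx01")),
    Change("set MTU to 9000",
           lambda: env.update(mtu=9000),
           lambda: env.update(mtu=1500)),
]

trail = AuditTrail()
result = try_changes(changes, lambda: env["mtu"] != 9000, trail)
```

The point of reverting inside the loop is that the second change gets tested against the same environment as the first, so any new error can only come from the one thing that was changed, and the trail shows exactly what was tried and in what order.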

If you want to read about my utter joy about passing the VCDX then have a look at my VCDX #205 posting and also my VCDX Spotlight.

Next I’m hoping to dive deeper into each of the points from my VCDX #205 posting starting with VCDX Resources – Did you use them all??

Gregg
