TheSaffaGeek

My ramblings about all things technical

VCAP5-DCD Design Practice


As some people may know, I am currently preparing to retake my VCAP5-DCD. I have reached the point in my preparations where I am doing mock designs and working through the labs from the VMware Design Workshop. I thought I would follow the same idea: create a mock customer design scenario and put down the same kind of questions I am being asked in the design workshop labs. Hopefully, if people are interested, they can use it to write down their design choices, the justifications for those choices and the impacts those choices have on the rest of the design, and everyone will learn from this. Below is a company profile that I made up, also using some ideas from a scenario that Matt Mould, one of my Xtravirt colleagues, sent me a few months back:

Company Profile
•    Safe & Legit are a global trading company that specialises in ground defence equipment
•    13,000 physical servers across 9 sites.
o    6k  UK (3 sites)
o    2k  CN (3 sites)
o    5k  US (3 sites)
•    There are two level 4 DCs per country (for info on DC levels see http://en.wikipedia.org/wiki/Data_center)
•    DC’s are linked by an MPLS cloud from BT, Verizon, Colt and NTT (contracts end Q1 2015)
•    One DC per country is privately owned. Safe & Legit want to retain the real estate but make room to lease out sought-after level 4 private suites, providing a new revenue stream and hopefully making their own DCs cost neutral. They are therefore looking to virtualise as much of their physical estate as possible onto vSphere 5.0
•    The remaining DC’s are rented from BT, Verizon and NTT (contracts end Q1 2015) . The CFO has voiced his desire to cut the cost of these rentals and would ideally like to not have to renew the contracts if possible.
•    ERP is centralised in the UK
•    Each country has locally hosted Print, Domain, UC & Messaging
•    Collaboration is centralised, again in the UK
•    Typical/normal file sharing is not permitted, all ‘matter’ is recorded and audited in Safe & Legit’s collaboration system
•    With the exception of ERP, all systems must move to a shared or distributed model. This is following a series of natural disasters in the US and China, that could have been avoided by having a DR and BC plan in place.
•    All communication end points are encrypted, but new legislation is relaxing where encryption is required. This is achievable following an ERP upgrade that separates out sensitive and non-sensitive data.
•    There are up to 5,000 3rd party users who own a license to trade under Safe & Legit LLC. Licensees are dropping as the competition develop newer, faster and cheaper ways to deliver access to their trading systems, while Safe & Legit still require licensees to purchase expensive fixed private comms to receive their trading apps. Safe & Legit do not want these 3rd party users to be impacted at all during the migrations and want a near-zero RTO and RPO

•   The UK site has been chosen as the first site to be migrated. Because of Safe and Legit's work on ground defence equipment, they have not authorised a capacity planner collection, as they don't want their data to leave the premises. Instead they have calculated that, for each site to be virtualised, the environment must be able to meet the following values:

- The 6k physical servers in the UK comprise 2,000 Linux servers and 4,000 Windows servers

- On average each Windows server is provisioned with a 20GB boot disk (average used is 15GB) and a 50GB data disk (average used is 30GB)

- Each Linux server is configured with 60GB total storage (average used is 30GB)

- Safe and Legit expect 10 percent annual server growth over the next three years

- Safe and Legit have a long-standing vendor relationship with EMC and Cisco, and have requested the use of their equipment because of this relationship and their in-house knowledge of administering these vendors' products

- They have created the following two tables from internal analysis and monitoring:

CPU Resource Requirement

Metric | Amount
Avg # of CPUs per physical server | 4
Avg CPU MHz | 3,400 MHz
Avg normalised CPU MHz | 1,240
Avg CPU utilisation per physical system | 5% (170 MHz)
Avg peak utilisation per physical system | 8% (272 MHz)
Total CPU resources req for 1k VMs at peak | 272,000 MHz

RAM Resource Requirement

Metric | Amount
Avg amount of RAM per physical system | 4096MB
Avg memory utilisation | 30% (1228.8MB)
Avg peak memory utilisation | 80% (3276.8MB)
Total RAM required for 1k VMs at peak before memory sharing | 3,276,800MB
Anticipated memory sharing benefit when virtualised | 50%
Total RAM req for 1k VMs at peak with memory sharing | 1,638,400MB
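
For anyone who wants to play with the numbers, here is a rough Python sketch that scales the per-server figures above to the full 6,000 UK servers and then applies the 10 percent growth. It is only a starting point under some assumptions of mine: growth compounds annually, it applies equally to CPU, RAM and storage, and nothing is added for hypervisor overhead, HA capacity or headroom. The function and variable names are made up for the example, and the 50% memory sharing figure is an estimate, not a measurement.

```python
# Inputs copied from the Safe & Legit profile and tables.
WINDOWS_SERVERS = 4000
LINUX_SERVERS = 2000
PEAK_CPU_MHZ = 272            # avg peak CPU per physical system
PEAK_RAM_MB = 3276.8          # avg peak RAM per physical system
MEMORY_SHARING = 0.50         # anticipated sharing benefit (an estimate)
ANNUAL_GROWTH = 0.10
YEARS = 3

# Per-server disk figures in GB: (provisioned, used).
WINDOWS_DISK_GB = (20 + 50, 15 + 30)
LINUX_DISK_GB = (60, 30)


def uk_totals():
    """Peak totals for the 6,000 UK servers as they stand today."""
    servers = WINDOWS_SERVERS + LINUX_SERVERS
    ram_mb = servers * PEAK_RAM_MB
    return {
        "servers": servers,
        "peak_cpu_mhz": servers * PEAK_CPU_MHZ,
        "peak_ram_mb": ram_mb,
        "peak_ram_mb_shared": ram_mb * (1 - MEMORY_SHARING),
        "provisioned_gb": WINDOWS_SERVERS * WINDOWS_DISK_GB[0]
        + LINUX_SERVERS * LINUX_DISK_GB[0],
        "used_gb": WINDOWS_SERVERS * WINDOWS_DISK_GB[1]
        + LINUX_SERVERS * LINUX_DISK_GB[1],
    }


def with_growth(totals, rate=ANNUAL_GROWTH, years=YEARS):
    """Scale every total by compound growth (assumption: growth is uniform)."""
    factor = (1 + rate) ** years
    return {name: value * factor for name, value in totals.items()}


if __name__ == "__main__":
    today = uk_totals()
    future = with_growth(today)
    for name in today:
        print(f"{name:20} today={today[name]:>14,.0f}  year 3={future[name]:>14,.0f}")
```

With these inputs the UK estate works out to 1,632,000 MHz of peak CPU and 400,000GB of provisioned disk (240,000GB used) today, and roughly 7,986 servers after three years of compound growth. Whether you size against provisioned or used storage, and whether you trust the memory sharing figure at all, are design choices you would need to justify.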

Business Requirements

From workshops and SME meetings the following requirements were collected

Number | Requirement
R001 | Virtualise the existing 6000 UK servers as virtual machines, with no degradation in performance when compared to current physical workloads
R002 | To provide an infrastructure that can provide 99.7% availability or better
R003 | The overall anticipated cost of ownership should be reduced after deployment
R004 | Users to experience as close to zero performance impact as possible when migrating from the physical infrastructure to the virtual infrastructure
R005 | Design must maintain simplicity where possible to allow existing operations teams to manage the new environments
R006 | Granular access control rights must be implemented throughout the infrastructure to ensure the highest levels of security
R007 | Design should be resilient and provide the highest levels of availability where possible whilst keeping costs to a minimum
R008 | The design must incorporate DR and BC practices to ensure no data is lost
R009 | Management components must be secured with the highest level of security
R010 | Design must take into account VMware best practices for all components in the design as well as vendor best practices where applicable
R011 | Any others you think I have missed from the scenario
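
To make R002 a bit more concrete, here is a small Python sketch that turns an availability percentage into a yearly downtime budget. I am assuming the target is measured over a full calendar year of 8,760 hours with no planned maintenance excluded, which the scenario does not actually state, so treat it as something to clarify with the customer. The function name is made up for the example, and the 99.9% and 99.99% targets are only there for comparison.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, no excluded maintenance windows


def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Maximum downtime in hours that still meets the availability target."""
    return period_hours * (1 - availability_pct / 100)


if __name__ == "__main__":
    # 99.7 is R002; the other two targets are only for comparison.
    for target in (99.7, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% -> {hours:.2f} hours of downtime per year")
```

Under that assumption 99.7% allows roughly 26 hours of downtime a year, which is worth holding up against the near-zero RTO and RPO wanted for the 3rd party users.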

Additional Functional Requirements (From Storage Design posting)

-5K 3rd party users will need to be able to gain access into the environment without any impact during the migration and consolidation

-Rented DC’s kit needs to be fully migrated to the privately owned datacenter before Q1 2015 to ensure the contracts don’t need to be renewed

Constraints and Risks

You tell me in the comments 🙂

Constraints from Storage Design posting:

– Usage of EMC kit

– Usage of Cisco kit

– Usage of the privately owned DC’s physical infrastructure for the consolidation of all three UK DC’s.

Risks from Storage Design posting:

– The ability of ensuring near-zero downtime during the migration of workloads to the privately owned DC may be at risk due to budget constraints impacting the procurement of the required infrastructure to ensure zero downtime

Additional Questions (from Storage Design posting)

Something I feel is really important when doing real-world designs is thinking of as many questions as possible about a customer's requirements, so that you can ensure you have recorded them correctly and that they aren't vague. The additional questions and the answers to them are listed below:

Q: Is there any capability of utilising the existing storage in the privately owned UK DC?

A: Because the other UK DCs are being consolidated and migrated alongside the current workloads in the privately owned DC, and the existing SAN is now 3 years old, it is more cost effective to purchase a new SAN. Auto-tiered storage will also probably be needed to meet the customer's requirements, so a new SAN with these capabilities is needed

Q: Is there no way a minimal planned outage/downtime can be organised for the migration of the workloads due to the likely higher cost of equipment to ensure this near-zero downtime?

A: The customer would prefer to try to keep to near-zero downtime. It is agreed that, after the conceptual design of the storage and the remaining components of the whole design, further meetings can be held to discuss a balance between cost and the desire for near-zero downtime

Q: With the leasing out of the private level 4 suites in the future, will there be a requirement to manage/host other companies' processes and data within the infrastructure being designed?

A: No, there is currently no plan to do this due to security concerns and the number of compliance regulations Safe and Legit need to maintain and fulfil. There is however a possibility of internal consumption, with other departments being charged for usage of the DC's resources.

Summary

So that is the company profile and my idea around it. I obviously created 90% of the above from my head, so there will be additional questions around it, but I think this gives a really solid amount of information for people to start thinking. I'm going to do the first posting around Storage Design for Safe and Legit quite soon and will put up the questions and components you normally have to think about. If people want to think about what they would choose beforehand, then hopefully we can get a good discussion going around it.

As I add each section to the design I am hoping to keep updating this posting and then once complete making it all linked on a single page on my blog

Gregg

14 thoughts on “VCAP5-DCD Design Practice”

  1. Pingback: Safe and Legit Storage Design « TheSaffaGeek

  2. Reblogged this on Jonathan Frappier's Blog.

  3. Gregg,

    Excellent start on this design, and a very ambitious design to say the least (at least relative to the environments I’m used to, which haven’t been many). You’ve done a good job at laying the framework of the design as well as listing out the requirements. To make it better I would suggest adding assumptions, constraints and risks.

    One thing I find interesting in the scenario is the near-zero downtime for their customers, specifically when you start talking about the other DCs and the possibility of converging them into one central site.

    Overall I think this is a great scenario to start with, and it’s plenty large (that’s what she said). One thing I struggled with during my DCD prep was actually doing mock designs in the absence of a design scenario. Again, great job Gregg and I look forward to follow-on posts.

    One last thing, the wikipedia DC link actually links to your Storage Design for this scenario instead of the wikipedia site.

    -Josh

    • Thanks Josh 🙂

      Yeah, the reason I didn’t add the assumptions, constraints and risks is that I wanted people to look at the design scenario and try to find some themselves. I think this is a very important thing to do: identifying requirements is easy (IMO), but defining the constraints and risks and trying to limit the assumptions is extremely important

      Yep, the near-zero down time is something that is a challenge but is also very realistic to a lot of real world customer requests and I find it’s your job as the architect to weigh up all the options and then go back to them and show that if they want the near-zero down time then they will probably have to purchase more expensive kit to enable this and then it’s their choice against their budget if it is possible and what is acceptable compared to costs.

      It is the one thing I really wanted to try to force myself to do in preparation for the DCD, my future VCDX design and defence, and my day job as a Virtualisation consultant

      Fixed, thanks for pointing that out

      Gregg

  4. Gregg,

    I definitely agree with you. I also think this will be very good practice for you during your VCDX design/defense; something I need to start on. One idea, although I’m not sure how much participation you will get, is to do a scenario like this as a contest and get 2-3 people together to judge it. Best design wins a prize? The better the prize, the more participation you’ll get. Anyways, just a thought.

    -Josh

  5. Pingback: Safe and Legit Storage Design Completed « TheSaffaGeek

  6. Hi Gregg
    Ambitious indeed – Should be a good one to follow. I am immediately getting a headache over your SSO and SRM config!
    Wondered, how did you get to 50% for anticipated memory sharing ?
    Simon

    • Hi

      Yep super ambitious. Well this is based on vSphere 5.0 not 5.1 so should be fine for SSO. SRM isn’t part of the sections I was planning to do but I might slot it into the management section

      Lol, there was no real math to it tbh, I just stated it due to thinking there would be a large amount of servers with the same OS so TPS would be able to claw back a large amount. Normally I would never believe I would get near 50% memory sharing.

      Gregg

  7. Hi Gregg,

    I see you set the anticipated memory sharing benefit when virtualised to 50%. I had this one in many of my designs too, and I got asked recently why I put it there since I barely ever used that value in any of my designs.
    The value is basically informational. This is not a hard value and depends on a lot of factors such as memory contention level, mix of guest OSes, type of MMU, vSphere version and configuration settings. This value becomes quite important when you plan on doing some serious memory oversubscription. That’s when TPS kicks in and this value becomes relevant in a design.

    Regarding your requirements R001 and R004: they look the same from my PoV. Basically you will design with no resource oversubscription, right? To help you with that requirement you have collected some measurements for CPU and memory, but nothing regarding storage and networking. It is critical that you have those measurements included in your design, otherwise you cannot address R001 and other parts of your design such as LUN/datastore sizing and network bandwidth requirements.

    You should set design qualities for each item you put in the requirements, constraints, assumptions and risks lists. For instance I’m using the following design qualities: Availability, Recoverability, Performance, Manageability and Security. You can add your own additional qualities such as Cost, Connectivity, etc.

    Also when you build up those lists through interviews, be sure that you’ve addressed the common parts of a design; that is compute, storage and network. And for completeness, you should also address the host layer, vCenter and the management layer, virtual machine/vApp layer, monitoring layer, operation layer, security layer, backup layer, etc.

    The ability to ensure near-zero downtime during the migration of workloads is a killing point from a cost and design perspective. You set it as a risk, but are there any business requirements associated with it? Anyone who has done some P2V migrations knows that there is downtime associated with it: downtime before you do the P2V, and even longer downtime after the P2V job just for the clean-up process and the multiple reboots. If near-zero downtime is required, look at an alternative to just plain P2V.

    I’m looking forward to reading your next posts 😉

    Cheers,
    Didier

  8. Pingback: VMware VCAP5-DCD exam experience | vKnowledge.net

  9. Pingback: VCAP-CID Objective 1.2 – Identify and Categorize Business Requirements | TheSaffaGeek

  10. Pingback: Mibol es hogyan keszulj VCAP5-DCD-re | vThing

  11. Pingback: My VCAP5-DCD exam experience - Default Reasoning

  12. Pingback: VMware VCAP5-DCD Design Practice – VEPSUN TECHNOLOGIES
