
Safe and Legit Storage Design Completed


Below are my thoughts, the additional questions I felt needed to be asked or clarified, and the design decisions, justifications and impacts resulting from those decisions for the Safe and Legit storage design. If you missed the post where I detailed the mock scenario, you can read it here.

 

Note: This is a learning exercise for me, so if you feel I’ve missed something or made a wrong decision, please write it in the comments. I’m more than happy to discuss it (that was one of the main reasons I’m doing this series of postings), and I’ll amend the design accordingly if it makes sense. Hopefully I, along with other people reading these postings, will learn from it and become better.

 

Additional Questions

As I said, there probably would be additional questions. Something I feel is really important when doing real-world designs is trying to think of as many questions around a customer’s requirements as possible, so that you can ensure the requirements are recorded correctly and aren’t vague. The additional questions and the answers to them are listed below:

 

Q: Is there any capability of utilising the existing storage in the privately owned UK DC?

 

A: Due to the consolidation and migration of the other UK DCs and the current workloads in the privately owned DC, a new SAN is the better option: the existing SAN is now three years old, so it is more cost effective to purchase a new one. In addition, the probable need for auto-tiered storage to meet the customer’s requirements means a new SAN with these capabilities is needed.

 

Q: Given the likely higher cost of the equipment needed to ensure near-zero downtime, is there no way a minimal planned outage can be organised for the migration of the workloads?

 

A: The customer would prefer to keep to the near-zero downtime goal. It is agreed that, after the conceptual design of the storage and the remaining components is complete, further meetings can be held to discuss the balance between cost and the desire for near-zero downtime.

 

Q: With the leasing out of the private level 4 suites in the future, will there be a requirement to manage or host other companies’ processes and data within the infrastructure being designed?

 

A: No, there is currently no plan to do this due to security concerns and the number of compliance regulations Safe and Legit need to maintain and fulfil. There is, however, a possibility of internal consumption, with other departments being charged for their usage of the DC’s resources.

 

Q: What other questions do you feel should be asked?

Additional Functional Requirements

– 5,000 third-party users will need to be able to access the environment without any impact during the migration and consolidation

– The rented DCs’ kit needs to be fully migrated to the privately owned datacenter before Q1 2015 so that the contracts don’t need to be renewed

Constraints

Below are the constraints I felt were detailed in the scenario. These will possibly change as I go further through all the other sections but so far these are the ones I felt were applicable:

– Usage of EMC kit

– Usage of Cisco kit

– Usage of the privately owned DC’s physical infrastructure for the consolidation of all three UK DC’s.

Assumptions

Below are the assumptions I felt had to be made. These will possibly change as I go further through all the other sections. Normally I try to keep assumptions to a minimum, but for a project of this size it would be extremely difficult not to have any, as you do have to trust that certain things are in place:

– There is sufficient bandwidth between the UK DCs to allow migration of the existing workloads with as little impact on the workloads as possible

– All required upstream dependencies will be present during the implementation phase.

– There is sufficient bandwidth into and out of the privately owned DC to support the bandwidth requirements of all three DC’s workloads

– All VLANs and subnets required will be configured before implementation.

– Storage will be provisioned and presented to the VMware ESXi hosts accordingly.

– Power and cooling in the privately owned DC can accommodate the addition of the physical infrastructure required for the virtual infrastructure while, for a period of time, the older physical machines are still running alongside it

– Safe and Legit have the existing internal skillset to support the physical and virtual infrastructure being deployed.

– There are adequate licences for the operating systems and applications required for the build

 

Risks

– The ability to ensure near-zero downtime during the migration of workloads to the privately owned DC may be at risk if budget constraints impact the procurement of the infrastructure required to achieve it

Storage Array

Design Choice: EMC FC SAN with two 8Gb/s storage processors

Justification:
– EMC is used due to the constraint of having to use EMC storage because of previous usage
– EMC VNX 5700 with auto-tiering enabled
– 8Gb/s ensures high transmission speeds to the storage; 16Gb/s is more than this design requires and more expensive

Design Impacts:
– Switches will need to be capable of 8Gb/s connectivity
– FC cabling needs to be capable of transmitting at 8Gb/s
– HBAs on the ESXi hosts need to be capable of 8Gb/s

Number of LUNs and LUN sizes

Design Choice: 400 x 1TB LUNs will be used

Justification:
– Each VM will be provisioned with an average of 50GB of disk
– With around 15 VMs per LUN plus 20% headroom for swap files and snapshots, (15 x 50GB) / 0.8 = 937.5GB, which rounds up to a 1TB LUN (see the sizing sketch below)
– 6,000 total VMs / 15 VMs per LUN = 400 LUNs

Design Impacts:
– Tiered storage with auto-tiering enabled will be used to balance storage costs against VM performance requirements
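
A quick aside, not part of the original scenario: below is a minimal Python sketch of the sizing arithmetic above. The per-VM disk size, headroom, consolidation ratio and VM count are the figures from this decision, so you can swap in your own assessment numbers.

```python
import math

# Figures taken from the design decision above
avg_vm_disk_gb = 50   # average provisioned disk per VM
vms_per_lun = 15      # target consolidation ratio per LUN
headroom = 0.20       # extra space reserved for swap files and snapshots
total_vms = 6000      # total VMs after consolidation

# Usable space needed per LUN, then rounded up to whole terabytes
lun_size_gb = (vms_per_lun * avg_vm_disk_gb) / (1 - headroom)
lun_size_tb = math.ceil(lun_size_gb / 1024)

# Number of LUNs needed for the full estate
lun_count = math.ceil(total_vms / vms_per_lun)

print(f"Space needed per LUN: {lun_size_gb:.1f}GB -> {lun_size_tb}TB LUNs")
print(f"LUNs required: {lun_count}")
```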
   

Storage load balancing and availability

Design Choice: The EMC PowerPath/VE multipathing plug-in (MPP) will be used

Justification:
– EMC PowerPath/VE leverages the vSphere Pluggable Storage Architecture (PSA), providing performance and load-balancing benefits over the VMware Native Multipathing Plug-in (NMP)

Design Impacts:
– Requires additional cost for PowerPath licenses

VMware vSphere VMFS or RDM

Design Choice: VMFS will be used as the standard unless there is a specific need for a raw device mapping, which will be assessed on a case-by-case basis

Justification:
– VMFS is a clustered file system specifically engineered for storing virtual machines

Design Impacts:
– The VMware vSphere Client must be used to create the datastores to ensure correct disk alignment

Host Zoning

Design Choice: Single-initiator zoning will be used. Each host will have two paths to the storage ports across separate fabrics

Justification:
– This keeps to EMC best practices and ensures there is no single point of failure, with multiple paths to targets across multiple fabrics

Design Impacts:
– Zones will need to be created for each initiator port by the storage team (illustrated in the sketch below)
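
To make the single-initiator zoning concrete, here is a small Python sketch that builds one zone per host HBA port, containing that initiator plus the array target ports on its fabric. All WWPNs and names are made up purely for illustration; in practice the storage team would define the zones on the Cisco fabric switches.

```python
# Illustration of single-initiator zoning: each zone contains exactly one
# host initiator (HBA port) plus the array target ports on that fabric.
# All WWPNs and names below are invented for illustration only.

ARRAY_TARGETS = {
    "fabric-a": ["50:06:01:60:aa:aa:aa:a0", "50:06:01:68:aa:aa:aa:a1"],  # SP ports on fabric A
    "fabric-b": ["50:06:01:61:bb:bb:bb:b0", "50:06:01:69:bb:bb:bb:b1"],  # SP ports on fabric B
}

def build_zones(hosts):
    """hosts maps hostname -> {fabric: initiator WWPN}; returns one zone per initiator."""
    zones = []
    for hostname, hbas in hosts.items():
        for fabric, initiator in hbas.items():
            zones.append({
                "name": f"z_{hostname}_{fabric}",
                "fabric": fabric,
                "members": [initiator] + ARRAY_TARGETS[fabric],
            })
    return zones

example_hosts = {
    "esxi01": {"fabric-a": "21:00:00:24:ff:00:00:01", "fabric-b": "21:00:00:24:ff:00:00:02"},
    "esxi02": {"fabric-a": "21:00:00:24:ff:00:00:03", "fabric-b": "21:00:00:24:ff:00:00:04"},
}

for zone in build_zones(example_hosts):
    print(zone["name"], "->", ", ".join(zone["members"]))
```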
   

LUN Presentation

Design Choice: LUNs will be masked consistently across all hosts in a cluster

Justification:
– This allows virtual machines to be run on any host in the cluster and ensures both HA and DRS optimisation

Design Impacts:
– The storage team will need to control and deploy this due to the masking being done on the storage array

Thick or Thin disks

Design Choice: Thin provisioning will be used as the standard unless there is a specific need for thick-provisioned disks, which will be assessed on a case-by-case basis

Justification:
– The rate of change for a system volume is low, while data volumes tend to have a variable rate of change

Design Impacts:
– Alarms will need to be configured so that, before datastores reach an out-of-space condition, there is ample time to provision more storage (see the monitoring sketch below)
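
Purely as an illustration of that kind of monitoring (not part of the design itself), here is a minimal pyVmomi sketch that reports any datastore whose free space has dropped below a threshold. The vCenter hostname, credentials and the 20% threshold are placeholder assumptions; in practice the built-in vCenter datastore usage alarms would normally cover this.

```python
# Minimal sketch: list datastores below a free-space threshold using pyVmomi.
# Hostname, credentials and the 20% threshold are placeholders for illustration.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

THRESHOLD = 0.20  # warn when free space falls below 20%

ctx = ssl._create_unverified_context()  # lab only; use proper certificates in production
si = SmartConnect(host="vcenter.example.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.Datastore], True)
    for ds in view.view:
        capacity = ds.summary.capacity
        free = ds.summary.freeSpace
        if capacity and free / capacity < THRESHOLD:
            print(f"{ds.name}: {free / capacity:.0%} free "
                  f"({free / 1024**3:.0f}GB of {capacity / 1024**3:.0f}GB)")
finally:
    Disconnect(si)
```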
   

Virtual Machine I/O Priority

Design Choice: Storage I/O Control will not be used

Justification:
– The storage utilises Auto-Tiering/FAST, which balances at the block level and is therefore a better way of balancing I/O
– VMware SRM is likely to be used, and SDRS and SIOC are not supported in combination with SRM

Design Impacts:
– FAST/Auto-Tiering will need to be configured correctly by the storage vendor

Storage Profiles

Design Choice: Storage Profiles will not be configured

Justification:
– Storage will be managed by the storage team

Design Impacts:
– The storage team will need to configure storage as the virtual infrastructure requires

Describe and diagram the logical design

Storage Type: Fibre Channel
Number of Storage Processors: 2 (to ensure redundancy)
Number of Fibre Channel Switches: 2 (to ensure redundancy)
Number of ports per host per switch: 1
Total number of LUNs: 400 (as mentioned above)
LUN Sizes: 1TB (as mentioned above)
VMFS datastores per LUN: 1

[Logical storage design diagram]
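
One thing worth checking against the figures above, and which a couple of the commenters below also pick up on, is the vSphere 5.x per-host storage maximums of 256 LUNs and 1,024 paths. The short Python sketch below runs that check; the assumption that each host port sees two array target ports on its fabric is mine, for illustration only.

```python
# Rough path-count sanity check against the logical design figures above.
# The number of array target ports visible to each host port (2) is an assumption.
luns = 400              # total LUNs (as per the design)
host_ports = 2          # 1 port per host per switch x 2 switches
targets_per_port = 2    # assumed array ports visible per fabric

paths_per_lun = host_ports * targets_per_port
paths_per_host = luns * paths_per_lun

MAX_LUNS_PER_HOST = 256    # vSphere 5.x configuration maximum
MAX_PATHS_PER_HOST = 1024  # vSphere 5.x configuration maximum

print(f"Paths per LUN: {paths_per_lun}, total paths per host: {paths_per_host}")
print(f"LUN maximum exceeded: {luns > MAX_LUNS_PER_HOST}")
print(f"Path maximum exceeded: {paths_per_host > MAX_PATHS_PER_HOST}")
```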

Describe and diagram the physical design

Array vendor and model: EMC VNX 5700
Type of array: Active-Active
VMware ESXi host multipathing policy: PowerPath/VE MPP
Min/Max speed rating of storage switch ports: 2Gb/s / 8Gb/s

I’m looking for the correct EMC diagrams to create the physical design diagram, so I will update this posting later this week with the diagram, I promise 🙂

Well, that’s my attempt at the storage design portion of Safe and Legit. Hopefully people will agree with most of the decisions I’ve made, if not all of them. I have to admit it took me most of my Sunday just to do this piece and think through all the impacts, and as stated there may be additional constraints and risks further down the line.

 

Gregg


13 thoughts on “Safe and Legit Storage Design Completed”

  1. Gregg,
    this exercise is indeed a good way to train.

    My comments:
    – I think the main design choice is missing: how will you explain that you chose FC over NFS?
    – What about the RAID level of the datastores?
    – Your total space is 400TB:
    > on the EMC VNX 5700 you can only put 500 disks max (I don’t know this storage, I just read the doc)
    > midrange (10k) disks are 900GB max (around 830GB formatted)
    > 830GB * 500 disks: around 415TB raw, and you need to add spare disks and RAID parity, no?
    > so you need more nearline disks (with higher capacity)

    This leads to my next remark:
    – You don’t have latency/IOPS requirements? Maybe they are missing from the VM assessment?

    NB: this is my point of view, and I understand that an artificial design is really difficult! 😉

    • Hi Romain

      Thanks for the comment =0)

      – Good point, I didn’t actually stipulate why I chose FC. I’ll look to add that, although with your other points I may have to change the model, so I might go down the NFS route. To cover it quickly, I chose FC because of the server hardware I had in mind, but it is a good example of how a design has to be approached holistically: if I only design with one portion in mind (so far storage), it will impact the other portions.
      – Hmmm, I left this out on purpose, as I didn’t stipulate IOPS/latency requirements, as you rightly pointed out. I might try to create these and then amend the design appropriately.
      – Yep, I may need to amend it, as I didn’t do the research on the maximum number of drives, to be honest.

      Thanks for the point of view, it’s the kind of challenge I needed 😀

      Gregg

  2. Hi Gregg, great series of articles, keep them coming!

    I don’t know how many clusters you will create in your design, but with so many LUNs I would add a constraint regarding the vSphere 5.1 per-host storage maximums of 1,024 paths and 256 LUNs.

    I’ll bite my tongue if you have already mentioned that constraint 🙂

    • Hi Didier

      Yep, very good point. For the host design portion I had it in mind to keep the number of hosts per cluster reasonable, so without having designed it yet, off the top of my head I’m likely to keep to around 12 hosts per cluster, but that may change.

      Thanks for the feedback 🙂

      Gregg

  3. Pingback: VCAP5-DCD Design Practice | TheSaffaGeek

  4. Hi Gregg,

    The EMC VNX 5700 is not an active-active array. It is an active-passive array.

  5. Hi Gregg, excellent post.
    This is a very interesting way to discuss some design ideas.

    since we are here to discuss, here are my opinions:

    First: I think this choice of 400 LUNs should be reviewed because of the limitation of 256 LUNs in vSphere 5.0, as Didier Pironet already said. One thing that can help reduce the number of LUNs needed is the disk type used in the storage (SSD, SAS, SATA), and perhaps all three will be used, as well as the RAID type, which can help achieve a greater VMs-per-datastore consolidation ratio. I think 15 VMs per LUN is a bit dated; today you can get better ratios. It is also important to think about the future: if you use all 256 LUNs at once, you will have nowhere to grow.

    Second: in an environment with so many LUNs, and consequently many datastores, I would not use thin disks, especially as you haven’t yet mentioned what management tools you will use. Without a good tool it is very difficult to monitor the use of thin disks across so many datastores.

    Thanks for this open discussion. I’m sure that all readers will learn a lot from this post series.

    • Hi Tiago

      Thanks, hopefully it helps everyone, including me, to learn more.

      – Yep, there is the limitation Didier mentioned, and as I stated I did have a certain number of hosts per cluster in mind for the host design, but looking at the type of disk is one way to get more VMs onto the storage. It was a hard one, as I never defined the IOPS/latency requirements, which I might try to define so that I can do the disk design better.
      – The 15 VMs per LUN average was just down to space requirements compared to the amount of storage, and if thin provisioning isn’t used then the storage will be filled up massively, which from experience then brings in the constraint of cost. Monitoring thin provisioning isn’t too difficult, and true, I didn’t stipulate the exact alarms I would create, but setting alarms is a very standard practice when provisioning storage thinly.

      No worries 🙂 thanks for the thoughts and keep them coming

      Gregg

  6. Hiya Gregg, it didn’t seem like you took the 10% annual server growth into consideration.

  7. Hey Greg,

    First off thanks for all the info you have here compiled for the DCD, it’s really been a tremendous resource. Here are a few of my thoughts in addition to what everyone else has posted. It ended up being a lot longer than I thought, but I’m attempting the exam tomorrow so a lot of stuff is already flowing through my brain. Also, this was really a great exercise for me to go through and start putting pieces together, so thanks again for that!

    Storage Array:
    – What kind of array do they currently have, specifically Clariion or Symmetrix? This would be a major factor because there are restrictions on LUN migrations between systems that you may run into. I know Clariion->VNX works and DMX->VMAX works, but going DMX->VNX or Clariion->VMAX I’m pretty sure have restrictions.
    – I also believe some of the tools and commands are different between the different array families, so factoring this into the design to overlap with the existing knowledge of the storage staff would be important. Admittedly there will be new things to learn anyways because of the new generation of SAN, but the curve may be significantly different.

    Additional information needed on servers for HBA decisions:
    – The constraints mention EMC and Cisco as the brands, but considering it doesn’t sound like any of their environment is currently virtualized and the primary use case for UCS has been virtualization, I doubt they would actually have UCS servers (I also doubt they have 13,000 physical UCS blades), so this should be added to the questions for the customer, and getting info on rack vs blade would be important.
    – Based on the above I’ll make the assumption that they currently don’t have UCS servers but want to make the migration towards them as part of the scope of this project because no other server vendor was mentioned from what I saw. Determining if they want to use blade or rack mount servers is important because one will pose additional constraints on the environment (blades) whereas the other one won’t (rack mount). From a hardware standpoint the rack servers would let you use any PCIe device you want, so you don’t have to worry about them. The blade servers though need to factor in that whether you like it or not, you’re a proud new adopter of FCoE. If Microsoft clustering is used that requires the VMs to access an RDM, VMware ONLY supports FCoE when using the Cisco VIC cards going to a Cisco UCS chassis from what I’ve read, so you’ll have to add this to your list of constraints.
    – Based on the above, if the customer decides to go with Cisco UCS blade servers this does introduce more options for the storage to consider, specifically between FC and FCoE connectivity to the new SAN. I know the latest FlexPod reference architecture with NetApp recommends using FCoE adapters out of a NetApp array, so looking into this on EMC would be something to research.

    Number of LUNs and LUN sizes:
    – I think this is something that should have more consideration behind it, especially when it comes to whether or not different applications will have different performance requirements. While storage tiering is great, there will still be times when cold data needs to be accessed, and factoring this into what kinds of disks/RAID groups sit behind the flash tier would still be important. The VNX5700 supports up to 1500GB of FAST Cache, which would give you a 400:1.5 ratio of SSD vs HDD AND eat up 30 of the available 500 slots (https://community.emc.com/docs/DOC-11313). Based on this, going with a bigger SAN and/or multiple SANs may be something to consider.

    VMware vSphere VMFS or RDM:
    * The below comments are more related to the migration process of P2Ving the servers, and not really part of the final picture.
    – Up until recently one thing to consider here would be whether or not RDMs were needed as part of the migration process because of limitations in VMware Converter with regards to GPT disks, where I needed to actually attach a LUN to a VM as a vRDM and then do a storage vMotion on the vRDM to convert it into a VMDK.
    – I’ve only done this with the one VM before so don’t quote me on risks and scale, but this method could possibly be factored in during the migration if converting the LUNs attached to the physical servers would lead to an unacceptable maintenance window. P2V’ing just the OS volumes and then attaching the data volumes as vRDMs to do the migration may be an option.

    Virtual Machine I/O Priority:
    – Based on the desire to move to a shared/distributed model, many of the applications may not need SRM, and of those that do you’d want to factor the limitation of being unable to use SIOC/SDRS into the LUN design, but I don’t expect FAST to address all of the issues, because it won’t have the per-VM insight available that SIOC has. I mostly just imagine this will be a much more complicated decision and that more questions about the application requirements and behaviors should be asked.

  8. Pingback: Mibol es hogyan keszulj VCAP5-DCD-re | vThing
