My ramblings about all things technical


vSphere 5.1 Announced with Enhanced vSphere Replication

vSphere Replication

vSphere Replication (VR) is the industry’s first and only genuinely hypervisor-level replication engine.

It is a feature first introduced with Site Recovery Manager 5.0 to allow for the vSphere platform to protect virtual machines natively by copying their disk files to another location where they are ready to be recovered.

VR is a software based replication engine that works at the host level rather than the array level.

Identical hardware is not required between sites, and in fact customers can run their VMs on any type of storage they choose at their site – even local storage on the vSphere hosts, and VR will still work.

It provides simple and cost-efficient replication of applications to a failover site.

VR is a component delivered with vSphere editions of Essentials Plus and above, and also comes bundled with Site Recovery Manager. This offers protection and simple recoverability to the vast majority of VMware customers without extra cost.

•With VR, a virtual machine is replicated by components of the hypervisor, removing any dependency on the underlying storage, and without the need for storage-level replication.

•VMs can be replicated between *any* type of storage platform: replicate between VMFS and NFS, or from iSCSI to local disk. Because VR works above the storage layer it can replicate independently of the file systems. (It will not, however, work with physical RDMs.)

•Replication is controlled as a property of the VM itself and its VMDKs, eliminating the need to configure storage any differently or to impose constraints on storage layout or management. If the VM is changed or migrated then the policy for replication will follow the VM.

•VR creates a “shadow VM” at the recovery side, then populates the VM’s data through replication of changed data.

•While VR can be deployed through the “thick client,” all management and interaction with VR is done strictly through the vCenter 5.1 web interface.

•Only vSphere 5.0 and 5.1 will work for vSphere Replication as the VR Agent is a component of the vSphere 5.x hypervisor.

•vSphere Replication can not co-exist with the vSphere Replication pieces originally shipped with SRM 5.0. If an existing SRM 5.0 vSphere Replication environment is in place it will need to be uninstalled and replaced with the standalone vSphere Replication from vSphere 5.1.

•While both Storage DRS and Storage vMotion are supported, there are certain scenarios to be aware of:

•While Storage vMotion of a VR protected VM can be done by an administrator, on vSphere 5.0 this may create a “full sync” scenario in which a VM must be completely resynchronized between source and destination, possibly violating the configured recovery point objective for that VM.

•Storage DRS compounds this problem by automating storage vMotion, and thereby may potentially cause the protected virtual machines to create continual full sync scenarios, driving up I/O on the storage, thereby creating cyclical storage DRS events. Because of this it is unsupported with 5.0.

•Storage vMotion and Storage DRS can be run only against the *protected* VM and can not execute against the *replica* of the VM.

•When using vSphere Replication with Site Recovery Manager, storage vMotion and storage DRS are *not supported*.

•Neither of these scenarios is true with vSphere 5.1 as the persistent state file that contains current replication data is migrated along with the rest of the VM, which did not occur in vSphere 5.0.

vSphere Replication is not “new” as it has more than a year-long track record of success with Site Recovery Manager.

VR is a non-disruptive technology: It does not use vSphere file-system snapshots nor impact the execution of the VM in any abnormal way.

Since VR tracks changes at a sub-VM level, but above the file system, it is completely transparent to the VM unless Microsoft Volume Shadow Copy Service (VSS) is being used to make the VM quiescent. Even then VR uses fully standard VSS calls to the Microsoft operating system.

Virtual machines can be replicated irrespective of underlying storage type:

• Can use local disk, SAN, NFS, and VSA

• Enables replication between heterogeneous datastores

• Replication is managed as a property of a virtual machine

• Efficient replication minimizes impact on VM workloads

vSphere Replication Use Cases

Protecting VMs within a site, between sites, or to and from remote and branch offices.

Can use dissimilar storage, low cost NAS Appliances, even independent vSphere hosts with only local disk.

VR Deployment

VR is deployed as a standard virtual appliance in OVF format.

The OVF contains all the necessary components for VR.

•What used to be both the “VRMS and VRS” in the SRM 5.0 implementation of VR are included in the “VR Appliance” now

•This allows a single appliance to act in both a VR management capacity and as the recipient of changed blocks

•Scaling sites is an easy task: simply deploy another VR Appliance at the target site, and it will contain the necessary pieces to either pair and manage replication for a site or simply receive changed blocks as the VRS did.

vSphere Replication Limitations

vSphere Replication is targeted at replicating the virtual disks of powered-on virtual machines only. It is based on a disk filter that tracks changes as they pass through it; static images therefore can not be tracked.

Powered-off or suspended VMs will not be replicated. The assumption is that if the VM is important enough for protection, it is powered on.

That also means non-disk files attached to a VM (ISOs, floppy images, etc.) are not replicated. Also, any disks, ISOs, or configuration files not associated with a VM will not be replicated.

Files that are not required for the VM to restart (e.g., vswp files or log files) are likewise not replicated by VR.

Since VR works above the disk itself at the virtual device layer, it can be completely independent of the specifics of the VMDK it is replicating. VR can replicate to a different format than its primary disk: for example, you can replicate a thick provisioned disk to a thin provisioned replica.

VM snapshots in and of themselves are not replicated but instead are collapsed during replication. A VM with snapshots may be configured for protection by VR (and you can take and revert snapshots), but the remote state for such VMs will be “flat” without any snapshots. Snapshots are aggregated into a single VMDK at the recovery location.

Note: Reverting from a snapshot may cause a full sync!

VMs can be replicated with a recovery point objective (RPO) of at least 15 minutes and at most 24 hours. This means that a recovery of replicated VMs can lose up to the configured RPO's worth of recent data, 15 minutes at the tightest setting.
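The configurable RPO window can be expressed as a minimal sketch. The 15-minute and 24-hour bounds come from the text above; the function name and structure are hypothetical, purely for illustration:

```python
# Minimal sketch of VR's configurable RPO window (15 minutes to 24 hours).
# The bounds come from the text; validate_rpo() itself is illustrative.

MIN_RPO_MINUTES = 15
MAX_RPO_MINUTES = 24 * 60  # 1440 minutes

def validate_rpo(minutes):
    """Return the RPO if it falls within VR's configurable window, else raise."""
    if not MIN_RPO_MINUTES <= minutes <= MAX_RPO_MINUTES:
        raise ValueError(
            f"RPO must be between {MIN_RPO_MINUTES} minutes and 24 hours, "
            f"got {minutes}"
        )
    return minutes
```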

How it works

Fundamentally VR is a handful of virtual appliances that allow the vSphere kernel to identify and replicate changed blocks between sites. Configuration and deployment take only a few simple steps.

Once the administrator has deployed the components it is a matter of pairing a source and destination.

Lastly, configuration of an individual VM for protection tells VR to start replicating its changes, and where to put them at the recovery location.

Only replicates changed blocks

On an ongoing basis, after the first sync, VR will only ship changed blocks.

Within the RPO defined by the administrator, VR tracks which blocks are being dirtied and will create a “lightweight delta” (LWD) bundle of data to be transferred to the remote site.

Pointers to changed blocks are kept both in a memory bitmap and in a “persistent state file” (PSF) located in the directory of the VM. The memory contents are always current, while the PSF represents the currently shipping LWD. After an LWD is shipped and completely acknowledged, the memory bitmap is copied to the PSF file and the memory bitmap is restarted for the next LWD.
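The bitmap-and-PSF rotation described above can be sketched conceptually. This is not VMware's code; the class and method names are hypothetical, and Python sets stand in for the on-disk bitmap format:

```python
class DirtyBlockTracker:
    """Illustrative sketch (not VMware's implementation) of the two-level
    change tracking described above: an always-current in-memory bitmap of
    dirty blocks, plus a persistent state file (PSF) representing the
    lightweight delta (LWD) currently being shipped."""

    def __init__(self):
        self.memory_bitmap = set()   # dirty blocks since the last rotation
        self.psf = frozenset()       # blocks belonging to the current shipping LWD

    def record_write(self, block_id):
        # Every guest write marks its block; the memory contents stay current.
        self.memory_bitmap.add(block_id)

    def rotate_on_ack(self):
        """Once the current LWD is shipped and fully acknowledged, the memory
        bitmap is copied to the PSF and restarted for the next LWD."""
        self.psf = frozenset(self.memory_bitmap)
        self.memory_bitmap = set()
        return self.psf   # the block set forming the next LWD

# Example: three distinct blocks are dirtied, then the LWD rotates.
tracker = DirtyBlockTracker()
for block in (4, 7, 4, 9):
    tracker.record_write(block)
print(sorted(tracker.rotate_on_ack()))   # → [4, 7, 9]
```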

VR will use the defined RPO to determine how often to create a LWD. Time must be allowed to create the block bundle, transfer it, and successfully complete writing the entire bundle to ensure that the RPO is not violated. In order to do this, VR will track the length of the previous 15 transfers to create an estimate of how long it will take to complete the transaction of the subsequent LWD.

For example, if a transfer takes 1 minute to create, 8 minutes to transfer, and 1 minute to write, by the time the data is successfully written it is already 10 minutes old. With, for example, a 1 hour RPO set for a VM, the next transfer would need to begin within the next 40 minutes: 10-minute-old data plus the next 10-minute transfer consumes 20 minutes out of the 1 hour RPO, ensuring the data at the recovery site is never older than the RPO defined.

If a transfer of a LWD takes more than half the time of the RPO it is very likely that the RPO will be violated based on the incremental “catch up” to the RPO period and it will be flagged as a potential RPO violation.
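The scheduling arithmetic above can be illustrated with a short sketch. This is not VMware's scheduler: the class is hypothetical, and the simple mean over the history is an assumption, since the real estimator is not documented. Only the 15-transfer history and the half-RPO violation rule come from the text:

```python
from collections import deque

class TransferScheduler:
    """Illustrative sketch of the RPO-driven scheduling described above:
    estimate the next LWD transfer time from the previous 15 transfers,
    and flag a potential RPO violation when a transfer exceeds half the RPO."""

    HISTORY = 15   # the text says the previous 15 transfers are tracked

    def __init__(self, rpo_minutes):
        self.rpo = rpo_minutes
        self.durations = deque(maxlen=self.HISTORY)

    def record_transfer(self, minutes):
        self.durations.append(minutes)

    def estimated_duration(self):
        # Simple mean over the history (an assumption; the exact estimator
        # used by VR is not public).
        return sum(self.durations) / len(self.durations)

    def minutes_until_next_transfer(self, data_age_minutes):
        """Data already data_age_minutes old, plus the estimated next
        transfer, must both fit inside the RPO."""
        return self.rpo - data_age_minutes - self.estimated_duration()

    def potential_rpo_violation(self):
        return self.estimated_duration() > self.rpo / 2

# The worked example from the text: 1 min create + 8 min transfer + 1 min
# write = 10 minutes end to end, against a 1 hour RPO.
sched = TransferScheduler(rpo_minutes=60)
sched.record_transfer(10)
print(sched.minutes_until_next_transfer(data_age_minutes=10))   # → 40.0
print(sched.potential_rpo_violation())                          # → False
```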

VR will create a per-host replication schedule by taking into account *all* the VMs being replicated from that particular host. This allows it to do host-wide scheduling for each replicated VMDK and allows transfers to take place according to variables such as length of transfer, size of LWD, etc. and gives the scheduler flexibility to send data when appropriate.

The scheduler will execute each time an event occurs that alters replication patterns, such as a power task on a replicated VM, changes to RPOs or a full sync, or an HA event such as a host crash.

Only the most-recent transfer information is persisted. If hostd crashes, or the VM is migrated, or reconfigured, the historic transfer state is lost, and must be re-accumulated for the scheduler to be most effective.

It is important to note that VR is *not* using vSphere based snapshots to create redo logs of the primary VMDK. The VMDK is not interrupted in any fashion at all, and there is no snapshot created.

It also does NOT use “CBT” or “Changed Block Tracking”, another feature of the vSphere Platform. The vSCSI filter of VR is completely independent of CBT by design. This allows CBT to remain untouched for other tools such as VADP and backup software. If CBT were to be used it would reset the changed block tracking epoch, breaking backups and other uses of CBT.

VR is 100% isolated from snapshots and CBT.

Recovering a VM with a few clicks

A VM can be recovered only if the source VM is powered off or is not reachable by the recovery vCenter Server. This avoids having duplicate VMs running at the same time.

For further safety, the VM is booted with no networks connected to help avoid duplicate VMs colliding.

Once the recovery is processed, you can not reconnect and re-enable replication of that VM. You must re-start protection all over again. You may, however, use the old VMDK that might remain at either site as a seed to begin replication again.

Four steps for full recovery

As long as the replication has completed at least once a VM can be recovered quickly and easily directly from the vCenter Web Client.

From the Replication location in the Web Client, choose a VM that has been replicated, right-click and choose to recover.

Choosing a target folder and resource (cluster, host, or resource pool) will then instantiate the replicated VM, create and register the VMX file, attach the VMDK, and power on the VM if chosen.

This can not be automated, and can only be done one VM at a time.


vSphere 5.1 Announced with Site Recovery Manager 5.1

With the announcement of vSphere 5.1 comes the announcement of Site Recovery Manager 5.1. Below are some of the new features and enhancements coming with SRM 5.1.

Application Quiescence for vSphere Replication

The new VR has improved VSS integration and doesn’t merely request OS quiescence, but flushes app/db writers if present.

This is due to better handling of VSS through the VMware Tools present in vSphere 5.1 and requires no work to configure – merely select the quiescing method and VR will handle it.

If VR is asked to use VSS, it will synchronize its creation of the lightweight delta with the request to flush writers and quiesce the application and operating system. This ensures full app consistency for backups.

vSphere Replication is presented the quiescent and consistent volume produced by the OS flushing the VSS writers, and that consistent volume is used to create the LWD for replication.

If for some reason VSS can not quiesce correctly or flush the writers, VR will continue irrespective of the failure, create an OS-consistent LWD bundle at the VM level, and generate a warning that VSS consistency could not be achieved.

All Paths Down Improvements

The way vSphere 5 handles hosts with devices in an “All Paths Down” state has been improved to ensure that the host does not get stuck in a loop attempting I/O on unavailable devices.

APD states often occur during disaster scenarios, and as such it becomes important for SRM that the platform not cause delay for recovery.

SRM now checks for a datastore’s accessibility flag before deciding whether or not to attempt to use that datastore. A datastore may become inaccessible because of various reasons, one of which is APD.

The changes in how vSphere handles these devices enable SRM to differentiate APD from other types of inaccessible states such as Permanent Device Loss (PDL).

If SRM sees a datastore in an APD condition, it will stop immediately and try again later, since APD conditions are supposed to be transient, rather than time out trying to access a missing device.

SRM also has been improved to use a new unmount command to gracefully remove datastores from the primary protected site during the execution of a recovery plan. Since SRM needs to break replication and unmount the datastore from the protected environment the new method allows for a graceful dismount and generation of an APD situation rather than an abrupt removal of the datastore.

During a disaster recovery, however, in some cases hosts are inaccessible via network to gracefully unmount datastores, and in the past the isolated hosts could panic if their storage was removed abruptly by SRM.

With vSphere 5.1 there are new improvements to the hosts and storage stacks that allow them to remain operative even through an unplanned APD state.

Forced Failover

Forced failover was introduced in SRM 5.0.1 for recovery plans using array based replication protection groups. With SRM 5.1 forced failover is now fully supported for all protection group types.

In some cases SRM will be unable to handle storage failure scenarios at the protection site. Perhaps the devices have entered an APD or PDL state, or perhaps storage controllers are unavailable, or for many other reasons. Perhaps the original SAN is reduced to a puddle of molten slag.

In these cases, SRM can enter a state where it waits for responses from the storage for an untenable amount of time. For instance, timeouts have been seen to last as long as 8 hours while waiting for responses from ‘misbehaving’ storage at the protected site.

Forced failover handles these scenarios. If storage is in a known inconsistent state, a user may choose to run a recovery plan failover in “forced failover” mode. Alternately, if a recovery plan is failing and timing out due to unresponsive protected site storage, the administrator could cancel the running recovery plan and launch it again in forced failover mode.

Forced failover will run *only* recovery-side operations of the recovery plan. It will not attempt any protected site operations such as storage unmounts or VM shutdowns. During a forced failover execution of a recovery plan any responses generated by the protected site are completely ignored.

Array-based replication forced failover worked with SRM 5.0.1, and after extensive testing has now been introduced to work with vSphere Replication as well.

Failback supported with both Array and vSphere Replication

SRM 5.1 now includes vSphere Replication in the “automated failback” workflow!

With SRM 5, VMware introduced the “Reprotect” and failback workflows, which allowed storage replication to be automatically reversed and protection of VMs to be automatically configured from the “failed over” site back to the “primary site,” thereby allowing a failover to be run that moved the environment back to the original site.

Taken together as “automated failback” this feature was well received by those using array-based replication, but was unavailable for use with vSphere Replication.

With SRM 5.1 users can now do automated reprotects and run failback workflows for recovery plans with any type of protection group, both VR and ABR inclusive.

After running a *planned failover only*, the SRM user can now reprotect back to the primary environment:

Planned failover shuts down production VMs at the protected site cleanly and disables their use via the GUI. This ensures the VM is a static object, not powered on or running, which is why planned migration is required to fully automate the process.

The “Reprotect” button, when used with VR, will now issue a request to the VR Appliance (the VRMS in SRM 5.0 terminology) to configure replication in the opposite direction.

When this takes place, VR will reuse the same settings that were configured for initial replication from the primary site (RPO, directory, quiescence values, etc.) and will use the old production VMDK as the seed target automatically.

VR then begins to replicate back to the primary disk file originally used by the production VM before failover.

If things have gone wrong at the primary site and an automatic reprotect is not possible due to missing or bad data at the original site, VR can be manually configured, and when the “Reprotect” is issued SRM will automatically use the manually configured VR settings to update the protection group.

Once the reprotect is complete a failback is simply the process of running the recovery plan that was used to failover initially.

vSphere Essentials Plus Support

SRM 5.1 is now supported with vSphere Essentials Plus, enabling smaller companies to move towards reliable disaster recovery protection for their sites.

•vCenter version 5.1 is the only version that will work with SRM 5.1. Lower versions of vSphere/VI are supported, but vCenter must be up to date.

•At time of shipping, only vSphere 4.x and 5.x are supported.

•ONLY ESXi 5.0 and 5.1 will work for vSphere Replication as the VR Agent is a component of the ESXi 5.x hypervisor.

•Both Storage DRS and Storage vMotion are not supported with SRM 5.1, although they will work in some scenarios despite being unsupported.

•While Storage vMotion with array-replicated protected VMs can be done by an administrator, they must then ensure that the target datastore is replicated and that the virtual machine is once again configured for protection. Because this is a very manual process it is not officially supported.

•Storage DRS compounds this problem by automating Storage vMotion, and thereby will cause the VMDKs of protected virtual machines to migrate to potentially unprotected storage. Because of this it is unsupported with SRM 5.

•Storage vMotion and Storage DRS are not supported at all with SRM 5 using vSphere Replication as migration of a VMDK will cause the migrated VM to reconfigure itself for protection, potentially putting it in violation of its recovery point objective.