My ramblings about all things technical


vSphere 5.1 Announced with Site Recovery Manager 5.1

Alongside the announcement of vSphere 5.1 comes the announcement of Site Recovery Manager 5.1. Below are some of the new features and enhancements coming with SRM 5.1.

Application Quiescence for vSphere Replication

The new VR has improved VSS integration and doesn’t merely request OS quiescence, but flushes app/db writers if present.

This is due to better handling of VSS through the VMware Tools present in vSphere 5.1, and it requires no work to configure: merely select the quiescing method and VR will handle it.

If VR is asked to use VSS, it will synchronize its creation of the lightweight delta with the request to flush writers and quiesce the application and operating system. This ensures full app consistency for backups.

vSphere Replication is presented with the quiescent, consistent volume produced by the OS flushing the VSS writers, and that consistent volume is used to create the LWD for replication.

If for some reason VSS cannot quiesce correctly or flush the writers, VR will continue irrespective of the failure, create an OS-consistent LWD bundle at the VM level, and generate a warning that VSS consistency could not be achieved.
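The consistency fallback described above can be sketched as follows. This is a minimal Python illustration of the behavior, not VR's actual implementation; all function and field names here are hypothetical.

```python
import logging

class VssError(Exception):
    """Raised when VSS cannot flush writers or quiesce the guest."""

def flush_vss_writers(vm_name, writers_healthy=True):
    # Stand-in for the in-guest VSS flush performed via VMware Tools.
    if not writers_healthy:
        raise VssError("writer flush failed")

def create_lwd(vm_name, quiesce_vss=True, writers_healthy=True):
    """Create a lightweight delta (LWD), preferring app consistency.

    If VSS quiescence is requested and succeeds, the LWD is
    application-consistent; if VSS fails, VR continues anyway and
    produces an OS-consistent LWD, logging a warning.
    """
    if quiesce_vss:
        try:
            flush_vss_writers(vm_name, writers_healthy)
            return {"vm": vm_name, "consistency": "application"}
        except VssError as exc:
            logging.warning("VSS consistency not achieved for %s (%s); "
                            "creating OS-consistent LWD instead", vm_name, exc)
    return {"vm": vm_name, "consistency": "os"}
```

The key point is that a VSS failure degrades the consistency level of the delta rather than aborting replication.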

All Paths Down Improvements

The way vSphere 5 handles hosts with devices in an “All Paths Down” state has been improved to ensure that the host does not get stuck in a loop attempting I/O on unavailable devices.

APD states often occur during disaster scenarios, so it is important for SRM that the platform not delay recovery.

SRM now checks a datastore's accessibility flag before deciding whether or not to attempt to use that datastore. A datastore may become inaccessible for various reasons, one of which is APD.

The changes in how vSphere handles these devices enable SRM to differentiate APD from other types of inaccessible states such as Permanent Device Loss (PDL).

If SRM sees a datastore in an APD condition, it will stop immediately and try again later, since APD conditions are supposed to be transient, rather than time out trying to access a missing device.
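The distinction matters for how long to wait. A minimal sketch of this APD-aware access check, using made-up state names rather than real vSphere API constants:

```python
import time

# Illustrative datastore states; these are not vSphere API constants.
ACCESSIBLE, APD, PDL = "accessible", "apd", "pdl"

def try_use_datastore(get_state, retries=3, backoff_s=0.01):
    """Consult the accessibility flag before touching the datastore.

    APD is assumed transient, so back off and retry later instead of
    issuing I/O that could hang against a missing device. PDL is
    assumed permanent, so give up immediately.
    """
    for _ in range(retries):
        state = get_state()
        if state == ACCESSIBLE:
            return "used"
        if state == PDL:
            return "abandoned"    # permanent loss: retrying is pointless
        time.sleep(backoff_s)     # APD: transient, try again later
    return "deferred"
```

Checking the flag first, rather than attempting I/O and waiting for a timeout, is what keeps the host (and SRM) out of the stuck-in-a-loop behavior described above.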

SRM has also been improved to use a new unmount command to gracefully remove datastores from the primary protected site during the execution of a recovery plan. Since SRM needs to break replication and unmount the datastore from the protected environment, the new method allows for a graceful unmount and a deliberately induced APD state rather than an abrupt removal of the datastore.

During a disaster recovery, however, in some cases hosts cannot be reached over the network to gracefully unmount datastores, and in the past these isolated hosts could panic if their storage was removed abruptly by SRM.

With vSphere 5.1 there are new improvements to the hosts and storage stacks that allow them to remain operative even through an unplanned APD state.

Forced Failover

Forced failover was introduced in SRM 5.0.1 for recovery plans using array based replication protection groups. With SRM 5.1 forced failover is now fully supported for all protection group types.

In some cases SRM will be unable to handle storage failure scenarios at the protection site. Perhaps the devices have entered an APD or PDL state, or perhaps storage controllers are unavailable, or for many other reasons. Perhaps the original SAN is reduced to a puddle of molten slag.

In these cases, SRM can enter a state where it waits for responses from the storage for an untenable amount of time. For instance, timeouts have been seen to last as long as 8 hours while waiting for responses from ‘misbehaving’ storage at the protected site.

Forced failover handles these scenarios. If storage is in a known inconsistent state, a user may choose to run a recovery plan failover in “forced failover” mode. Alternately, if a recovery plan is failing and timing out due to unresponsive protected site storage, the administrator could cancel the running recovery plan and launch it again in forced failover mode.

Forced failover will run *only* recovery-side operations of the recovery plan. It will not attempt any protected site operations such as storage unmounts or VM shutdowns. During a forced failover execution of a recovery plan any responses generated by the protected site are completely ignored.
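The semantics of forced failover can be boiled down to a small sketch. The step model below is illustrative (a flat list of site/action pairs), not SRM's internal representation:

```python
def run_recovery_plan(steps, forced=False):
    """Sketch of forced-failover semantics.

    Each step is a (site, action) pair. In forced mode, protected-site
    steps such as VM shutdowns and storage unmounts are skipped
    entirely and nothing ever waits on the protected site, so
    misbehaving storage there cannot stall the plan.
    """
    executed = []
    for site, action in steps:
        if forced and site == "protected":
            continue              # ignore the protected site completely
        executed.append((site, action))
    return executed

plan = [
    ("protected", "shutdown_vms"),
    ("protected", "unmount_datastores"),
    ("recovery", "promote_replicas"),
    ("recovery", "power_on_vms"),
]
```

Running the plan with `forced=True` executes only the two recovery-side steps, which is why forced failover cannot be blocked by an 8-hour storage timeout at the protected site.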

Array-based replication forced failover worked with SRM 5.0.1, and after extensive testing it has now been extended to work with vSphere Replication as well.

Failback supported with both Array and vSphere Replication

SRM 5.1 now includes vSphere Replication in the “automated failback” workflow!

With SRM 5, VMware introduced the "Reprotect" and failback workflows, which allowed storage replication to be automatically reversed and protection of VMs to be automatically configured from the "failed over" site back to the primary site, thereby allowing a failover to be run that moved the environment back to the original site.

Taken together as “automated failback” this feature was well received by those using array-based replication, but was unavailable for use with vSphere Replication.

With SRM 5.1, users can now do automated reprotects and run failback workflows for recovery plans with any type of protection group, both VR and ABR.

After running a *planned failover only*, the SRM user can now reprotect back to the primary environment:

Planned failover shuts down production VMs at the protected site cleanly and disables them in the GUI. This ensures the VM is a static object and not powered on or running, which is why a planned migration is required to fully automate the process.

The "Reprotect" button, when used with VR, will now issue a request to the VR Appliance (VRMS in SRM 5.0 terminology) to configure replication in the opposite direction.

When this takes place, VR will reuse the same settings that were configured for the initial replication from the primary site (RPO, target directory, quiescence settings, etc.) and will automatically use the old production VMDK as the seed target.

VR now begins to replicate back to the primary disk file originally used by the production VM before failover.

If things have gone wrong at the primary site and an automatic reprotect is not possible due to missing or bad data at the original site, VR can be configured manually; when the "Reprotect" is issued, SRM will automatically use the manually configured VR settings to update the protection group.

Once the reprotect is complete a failback is simply the process of running the recovery plan that was used to failover initially.
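The reprotect step above is essentially "swap direction, keep settings, seed from the old disk." A minimal Python sketch, with field names invented for illustration (they do not correspond to real VR API properties):

```python
def reprotect(replication):
    """Sketch of VR reprotect: reverse direction, reuse settings.

    The source and target sites swap, the RPO/directory/quiescence
    settings carry over unchanged, and the pre-failover production
    VMDK becomes the seed for the reversed replication, so only
    changed blocks need to be copied back.
    """
    reversed_repl = dict(replication)
    reversed_repl["source"], reversed_repl["target"] = (
        replication["target"], replication["source"])
    # Reuse the old production disk as the initial seed.
    reversed_repl["seed_vmdk"] = replication["source_vmdk"]
    return reversed_repl
```

Because everything but the direction and the seed is carried over, no new replication configuration has to be entered for the failback leg in the normal case.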

vSphere Essentials Plus Support

SRM 5.1 is now supported with vSphere Essentials Plus, enabling smaller companies to move towards reliable disaster recovery protection for their sites.

• vCenter version 5.1 is the only version that will work with SRM 5.1. Lower versions of vSphere/VI hosts are supported, but vCenter must be up to date.

• At time of shipping, only vSphere 4.x and 5.x are supported.

• ONLY ESXi 5.0 and 5.1 will work for vSphere Replication, as the VR Agent is a component of the ESXi 5.x hypervisor.

• While Storage DRS and Storage vMotion are not supported with SRM 5.1, they will work in some scenarios even though unsupported.

• While Storage vMotion of array-replicated protected VMs can be done by an administrator, they must then ensure that the target datastore is replicated and that the virtual machine is once again configured for protection. Because this is a very manual process, it is not officially supported.

• Storage DRS compounds this problem by automating Storage vMotion, and will thereby cause the VMDKs of protected virtual machines to migrate to potentially unprotected storage. Because of this it is unsupported with SRM 5.1.

• Storage vMotion and Storage DRS are not supported at all with SRM 5.1 using vSphere Replication, as migration of a VMDK will cause the migrated VM to reconfigure itself for protection, potentially putting it in violation of its recovery point objective.