Wednesday, March 12, 2014

DPM 2012 and Beyond Frustration

All of our Hyper-V clusters (Server 2008 R2 hosts) started failing backups on our two independent Data Protection Manager servers. The problem progressed gradually: at first one node consistently failed backups of its virtual machines while the other hosts kept backing up normally, until eventually none of our nodes could successfully back up any virtual machine. Our standalone backups via DPM had no issues. These hosts had been configured and left unchanged for well over a year - the only changes were Windows patches applied months prior and the anti-virus updates that load continuously.

For the failed backups, DPM kept stating "The VSS application writer or the VSS provider is in a bad state ... ID 30111: VssError:A function call was made when the object was in an incorrect state for that function(0x80042301))". The local nodes logged VSS 12362 Application Log Event Errors, "A Shadow Copy LUN was not detected in the system and did not arrive", and VSS 12363 Application Log Event Errors, "An expected hidden volume arrival did not complete because this LUN was not detected", whenever we attempted to run a full virtual machine backup via a consistency check.
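
For anyone chasing the same "bad state" message, here is a minimal sketch of how you could check which VSS writer is unhappy on a node. It assumes the usual English text layout of `vssadmin list writers` (a real Windows command, run from an elevated prompt); the function names and the sample text are made up for illustration, and this is not something DPM or the HIT kit ships.

```python
import re
import subprocess

# Hypothetical sample in the usual `vssadmin list writers` layout,
# so the parser can be tried without a Windows host.
SAMPLE_OUTPUT = """\
vssadmin 1.1 - Volume Shadow Copy Service administrative command-line tool

Writer name: 'Microsoft Hyper-V VSS Writer'
   Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
   State: [8] Failed
   Last error: Non-retryable error

Writer name: 'System Writer'
   Writer Id: {e8132975-6f93-4464-a53e-1050253ae220}
   State: [1] Stable
   Last error: No error
"""


def parse_vss_writers(text):
    """Turn `vssadmin list writers` text into a list of dicts."""
    writers = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Writer name: '(.*)'$", line)
        if m:
            current = {"name": m.group(1), "state": None, "last_error": None}
            writers.append(current)
            continue
        if current is None:
            continue  # still in the banner before the first writer
        m = re.match(r"State: \[\d+\] (.*)$", line)
        if m:
            current["state"] = m.group(1)
            continue
        m = re.match(r"Last error: (.*)$", line)
        if m:
            current["last_error"] = m.group(1)
    return writers


def unhealthy_writers(writers):
    """Keep writers that are not Stable or that report an error."""
    return [
        w for w in writers
        if w["state"] != "Stable" or w["last_error"] != "No error"
    ]


def live_writers_output():
    """Run vssadmin on the local node (Windows only, elevated prompt)."""
    result = subprocess.run(
        ["vssadmin", "list", "writers"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Demonstrate on the sample text; swap in live_writers_output() on a real node.
for w in unhealthy_writers(parse_vss_writers(SAMPLE_OUTPUT)):
    print(f"{w['name']}: state={w['state']}, last error={w['last_error']}")
```

Running this on each cluster node before and after a change (for example, re-registering a provider) gives a quick read on whether the change made any difference to writer state.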

Here is what we tried that didn't work...
  • Power cycling all of the equipment involved: Hyper-V Servers (PowerEdge R710's), the iSCSI SAN (EqualLogic PS4000vx's), the switches connecting them (Catalyst 3750X's), and our DPM server
  • Unregistering and Registering the EqualLogic VSS provider (eqlvss /unregserver and eqlvss /regserver)
  • Removing virtual machines from a protection group (deleting disk data) and adding them back
  • Moving virtual machines to a new protection group
  • Upgrading the EqualLogic Windows Host Integration Toolkits (HIT kits) on the Hyper-V nodes - upgraded from 4.0 to 4.6
  • Installing the EqualLogic HIT kit on one of the virtual machines
  • Patching the Hyper-V nodes with all of the latest Windows Updates - even KB 2908783, released only yesterday, which resolves issues with corruption of iSCSI LUNs in Windows Server 2008 R2 and 2012
... and still no success.

After much time wasted on what seemed to be magic potions and DPM's hatred of backing up critical data, a random thought of trying to disable our anti-virus on the cluster nodes resolved the issue! Yeah, I know everything you read everywhere says to disable anti-virus, but we have had Microsoft Forefront Client Security configured and running on these systems since we set up these servers 2+ years ago. Apparently, some change in the definitions, or just its mood, made it start messing with the iSCSI VSS hardware process... and with my sleep over the last two days.

Good luck!

1 comment:

Anonymous said...

Hello,
Same error message and symptoms, but NO AV on our production server, so we cannot see a way out of this.