Hey all,
I'm using DPM 2012 R2 with UR5 to back up a SharePoint 2013 farm. I have the DPM agent installed on one web server in the farm (there are 2 web servers and 3 app servers) and on the SQL back end. When I created the protection group, DPM was able to take the first backup without issues.
However, every night we get alerts from SCOM saying that a backup for a particular database failed on the SQL box. Sometimes when I check DPM, the replica is marked inconsistent, but other times it's just fine.
Lately, however, we have been having far more critical issues. Almost every night, DPM reports that the replica is inconsistent and gives this reason:
Change Tracking has been marked inconsistent due to one of the following reasons
1. Unexpected shutdown of the protected server
2. Unforeseen issue in DPM Bitmap failover during cluster failover of one or more datasources sharing the tracked volume. (ID 30501 Details: Unknown error (0xe0062040) (0xE0062040))
I verified that neither of the servers is crashing, and I'm not sure what to check for #2. On the SQL server, we see many VSS timeouts, errors about shadow copy databases left mounted, and write and flush timeouts on the data volume, along with errors like these:
BackupVirtualDeviceFile::SendFileInfoBegin: failure on backup device '{97BCAB2B-4637-441D-B686-A39206584141}1'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.)
Volume Shadow Copy Service error: The I/O writes cannot be held during the shadow copy creation period on volume\\?\Volume{a4828730-fb21-4b94-893a-9b62c2cfb3e7}\. The volume index in the shadow copy set is 0. Error details: Open[0x00000000, The operation completed successfully.
], Flush[0x00000000, The operation completed successfully.
], Release[0x80042314, The shadow copy provider timed out while holding writes to the volume being shadow copied. This is probably due to excessive activity on the volume by an application or a system service. Try again later when activity on the volume is reduced.
], OnRun[0x00000000, The operation completed successfully.
].
Operation:
Executing Asynchronous Operation
Context:
Current State: DoSnapshotSet
Volume Shadow Copy Service error: The shadow copy could not be committed - operation timed out. Error context: DeviceIoControl(\\?\Volume{a4828730-fb21-4b94-893a-9b62c2cfb3e7} - 0000000000000344,0x0053c010,000000F33CBF1E00,0,000000F33CBF3E20,4096,[0]).
Operation:
Committing shadow copies
Context:
Execution Context: System Provider
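In case it helps anyone sort through a pile of these events, here is a minimal Python sketch that pulls the per-phase result codes (Open, Flush, Release, OnRun) out of the "Error details" text shown above and lists the phases that did not return success. It assumes the event text has already been exported to a plain string; the function names and the shortened sample string are only illustrative and are not part of DPM or VSS.

```python
import re

# Matches entries such as "Release[0x80042314, <message text>\n]".
# DOTALL lets the message span the line break before the closing bracket.
PHASE_RE = re.compile(r"(\w+)\[(0x[0-9A-Fa-f]{8}),\s*(.*?)\s*\]", re.DOTALL)


def parse_vss_phases(event_text):
    """Return {phase: (result_code, message)} for each phase in the event text."""
    phases = {}
    for name, code, message in PHASE_RE.findall(event_text):
        # Collapse internal line breaks so each message is a single line.
        phases[name] = (int(code, 16), " ".join(message.split()))
    return phases


def failed_phases(event_text):
    """Return the names of phases whose result code is not 0 (success)."""
    return [name for name, (code, _) in parse_vss_phases(event_text).items()
            if code != 0]


# Shortened copy of the event text from this post (illustrative only).
sample = """Error details: Open[0x00000000, The operation completed successfully.
], Flush[0x00000000, The operation completed successfully.
], Release[0x80042314, The shadow copy provider timed out while holding writes to the volume being shadow copied.
], OnRun[0x00000000, The operation completed successfully.
]."""

for phase in failed_phases(sample):
    code, message = parse_vss_phases(sample)[phase]
    print(f"{phase}: 0x{code:08X} {message}")
```

In the event above, only the Release phase fails, with 0x80042314 (the timeout while holding writes), which is why disk activity during the snapshot window is my first suspect.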
So I'm trying to figure out whether I'm looking at disk performance issues here or whether something else could be causing the failures. I have our storage team checking disk activity during the window when DPM runs, but I figured I'd also reach out here and see what others think.
Thanks in advance,
Aaron