Tuesday 3 March 2015

Unable to Perform Test Recovery via SRM : Failed to create LUN snapshots

Issue Definition: Unable to perform a test recovery via SRM.

- Checked the storage monitor and found that replication was working.
- Increased the timeout for the LUN snapshot and ran the test recovery again: it still failed.


Log Snippet: Dr.Logs

- Checked the SRM recovery log file and found the following error messages:
===================================================================

[2011-09-29 18:02:26.098 02276 trivia 'SecondarySanProvider'] testFailover's output:
D:/Program Files (x86)/VMware/VMware vCenter Site Recovery Manager/scripts/SAN/IBMSVC[ Thu Sep 29 18:02:22.825 testFailover Info ] SRA Version 1.20.71310
[#4] [ Thu Sep 29 18:02:22.857 testFailover Info ] Preconfig option setting ... false
[#4] [ Thu Sep 29 18:02:23.218 testFailover Error ] No valid WWPN found in the input...
[#4] [ Thu Sep 29 18:02:23.218 testFailover trivia ] Begin testFailover()
[#4] [ Thu Sep 29 18:02:23.224 testFailover trivia ] End testFailover()
[#4] [ Thu Sep 29 18:02:23.224 testFailover trivia ] testFailover::stop
[#4] [ Thu Sep 29 18:02:23.224 testFailover trivia ] Begin testFailover::getArrayTypeAndInfo()
[#4] [ Thu Sep 29 18:02:23.224 testFailover trivia ] testFailover::CIM_ObjectManager
[#4] [ Thu Sep 29 18:02:23.771 testFailover trivia ] Begin testFailover::deleteSVCFC()
[#4] [ Thu Sep 29 18:02:23.771 testFailover trivia ] Begin testFailover::getSVCSCS()
[#4] [ Thu Sep 29 18:02:23.771 testFailover trivia ] testFailover::IBMTSSVC_StorageConfigurationService
[#4] [ Thu Sep 29 18:02:23.802 testFailover trivia ] End testFailover::getSVCSCS()
[#4] [ Thu Sep 29 18:02:23.802 testFailover trivia ] testFailover::IBMTSSVC_RemoteStorageSynchronized
[#4] [ Thu Sep 29 18:02:24.843 testFailover trivia ] testFailover::IBMTSSVC_Cluster
[#4] [ Thu Sep 29 18:02:24.969 testFailover trivia ] testFailover::IBMTSSVC_LocalStorageSynchronized
[#4] [ Thu Sep 29 18:02:25.61 testFailover trivia ] testFailover::IBMTSSVC_SynchronizedSet
[#4] [ Thu Sep 29 18:02:25.204 testFailover trivia ] testFailover::IBMTSSVC_StorageVolume
[#4] [ Thu Sep 29 18:02:25.961 testFailover trivia ] testFailover::IBMTSSVC_HardwareIDStorageVolumeView
[#4] [ Thu Sep 29 18:02:26.58 testFailover Warning ] No snapshots to cleanup

[2011-09-29 18:02:26.099 02276 info 'SecondarySanProvider'] testFailover exited with exit code 0
[2011-09-29 18:02:26.099 02276 trivia 'SecondarySanProvider'] 'testFailover' returned <?xml version="1.0" encoding="ISO-8859-1"?>
[#4] <Response>
[#4] <ReturnCode>0</ReturnCode>
[#4] </Response>
[#4]
[2011-09-29 18:02:26.099 02276 info 'SecondarySanProvider'] Return code for testFailover: 0
[2011-09-29 18:02:26.099 02276 verbose 'SecondarySanProvider'] Removed dependent 'shadow-group-171743' from array 'array-56521'
[2011-09-29 18:02:26.100 02276 trivia 'SecondarySanProvider'] 'Prepare storage for group 'shadow-group-171743' for recovery' took 10.579 seconds
[2011-09-29 18:02:26.100 02276 verbose 'PropertyProvider'] RecordOp ASSIGN: info.error, BeginImageTest-400
[2011-09-29 18:02:26.100 02276 verbose 'BeginImageTest-Task'] Error set to (dr.san.fault.ExecutionError) {
[#4] dynamicType = <unset>,
[#4] faultCause = (vmodl.MethodFault) null,
[#4] errorMessage = "Failed to create lun snapshots",
[#4] msg = "",
[#4] }

Checks Performed:
===================

- The error pointed to zoning: "No valid WWPN found in the input".
- The next step was to check the zoning.
- Found multiple errors on the storage monitor.
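
The one Error-level line in the SRA output is easy to miss among the trivia entries. Below is a minimal Python sketch for pulling out only the Error and Warning lines, assuming the SRA output keeps the "[ timestamp command level ] message" layout seen in the log snippet above. The function name sra_problems and the sample list are illustrative only, not part of SRM or the SRA.

```python
import re

# Matches SRA output such as:
# [#4] [ Thu Sep 29 18:02:23.218 testFailover Error ] No valid WWPN found in the input...
SRA_LINE = re.compile(
    r"\[ (?P<stamp>\w{3} \w{3} +\d+ [\d:.]+) (?P<command>\w+) (?P<level>\w+) \] (?P<message>.*)"
)


def sra_problems(lines, levels=("Error", "Warning")):
    """Return (level, timestamp, message) for SRA lines logged at the given levels."""
    found = []
    for line in lines:
        match = SRA_LINE.search(line)
        if match and match.group("level") in levels:
            found.append(
                (match.group("level"), match.group("stamp"), match.group("message").strip())
            )
    return found


# Sample lines copied from the log snippet in this post.
sample = [
    "[#4] [ Thu Sep 29 18:02:22.857 testFailover Info ] Preconfig option setting ... false",
    "[#4] [ Thu Sep 29 18:02:23.218 testFailover Error ] No valid WWPN found in the input...",
    "[#4] [ Thu Sep 29 18:02:23.218 testFailover trivia ] Begin testFailover()",
    "[#4] [ Thu Sep 29 18:02:26.58 testFailover Warning ] No snapshots to cleanup",
]

for level, stamp, message in sra_problems(sample):
    print(f"{level:8} {stamp}  {message}")
```

Filtering this way surfaces the WWPN error immediately, even though the SRA itself exited with return code 0.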

Action Plan :
===========
- Take a manual snapshot of the consistency group.

**The manual snapshot of the consistency group also failed, so the storage vendor was involved for further assistance.

Tuesday 17 February 2015

Test recovery with EMC RecoverPoint SRA and SRM 5.0 fails with the error: Failed to recover the datastore

Symptoms :
=========
Attempting to perform a Test Failover fails with the error:

Error - Failed to recover datastore 'XX_HDXX_LUNXX_T4'. VMFS volume residing on recovered devices "5431", "5432", "5433" and expected to be auto-mounted during HBA rescan cannot be found

Started the replication via SRM; the LUN was mounted successfully. Checked and found that the problem was with EMC replication.

The dr-vmware.log file showed the error:

Failed to recover datastore 'VNX-MP-LUN07-TIER3'. VMFS volume residing on recovered devices '"60:06:01:60:07:E0:22:00:20:DD:9E:42:CF:0F:E1:11"' and expected to be auto-mounted during HBA rescan cannot be found

Logs :
=======
2012-01-06T23:06:54.263Z cpu14:114970)WARNING: HBX: 1889: Failed to initialize VMFS3 distributed locking on volume 4ef4e35b-1442b43c-038c-0025b5000c19: No
2012-01-06T23:06:54.271Z cpu14:114970)FSS: 4333: No FS driver claimed device 'snap-50d5bf4f-4ef4e359-709ff89d-a083-0025b5000c19': Not supported
2012-01-06T23:06:54.274Z cpu14:114970)VC: 1449: Device rescan time 38 msec (total number of devices 12)
2012-01-06T23:06:54.274Z cpu14:114970)VC: 1452: Filesystem probe time 64 msec (devices probed 9 of 12)
2012-01-06T23:06:54.292Z cpu11:2059)ScsiDeviceIO: 2316: Cmd(0x4124415c3540) 0x9e, CmdSN 0xcbc2 to dev "naa.50060160bce010a750060160bce010a7" failed H:0x0
2012-01-06T23:06:54.292Z cpu11:2059)ScsiDeviceIO: 2316: Cmd(0x4124415c3540) 0x25, CmdSN 0xcbc3 to dev "naa.50060160bce010a750060160bce010a7" failed H:0x0
2012-01-06T23:06:54.294Z cpu11:2059)ScsiDeviceIO: 2316: Cmd(0x4124415c3540) 0x28, CmdSN 0xcbcb to dev "naa.6006016007e02200608991b2ad2de111" failed H:0x0
2012-01-06T23:06:54.294Z cpu14:114970)Partition: 484: Read of GPT header failed on "naa.6006016007e02200608991b2ad2de111": I/O error
2012-01-06T23:06:54.294Z cpu11:2059)ScsiDeviceIO: 2316: Cmd(0x4124415c3540) 0x28, CmdSN 0xcbcc to dev "naa.6006016007e02200608991b2ad2de111" failed H:0x0
2012-01-06T23:06:54.294Z cpu14:114970)WARNING: Partition: 944: Partition table read from device naa.6006016007e02200608991b2ad2de111 failed: I/O error
2012-01-06T23:06:54.345Z cpu14:114970)HBX: 676: Setting pulse [HB state abcdef02 offset 3764224 gen 1 stampUS 177206183150 uuid 4f04ca84-1f05f6a1-c310-d8d
2012-01-06T23:06:54.345Z cpu14:114970)WARNING: FSAts: 1263: Denying reservation access on an ATS-only vol 'VNX-HP-LUN01-TIER2'
2012-01-06T23:06:54.345Z cpu14:114970)WARNING: HBX: 1889: Failed to initialize VMFS3 distributed locking on volume 4ef4e35b-1442b43c-038c-0025b5000c19: No
2012-01-06T23:06:54.352Z cpu14:114970)FSS: 4333: No FS driver claimed device 'snap-50d5bf4f-4ef4e359-709ff89d-a083-0025b5000c19': Not supported
2012-01-06T23:06:54.376Z cpu14:114970)Vol3: 647: Couldn't read volume header from control: Invalid handle
2012-01-06T23:06:54.376Z cpu14:114970)FSS: 4333: No FS driver claimed device 'control': Not supported
2012-01-06T23:06:54.396Z cpu14:114970)VC: 1449: Device rescan time 41 msec (total number of devices 12)

Resolution :
============
Our initial workaround was to disable the single VAAI primitive VMFS3.HardwareAcceleratedLocking on the hosts. On the EMC side, we had to change the affected hosts from Failover Mode 1 to Failover Mode 4 (within the Unisphere Storage System Connectivity Status menu). Once the hosts were changed to Failover Mode 4, we re-enabled the VAAI primitive and no longer encountered the issue.
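
In this case, the two vmkernel messages that tied the failure to ATS locking were the FSAts "Denying reservation access on an ATS-only vol" warning and the HBX "Failed to initialize VMFS3 distributed locking" warning. Below is a minimal Python sketch for pulling the affected volume labels and UUIDs out of a vmkernel log, assuming the message wording matches the excerpt above. The function name ats_lock_symptoms is hypothetical.

```python
import re

# Patterns based on the wording of the vmkernel messages quoted in this post.
ATS_DENIED = re.compile(
    r"FSAts: \d+: Denying reservation access on an ATS-only vol '(?P<label>[^']+)'"
)
LOCK_FAILED = re.compile(
    r"HBX: \d+: Failed to initialize VMFS3 distributed locking on volume (?P<uuid>[0-9a-f-]+)"
)


def ats_lock_symptoms(lines):
    """Return (volume labels denied reservation access, volume UUIDs whose locking failed)."""
    labels, uuids = set(), set()
    for line in lines:
        denied = ATS_DENIED.search(line)
        if denied:
            labels.add(denied.group("label"))
        failed = LOCK_FAILED.search(line)
        if failed:
            uuids.add(failed.group("uuid"))
    return sorted(labels), sorted(uuids)


# Sample lines copied from the log excerpt in this post.
sample = [
    "2012-01-06T23:06:54.345Z cpu14:114970)WARNING: FSAts: 1263: Denying reservation access on an ATS-only vol 'VNX-HP-LUN01-TIER2'",
    "2012-01-06T23:06:54.345Z cpu14:114970)WARNING: HBX: 1889: Failed to initialize VMFS3 distributed locking on volume 4ef4e35b-1442b43c-038c-0025b5000c19: No",
    "2012-01-06T23:06:54.396Z cpu14:114970)VC: 1449: Device rescan time 41 msec (total number of devices 12)",
]

labels, uuids = ats_lock_symptoms(sample)
print("ATS-only volumes denied reservation access:", labels)
print("Volumes with failed lock initialization:", uuids)
```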

Unable to power on/off the VM Error : msg.checkpoint.cpufeaturecheck.fail

We have encountered many situations in which a virtual machine could not be powered on or off. Below is one troubleshooting case in which the extensive logging in the VMware logs helped us identify and resolve the problem.

Symptoms :
==========
  • Unable to power on/off the virtual machine
  • Unable to revert to the previous snapshot
  • Unable to unregister/register the virtual machine on a different host
Purpose :
=========
  • To power on the virtual machine
Cause :
=======
VMware.logs:
vmx| [msg.checkpoint.cpufeaturecheck.fail] The features supported by the processor(s) in this machine are different from the features supported by the processor(s) in the machine on which the checkpoint was saved. Please try to resume the snapshot on a machine where the processors have the same features
Resolution :
============
Uncommented the checkpoints in the vmx file and was successfully able to power on the virtual machine.
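
Before editing the .vmx file, it helps to see which checkpoint entries it actually contains. Below is a minimal Python sketch that lists them, assuming the entries use keys beginning with "checkpoint." and that commented lines start with "#". The function name checkpoint_entries and the sample file content are hypothetical and were not taken from the affected VM.

```python
def checkpoint_entries(vmx_text):
    """Return (line_number, line) for each .vmx line with a checkpoint key, commented or not."""
    hits = []
    for number, line in enumerate(vmx_text.splitlines(), start=1):
        # Drop a leading '#' so commented-out entries are reported as well.
        key_part = line.strip().lstrip("#").strip()
        if key_part.lower().startswith("checkpoint."):
            hits.append((number, line.strip()))
    return hits


# Hypothetical .vmx content, for illustration only.
sample_vmx = '''displayName = "example-vm"
checkpoint.vmState = "example-vm.vmss"
#checkpoint.vmState.readOnly = "FALSE"
memsize = "4096"
'''

for number, line in checkpoint_entries(sample_vmx):
    print(f"line {number}: {line}")
```

The sketch only reports the entries and does not change the file, so it is safe to run against a copy of the .vmx before deciding what to edit.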

vMotion stuck at "In progress" status & unable to connect to the Virtual Center service + Tomcat service using high memory

In this case we were unable to perform a vMotion of a VM. In the course of troubleshooting we found that, although the first problem was the database size, there was an underlying issue with the Tomcat service that caused the vMotion failure.

Symptoms / Troubleshooting Performed :
==============================
    Unable to connect to the Virtual Center / unable to perform vMotion.
    Tried to truncate the database, following KB 1003980.
    Tried to stop the Virtual Center service: unable to stop the service.
    Rebooted the Virtual Center server: successful.
    Unable to start the Virtual Center service.
    Checked the size of the database: 50 GB.
    Found that the Microsoft SQL database had exceeded its maximum limit.
    Followed articles 1007453 and 1000125: got 525 GB of free space.
    Was successfully able to connect to the Virtual Center.

Purpose :
=======
     Was unable to perform vMotion.
     Checked Task Manager and found that the Tomcat service was using a large amount of memory.

Cause :
======
    The Tomcat service was using a large amount of memory.

Resolution :
=========
     Increased the memory of the Tomcat service from 256 MB to 1024 MB and was successfully able to perform vMotion.

Unable to ping the virtual machine: "Could not find the file specified" while starting the IPSEC services

Symptoms :
==========
  • Unable to ping the virtual machine from another VM in the same subnet.
  • Unable to ping the default gateway.
Purpose :
==========
  • To be able to connect to the network.
  • To be able to connect to the domain network.
Cause :
========
  • Checked the Event Viewer logs and found error messages related to the IPSEC service.
  • Checked services.msc and found that the IPSEC service was not started.
  • Error: "Could not find the file specified" while starting the IPSEC services.
Resolution :
===========
To resolve this issue, followed these steps:
  • Rebuild a new local policy store. To do this, click Start, click Run, type regsvr32 polstore.dll in the Open box, and then click OK.
  • Verify that the IPSEC Services component is set to automatic, and then restart the domain controller.
  • Restarted the virtual machine and was able to connect to the network.
More Information : 
===================
Followed this Knowledge Base article: http://support.microsoft.com/kb/912023

Error "failed: Unable to create a VSS snapshot of the source volume. Error code 2147754774 (0x80042316)"

Symptoms :
==========
  • Error "failed: Unable to create a VSS snapshot of the source volume. Error code 2147754774(0x80042316)"
  • Converter logs show the following error: "a general system error occurred.Found dangling SSL error"
  • Conversion fails at 1%
Purpose :
=========
  • To be able to convert from Physical Machine to Virtual Machine.
  • To be able to convert from Virtual Machine to Virtual Machine.
Resolution :
============
  • Placed the computer in a clean boot state, disabled the firewall, and tried the conversion again: same issue.
  • Tried cold cloning by attaching the standalone Converter CD to the virtual machine's CD drive: error "no network driver found".
  • Checked and found that the problem was related to the VSS service.
  • Restarted the VSS service: same issue.
  • Tried to perform a backup through the Net backup utility: unable to take a backup.
  • Found the following Microsoft Knowledge Base article to fix VSS: http://support.microsoft.com/kb/940184
  • Installed and ran the Fix it, and was then able to start and complete the conversion process.
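
The decimal and hexadecimal values in the error message are the same 32-bit code, and some logs print it as a negative signed number instead. Below is a small Python sketch for converting either form to the hexadecimal value that is easier to search for. The helper name to_hresult is hypothetical.

```python
def to_hresult(code):
    """Format a signed or unsigned 32-bit error code as an 8-digit hex string."""
    return f"0x{code & 0xFFFFFFFF:08X}"


print(to_hresult(2147754774))   # unsigned form from the Converter error
print(to_hresult(-2147212522))  # the same code written as a signed 32-bit value
```

Both calls print 0x80042316, the code shown in the Converter error.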

Duplicate IP error on a VM: Unable to connect to Network

Symptoms :
=========
  • Unable to connect the virtual machine to the network.
Purpose :
=======
  • To get the virtual machine onto the network.
Cause :
======
  • Only Windows 2008 machines were getting the duplicate IP errors.
  • Tried resetting the TCP/IP stack.
  • Tried assigning a different static IP: same problem.
Resolution :
=========
When this problem occurs, the ProxyArp device responds to all ARP requests.
To work around this problem, we can turn off gratuitous ARP by setting the value of the ArpRetryCount registry entry to 0. To do this, follow these steps.

1.  Click Start, type regedit in the Start Search box, and then press ENTER.
2.  Locate the following registry key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
3.  On the Edit menu, point to New, and then click DWORD Value.
4.  Type ArpRetryCount.
5.  Right-click the ArpRetryCount registry entry, and then click Modify.
6.  In the Value data box, type 0, and then click OK.
7.  Exit Registry Editor.
8.  Reboot the machine.
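
When several Windows 2008 machines need the same change, the registry steps can be captured as a .reg file and imported on each one. Below is a minimal Python sketch that only builds the file text for the key and value named in the steps. The function name reg_file_text is hypothetical, and saving and importing the file is left to the reader.

```python
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def reg_file_text(key, name, value):
    """Build .reg file text that sets a single DWORD value under the given key."""
    return (
        "Windows Registry Editor Version 5.00\r\n"
        "\r\n"
        f"[{key}]\r\n"
        f'"{name}"=dword:{value:08x}\r\n'
    )


# ArpRetryCount = 0, as in the manual steps.
print(reg_file_text(KEY, "ArpRetryCount", 0))
```

A reboot is still needed after importing the file, as in the manual steps.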

See below link :
http://social.technet.microsoft.com/Forums/en-US/windowsserver2008r2networking/thread/d7bda315-6366-4e0a-bdcf-dc875ff6963e