It’s been a while since I updated my blog. I’ve been busy lately with work and family, so it’s really difficult to find time to write. However, this post is something I’ve wanted to share for a long while, and a recent engagement with a client further fueled my wish to publish it.
For practitioners of VMware virtualization, especially those working with BC/DR products, VMware Site Recovery Manager is certainly nothing new, and most of us have played with it quite a bit. There are also quite a number of blogs on its gotchas, and I have referred to some of them in the past to solve problems I faced. VMware SRM is not a difficult product, but it requires a fair bit of product knowledge and experience when it comes to array-based replication. And I think some of us will agree that some of the error messages displayed in the SRM console are not exactly helpful and are sometimes downright ambiguous. Deciphering logs is a key skill for anyone who works with VMware SRM on a regular basis.
Recently I got my hands on another VMware SRM project, this time with NetApp storage over NFS, and the project brought me a lot of pain and further reinforced my belief that
“if you do not have a well-managed environment, try applying best common practices to help baseline your environment”.
Let me explain my situation a bit; that will probably give you a better understanding of what was going on. The client intended to set up VMware SRM to meet their BC/DR requirements. They have a 10G pipe across two sites (located right beside each other) and multiple networks within each site. There is no stretched VLAN across the sites, but they intend to add one to simplify DR failover. One site has a perfect layout: networks segregated by function and devices kept within their networks, e.g. storage devices on a dedicated storage VLAN. The other is badly designed with a lot of cross-network functions, which, according to them, were designed that way and they had no choice. However, they also mentioned that there are no firewalls or ACLs governing the routes, and that routing is handled on L3 switches at 10G. Alright, that sounded good enough.
Their main storage on both sites runs on NetApp FAS and V-Series over NFS at 10G. Pretty neat stuff. So I went on to discuss their BC/DR requirements and ran through some of the pointers. One thing to note: I told them to follow some of the best practices reflected in the lessons below.
One glaring point I saw was that, on the second site, the storage sits on a dedicated VLAN while the DR ESX hosts’ storage interfaces sit on another VLAN. This obviously means routing is involved. I raised this with the client’s project architect, and he assured me it was a deliberate move, as they have no control over the network and have done the same on other projects. He also assured me that there are no firewalls, and that communication between VLANs, as well as between sites, was good to go. So I took down the points and flagged it as a design limitation.
The story moved forward with the setup of the SRM servers, the SRA and so on, and it finally came to the point of establishing the array managers. Problems, problems, problems.
I was able to pair the arrays successfully in SRM and enable the pair with one replicated NFS volume specified in the NetApp SRA. However, creating a Protection Group was not possible, as no array pair was shown during creation. After some research and digging through the logs, I found the problem: the NFS datastores were mounted by hostname rather than IP address, and hostname resolution was failing in the environment.
Lesson Learnt 1: Wherever possible, use IP addresses to mount your NFS datastores, especially if hostname resolution is likely to be an issue in your environment.
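This kind of resolution failure can be caught early with a quick check run from the SRM server before you even touch the array managers. A minimal sketch in Python; the hostnames in the example are made-up placeholders for whatever names your NFS datastores are actually mounted with:

```python
import socket

def resolve(hostname):
    """Return the resolved IPv4 address for hostname, or None on failure."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

if __name__ == "__main__":
    # Placeholder names -- substitute the hostnames your NFS
    # datastores are mounted with on the ESX hosts.
    nfs_hosts = ["netapp-siteA.example.local", "netapp-siteB.example.local"]
    for host in nfs_hosts:
        ip = resolve(host)
        print(f"{host}: {ip if ip else 'RESOLUTION FAILED -- consider mounting by IP'}")
```

If any name fails to resolve from the SRM server, remounting that datastore by IP address sidesteps the whole class of problem.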
Moving forward, I succeeded in creating the necessary recovery plans and proceeded to testing. Testing failed with all sorts of weird messages at different times, including “Address of storage not reachable” and “device is neither SAN or NAS type”.
So it was back to the logs again, and since nothing really showed up there, I started to look at the communication paths. I attached a high-level ping-test diagram below and also ran telnet tests on ports 80, 22 and 443 against the storage subsystem within the site from the SRM servers.
Aha! There were intermittent telnet losses to the management IP address of the NetApp storage via the 10G Ethernet ports. They had not enabled the e0M interface, as they thought it would be easier to manage the storage through the access ports instead. I never fully got to the bottom of why there were intermittent telnet drops on ports 80, 22 and 443 from the SRM server to the NetApp storage within the site, but the drops only happened in that direction. I tried the same tests from other VLANs and everything went well. So I asked them to configure the e0M interface on another VLAN so that the SRM server could communicate with it. After that, everything went well.
Lesson Learnt 2: Whenever you are hit with weird errors, always ensure your ports are communicating. Check your firewalls and verify port communications to confirm the baseline is properly met.
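Because the drops here were intermittent, a single telnet test can easily pass and mislead you. A small sketch that probes each port several times so that flaky connectivity shows up as a success count below the attempt count; the management IP in the example is a made-up placeholder:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Attempt one TCP connection; True if the port accepted it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def probe(host, ports=(22, 80, 443), attempts=5):
    """Probe each port several times. Intermittent drops show up as a
    success count somewhere between 0 and the attempt count."""
    return {port: sum(port_open(host, port) for _ in range(attempts))
            for port in ports}

if __name__ == "__main__":
    # Placeholder address -- substitute your array's management IP.
    for port, ok in probe("192.0.2.10").items():
        print(f"port {port}: {ok}/5 attempts succeeded")
```

Anything other than a clean 5/5 (or 0/0 for a port you expect closed) is worth chasing down before blaming SRM or the SRA.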
So all was well, and I asked the project lead to extend the volume to a bigger size (the client wanted to replicate more VMs) while I would come back another day to check again. The volume was extended, and replication seemed to be working fine from the NetApp console. Alright, good to go; time to test-run the recovery plans. I re-configured the Protection Groups and went ahead.
Only to be greeted with the following error:
“SRA command ‘discoverDevices’ failed. SAN or NAS device not found
Ensure that the SAN device is configured and mapped to an igroup of ostype vmware and the NAS device is configured and exported with rw rules”
This was really terrible. According to the project lead, nothing had changed in either the environment or the setup. Back to the logs, where all seemed fine apart from the standard errors, until one of the discovery workflows triggered my curiosity. The discovery workflow typically searches the storage for volumes, lists them, detects SnapMirror volumes and relationships, and reports them. In my case, however, it was able to find the SnapMirrored volumes but could not find their mount points?!
So I went to the storage admin and checked the exports file. True enough, there was an additional mount point / alias added. This is the usual kind of line you would see:
/vol/vol0_name -ro=10.56.17/24,rw=10.56.17.5:10.56.17.6
But in my case I saw this:
/vol/vol0_name /vol/vol0_name_production -ro=10.56.17/24,rw=10.56.17.5:10.56.17.6
I proceeded to delete the unnecessary alias and tested again. Things seemed to work better, but now my NFS datastore was failing at the end (going stale or inactive) when the recovery plan tried to mount my virtual machines. A small consolation was that at least I no longer saw the weird errors above. I tried various troubleshooting methods and more or less concluded that the network was not stable. While the datastore was showing as inactive in ESX, mounting the same NFS datastore from other VLANs was fine: it mounted fast and browsed fast. During the recovery plan test at that stage, however, the NFS datastore would become unresponsive and take a long time to display its contents. The problem only appeared during production hours; during off-peak hours the recovery plan executed perfectly.
Lesson Learnt 3: Never deviate from a product’s standard conventions and configuration parameters unless you know the impact.
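A rough way to catch this class of deviation is to scan the exports file for lines that carry more than one path before the option fields, i.e. an extra alias like the one above. A simplified sketch below, assuming the classic whitespace-separated /etc/exports format (fields starting with “-” are options); it is not a full exports parser:

```python
def find_aliased_exports(exports_text):
    """Return export lines that list more than one path before the
    option fields (fields starting with '-') -- i.e. an extra alias."""
    flagged = []
    for line in exports_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        paths = []
        for field in line.split():
            if field.startswith("-"):
                break  # options begin; stop counting paths
            paths.append(field)
        if len(paths) > 1:
            flagged.append(line)
    return flagged

if __name__ == "__main__":
    sample = (
        "/vol/vol0_name -ro=10.56.17/24,rw=10.56.17.5:10.56.17.6\n"
        "/vol/vol0_name /vol/vol0_name_production -ro=10.56.17/24,rw=10.56.17.5:10.56.17.6\n"
    )
    for bad in find_aliased_exports(sample):
        print("suspicious export:", bad)
```

Running it over the exports file from my case flags exactly the aliased line that confused the SRA’s device discovery.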
After lengthy explanations and arguments, my client relented and reconfigured the ESX VMkernel port used for storage access onto the same VLAN as the NetApp storage. Voila! All problems resolved.
Lesson Learnt 4: Apply best practices when you do not know your environment, especially when you need to do cross-solution integration; study the best practices and be fully aware of how they can benefit or harm you.
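The final fix, keeping the storage VMkernel port in the same VLAN/subnet as the filer so NFS traffic never crosses a router, is easy to sanity-check with Python’s ipaddress module. The addresses below are made-up examples, not the client’s real ones:

```python
import ipaddress

def same_subnet(vmk_ip, vmk_prefix, nfs_ip):
    """True if the NFS server address falls inside the VMkernel port's
    subnet, i.e. no routing hop is needed to reach the storage."""
    network = ipaddress.ip_network(f"{vmk_ip}/{vmk_prefix}", strict=False)
    return ipaddress.ip_address(nfs_ip) in network

if __name__ == "__main__":
    # Made-up addresses: VMkernel port vs. filer NFS interface.
    print(same_subnet("10.56.17.21", 24, "10.56.17.5"))  # same subnet: good
    print(same_subnet("10.56.18.21", 24, "10.56.17.5"))  # routed: avoid for NFS
```

If the check comes back False for a storage VMkernel port, you are routing NFS, and under production load that hop is exactly where my datastores went stale.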
I hope my little write-up today helps someone troubleshoot their SRM implementation in a routed NFS environment. Cheers!