Troubleshooting Azure Site Recovery (ASR) - Data Replication Not Working
@20aman Nov 14, 2016
If you have Azure Site Recovery (ASR) set up in your environment and data replication is stuck, follow this post to troubleshoot. Replication can get stuck either during the initial replication or while delta changes are being replicated, and this can happen for various reasons. We will inspect the various components involved in ASR. Most of the troubleshooting is done on the Management Server, i.e. the on-premises Configuration/Master Target server.
1. Check Alerts Details
Go to the Azure Site Recovery Vault and navigate to the settings. Click on Alerts and Events and look for alerts such as data replication being blocked. Verify that the problem is related to data replication and not something else.
You can also navigate to "Replicated Items" in the ASR Vault settings. On the blade for replicated items, click on the server for which the data is not being replicated. A new blade will open with this server's properties. Then click on Error Details in the context menu of the server properties blade (which you can access by clicking on the ellipsis, i.e. the 3 dots, at the top right).
After verifying the issue, proceed to the next sections to troubleshoot.
2. Check Resource Monitor
Check whether you see any replication activity in the Resource Monitor. This also confirms whether the issue really exists. Low bandwidth, or too many servers configured against one Management Server, can sometimes cause replication to stall. Make sure this is not the case in your environment.
From Task Manager, go to the Performance view and check the bandwidth consumption. Then click on the "Open Resource Monitor" button to launch the Resource Monitor. In the CPU section of the Overview tab, select the following two processes:
- cxps.exe
- cbengine.exe
Then click on the Network tab and check whether any traffic is going out to Azure. If the data transfer is working without issues, you should see entries for cbengine.exe going out to a URL that looks something like "blob.aaa1aaa1aa.core.windows.net", as well as entries for cxps.exe.
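If you prefer a quick scripted check before opening the Resource Monitor, the sketch below only confirms that the two processes are running at all. It does not show network traffic, so it is a first check rather than a replacement for the steps above. It assumes you feed it the CSV output of the Windows tasklist command; the function name is my own and purely illustrative.

```python
import csv
import io

# Process names taken from the list above.
EXPECTED = ("cxps.exe", "cbengine.exe")


def missing_processes(tasklist_csv, expected=EXPECTED):
    """Return the expected process names that do not appear in tasklist CSV output."""
    running = set()
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if row:  # skip blank lines
            # The first CSV column of tasklist output is the image name.
            running.add(row[0].strip().lower())
    return [name for name in expected if name.lower() not in running]
```

On the Management Server you could pass in the text returned by `subprocess.run(["tasklist", "/FO", "CSV", "/NH"], capture_output=True, text=True).stdout`. An empty list means both processes were found.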
3. Check ASR Infrastructure Setup
First of all, check that the ASR infrastructure setup is correct. To view this, navigate to the ASR Vault, go to the settings and click on "Site Recovery Infrastructure". In the next blade, click on the kind of infrastructure you have set up. E.g. if you are replicating VMware or physical machines from on-premises to Azure, click on "Configuration Servers" under the "For VMWare & Physical Machines" section.
Here, check whether the Configuration Server is showing as "Connected". If not, the problem is in the communication between the Configuration Server and Azure. Ensure that you are able to connect to the Azure portal from the Configuration Server. Also, ensure that all the public URLs for Azure are accessible. Check this link for exact URLs: Verify URL Access.
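To test reachability from the Configuration Server without opening a browser for each URL, something like the following sketch may help. It only attempts a direct TCP connection on port 443, so it will not reflect access that goes through a proxy, and the list of host names is something you supply yourself from the Verify URL Access list. The function name and the injectable `connect` parameter are my own choices for illustration.

```python
import socket


def unreachable_hosts(hosts, port=443, timeout=5.0, connect=socket.create_connection):
    """Return (host, reason) pairs for hosts where a TCP connection on `port` fails."""
    failures = []
    for host in hosts:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError as exc:  # covers DNS failures, refusals and timeouts
            failures.append((host, str(exc)))
        else:
            conn.close()
    return failures
```

For example, `unreachable_hosts(["portal.azure.com"])` returns an empty list when the portal host can be reached directly on port 443. Add the remaining host names from the URL list to the argument.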
Next, click on the configuration server. This will open another blade with details for the configuration server. Expand the section for "Associated Servers", marked no. 2 in the screenshot below. Check that all the associated servers, i.e. the Process Server, vCenter Server and Master Target servers, are connected and showing a green tick mark.
Next, check the configuration server health, shown at no. 3 below. Check that all the services are running and showing as healthy. Ensure that you have sufficient free space on the configuration server to send the replication data. If you see any services not running, go to the next section to check and start the services on the on-premises Management Server.
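The free-space check can also be done from a script on the configuration server. This is a minimal sketch using Python's standard library. The threshold `min_free_gb` is a value you have to pick for your own environment, since I am not giving a specific number here.

```python
import shutil


def free_space_gb(path):
    """Return free space, in GiB, on the volume that contains `path`."""
    return shutil.disk_usage(path).free / (1024 ** 3)


def has_enough_space(path, min_free_gb):
    """Return True if the volume holding `path` has at least `min_free_gb` GiB free."""
    return free_space_gb(path) >= min_free_gb
```

For example, `has_enough_space("D:\\", 50)` checks drive D against a 50 GiB threshold. Both the drive and the threshold are placeholders.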
After making any configuration changes on the server, e.g. increasing memory or freeing up disk space, try refreshing it. Click on the "Refresh Server" button, shown at no. 4, at the top of the blade for the configuration server.
4. Checking Services on the Management Server
Check if the services on the Management Server are up and running. You need to check for the below services:
- InMage PushInstall
- InMage Scout Application Service
- InMage Scout VX Agent - Sentinel/Outpost
- INMAGE-AppScheduler
- Microsoft Azure Recovery Services Agent
- Microsoft Azure Site Recovery Service
- cxprocessserver (This is an important service. It is the service for the InMage CX Process Server)
- tmansvc (This is the service for the InMage Volsync Thread Manager Service)
Start any service that is not running and check whether the problem still exists. In my experience, about 90% of the time the problem is related to these services (e.g. a restart or a patch stopped one of them).
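Checking eight services by hand gets tedious, so here is a rough sketch that parses the text output of the Windows `sc query` command and reports which of the listed services are not running. It assumes the usual English output layout (SERVICE_NAME, DISPLAY_NAME and STATE lines) and matches on either the service name or the display name. The function names are my own, and a service that is not found at all is reported as not running.

```python
import re

# Names taken from the list above (a mix of display names and service names).
MANAGEMENT_SERVER_SERVICES = [
    "InMage PushInstall",
    "InMage Scout Application Service",
    "InMage Scout VX Agent - Sentinel/Outpost",
    "INMAGE-AppScheduler",
    "Microsoft Azure Recovery Services Agent",
    "Microsoft Azure Site Recovery Service",
    "cxprocessserver",
    "tmansvc",
]


def parse_sc_query(output):
    """Map lower-cased service and display names to their state, e.g. 'RUNNING'."""
    states = {}
    names = []
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("SERVICE_NAME:"):
            names = [line.split(":", 1)[1].strip().lower()]
        elif line.startswith("DISPLAY_NAME:"):
            names.append(line.split(":", 1)[1].strip().lower())
        elif line.startswith("STATE"):
            # A state line looks like: "STATE              : 4  RUNNING"
            match = re.search(r":\s*\d+\s+(\w+)", line)
            if match:
                for name in names:
                    states[name] = match.group(1)
    return states


def not_running(output, wanted=MANAGEMENT_SERVER_SERVICES):
    """Return the wanted services whose state is not RUNNING (or that are missing)."""
    states = parse_sc_query(output)
    return [name for name in wanted if states.get(name.lower()) != "RUNNING"]
```

The input text can be captured with `subprocess.run(["sc", "query", "type=", "service", "state=", "all"], capture_output=True, text=True).stdout`. The same function works for any other list of service names you pass as `wanted`.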
5. Checking Services on the Server being replicated
Check if the services on the Server being replicated are up and running. You need to check for the below services:
- Azure Site Recovery VSS Provider
- InMage Scout Application Service
- InMage Scout VX Agent - Sentinel/Outpost
Start any service which is not running and check if the problem still exists.
6. Verify Service Account credentials are correct and have required access
Replication can stop if the service account is incorrect or does not have the required access. Check whether the service account's password has expired or been changed.
You can use the Configuration Server config tool to check and update the service accounts. The tool is located in this directory: "D:\Program Files (x86)\Microsoft Azure Site Recovery\home\svsystems\bin", where D is the drive on which ASR is installed. The tool's file name in this directory is "cspsconfigtool.exe".
7. Check Logs
Various ASR logs get generated on the Management Server. The two key logs that you should check are listed below. This assumes that D is the drive where ASR is installed.
- Monitoring Logs - These logs are located at "D:\Program Files (x86)\Microsoft Azure Site Recovery\home\svsystems\var". The file you should check is named "monitor_ps".
- VM-Specific ASR Logs - These logs are located at "D:\Program Files (x86)\Microsoft Azure Site Recovery\home\svsystems". Under this path there is one folder per VM, named with a GUID. Find the folder whose GUID belongs to your VM; the number of disks and the disk sizes are one indication of which folder is which. Once you have located the folder for a VM that has replication problems, navigate into its subfolders, locate the perf.log file for each of the VM's disks, and check it for errors.
These logs should give you an idea as to what may have been causing the issues.
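Since the perf.log files are buried a few folders deep, a small script can collect the suspicious lines for you. The sketch below walks a folder, opens every file named perf.log and returns the lines that contain a keyword. Matching on the word "error" is my assumption, because I am not documenting the exact log format here, so adjust the keyword to what you actually see in your logs.

```python
from pathlib import Path


def find_error_lines(root, filename="perf.log", keyword="error"):
    """Yield (path, line_number, text) for lines containing `keyword` in matching files."""
    for log_path in sorted(Path(root).rglob(filename)):
        # errors="replace" keeps the scan going if a log is not valid UTF-8.
        with open(log_path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if keyword.lower() in line.lower():
                    yield log_path, number, line.rstrip()
```

Point `root` at the GUID folder of the affected VM (or at the svsystems folder to scan every VM) and loop over the results, e.g. `for path, number, text in find_error_lines(vm_folder): print(path, number, text)`.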
In Conclusion
After going through these steps and making any changes, refresh the Configuration Server as described in section 3 above.
Let me know if this blog helped in your scenario.