Recently, a customer of ours experienced an issue while moving mailboxes to Office 365 in a Hybrid Configuration. After the move request was generated, they would end up with a status “StalledDueToMailboxLock” after a short while. The move request would then stay in that state for an indefinite amount of time.
Although I wasn’t able to test everything myself, first hand, a certain mailbox move was reported to proceed after been in the “StalledDueToMailboxLock” for about 2 days. Obviously, this isn’t the pace you’d want your mailboxes to be moved to Office 365 and some further investigation is needed.
Before any “deep” troubleshooting was done, we checked if there wasn’t something that could cause a mailbox lock, maybe a process that was blocking access to the mailbox like a misconfigured Anti-Virus. However, we soon found out that everything seemed normal…
Whenever a move requests fails or seems to be hanging in a specific state for a while, it’s always a good idea to take a look at the Move Request Statistics using “Get-MoveRequestStatistics”:
Get-MoveRequest <Identity> | Get-MoveRequestStatistics –IncludeReport | fl
The command above will immediately output the results on-screen. However, for requests that have been running for a while, the amount of information from the command is too much to handle through the console. As an alternative, you can either output it to e.g. a text file using Out-File or – and in my opinion this is much easier to handle – export it to an XML file using the Export-Clixml command:
Get-MoveRequest <Identity> | Get-MoveRequestStatistics –IncludeReport | Export-Clixml c:\moverequeststats.xml
Now, you can open that XML file in a more specialized XML-reader (or just Internet Explorer if you want) and go through its content.
In this particular case, the problem was caused by Exchange Web Services. Somewhere in the XML file (and multiple times), you’d find the following reference:
<S N=”Message”>Communication with remote service ‘https://autodiscover.company.be/EWS/mrsproxy.svc CAS1.fqdn (126.96.36.199 caps:05FFFF)’ has failed. –> The call to ‘https://autodiscover.company.be/EWS/mrsproxy.svc CAS1.fqdn (188.8.131.52 caps:05FFFF)’ failed. Error details: The remote endpoint no longer recognizes this sequence. This is most likely due to an abort on the remote endpoint. The value of wsrm:Identifier is not a known Sequence identifier. The reliable session was faulted..
Similarly, we also found the exact same reference, but for another CAS:
<S N=”Message”>Communication with remote service ‘https://autodiscover.company.be/EWS/mrsproxy.svc CAS2.fqdn (184.108.40.206 caps:05FFFF)’ has failed. –> The call to ‘https://autodiscover.company.be/EWS/mrsproxy.svc CAS2.fqdn (220.127.116.11 caps:05FFFF)’ failed. Error details: The remote endpoint no longer recognizes this sequence. This is most likely due to an abort on the remote endpoint. The value of wsrm:Identifier is not a known Sequence identifier. The reliable session was faulted..
This points out that the move request was trying to connect over different Client Access Servers. However, when keeping in mind the load balancing requirements for Exchange 2010, it’s clearly stated that using EWS without affinity is not supported:
Exchange Web Services Only a subset of Exchange Web Services requires affinity. Availability Service requests don’t require affinity, but subscriptions do. All aspects of Exchange Web Services experience performance enhancements from affinity. An affinity timeout value of 45 minutes is recommended for Exchange Web Services clients to ensure that periodic polling for events associated with a subscription are not directed to a new Client Access server resulting in inefficient new subscriptions for each request. We don’t support the use of Exchange Web Services without affinity.
Edit note: as you might’ve guessed, this problem happened because on-premises an Exchange 2010 hybrid server was used. If that would’ve been an Exchange 2013 environment, this wouldn’t have happened as there’s no need to keep affinity for EWS.
So, because of this, we took another look at the load balancer configuration. To confirm whether or not it was the load balancer causing the issue, we temporarily bypassed them and connected directly to one of the CAS servers after which we found out that the move requests processed successfully.
The efforts now turned back to the Load Balancer and it was found out that the affinity for EWS, which was configured in the first place, wasn’t applied correctly. After correcting the settings in the Load Balancers everything worked as expected.
I would like to thank William Rall (MSFT) for his efforts and time. He quickly pointed out the root cause, which helped us determine what caused the problem. His feedback also allowed me to dive deeper into the XML and find what I was looking for. Admitted: diving into the XML without actually knowing what to look for, isn’t always easy!