Managing Split Brain in Exchange 2010 DAG with Datacenter Activation Coordination Mode

Wednesday, November 18th, 2009 by

While in my Exchange 2010 Ignite class we came across a new feature called Database Availability Groups (DAG).  A DAG is a great way to provide high availability and redundancy in an Exchange 2010 environment.  DAGs basically replace the Exchange 2007 features known as LCR, CCR, SCR, and SCC.

One consideration for leveraging a DAG is placing mailbox servers in different datacenters and replicating the data over the wire.  This can be accomplished with a DAG, and was accomplished in Exchange 2007 using a geo-distributed CCR setup.  One concern, however, is a split brain occurrence.  Say, for example, you have two datacenters in your organization.  Datacenter A has two nodes of the DAG plus the File Share Witness (FSW), and Datacenter B has two DAG server nodes.  If the primary datacenter (Datacenter A) should happen to lose power and the DAG is activated in Datacenter B, those two servers are now primary.  However, if Datacenter A is restored while the network between the two sites is still down, there is potential for a split brain.  This is because when Datacenter A comes back online it sees the FSW and has three of the five votes for quorum: two from its DAG nodes and one from the FSW.  Datacenter B believes it is in charge and remains active.  Now both datacenters believe they are authoritative for the DAG.
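The vote counting in this scenario can be sketched in a few lines of Python. This is only an illustrative model, not anything Exchange or the cluster service exposes: the function name `has_quorum` and the simple majority rule are assumptions made for the example, with one vote per DAG node and one for the FSW as described above.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Hypothetical helper: quorum means a strict majority of all votes."""
    return reachable_votes > total_votes // 2


# Scenario from the post: four DAG nodes plus one File Share Witness.
TOTAL_VOTES = 5
datacenter_a_votes = 2 + 1  # two DAG nodes plus the FSW
datacenter_b_votes = 2      # two DAG nodes, no witness

a_has_quorum = has_quorum(datacenter_a_votes, TOTAL_VOTES)
b_has_quorum = has_quorum(datacenter_b_votes, TOTAL_VOTES)

print("Datacenter A has a majority:", a_has_quorum)
print("Datacenter B has a majority:", b_has_quorum)
```

With the WAN down, Datacenter A reaches a majority by itself as soon as it powers up, while Datacenter B is already running because it was activated during the outage. Nothing in the vote count alone tells A that B is live, which is the gap DAC is meant to close.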

To remedy this problem, Exchange 2010 introduces a new feature called Datacenter Activation Coordination (DAC).  DAC is used to control the activation behavior of DAG nodes that may be split between multiple datacenters.  Basically, when there is an outage in one datacenter, other members of the DAG come online in another datacenter.  When the offline DAG nodes return to service, they use a protocol called Datacenter Activation Coordination Protocol (DACP) before trying to mount their databases.  DACP is used to determine the current state of the DAG and whether Active Manager should try to mount the databases or not.

Now you may be wondering, what is Active Manager?  Well, Active Manager stores a bit in memory (either a 0 or a 1) that tells the DAG whether it’s allowed to mount local databases that are assigned as active on the server. When a DAG is running in DAC mode (which can be enabled on any DAG with three or more members), each time Active Manager starts up the bit is set to 0, meaning it isn’t allowed to mount databases. Because it’s in DAC mode, the server must try to communicate with all other members of the DAG that it knows about, to get another DAG member to give it an answer as to whether it can mount local databases that are assigned as active to it. The answer comes in the form of the bit setting for other Active Managers in the DAG. If another server responds that its bit is set to 1, it means servers are allowed to mount databases, so the server starting up sets its bit to 1 and mounts its databases.

So what this means is that when you recover from a failure in a datacenter, each returning DAG node starts with a DACP bit value of 0 and must communicate with the other nodes in the DAG that it is aware of before it mounts anything.  If the only nodes it can reach also have a bit of 0 (for example, the other servers in the recovered datacenter while the WAN is still down), nothing mounts.  Once a returning node reaches a DAG member whose bit is set to 1, it sets its own bit to 1 and mounts its databases.
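The DACP check can be illustrated with a small Python model. It is a sketch under stated assumptions and not how Exchange implements Active Manager: the `ActiveManager` class and `try_mount` function are hypothetical names, and the model covers only the single rule from Microsoft's TechNet description of DAC mode, that a starting member may mount once a reachable member reports a bit of 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActiveManager:
    """Hypothetical stand-in for Active Manager on one DAG member."""
    name: str
    dacp_bit: int = 0  # in DAC mode the bit is 0 every time Active Manager starts


def try_mount(starting: ActiveManager, reachable: List[ActiveManager]) -> bool:
    """Set the starting member's bit to 1 only if a reachable member reports 1."""
    if any(member.dacp_bit == 1 for member in reachable):
        starting.dacp_bit = 1
        return True
    return False


# Datacenter B was activated during the outage, so its members hold bit 1.
b1 = ActiveManager("B1", dacp_bit=1)
b2 = ActiveManager("B2", dacp_bit=1)

# Datacenter A powers back on: both Active Managers start with bit 0.
a1 = ActiveManager("A1")
a2 = ActiveManager("A2")

# WAN still down: A1 can reach only A2, whose bit is also 0.
wan_down_result = try_mount(a1, [a2])

# WAN restored: A1 can now reach B1 and B2 as well.
wan_up_result = try_mount(a1, [a2, b1, b2])

print(wan_down_result, wan_up_result, a1.dacp_bit)
```

While the WAN is down, the recovered servers in Datacenter A can only ask each other, and since every answer is 0 none of them mounts a database. That is how the split brain is avoided even though Datacenter A holds a majority of quorum votes.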

Make sense?  I think this is a pretty impressive solution that MS has come up with to prevent split brain in Exchange 2010.  The kicker?  DAC is disabled by default.  Keep in mind that in order to leverage DAC you need a DAG with at least three nodes spread across different datacenters.  I suppose you wouldn’t need this if they are all in the same datacenter and the nodes can communicate with each other. ;)

If you are looking at deploying a DAG across multiple datacenters you will want to enable DAC.  To enable DAC, run the following command, replacing DAGID with the name of your DAG:

Set-DatabaseAvailabilityGroup -Identity DAGID -DatacenterActivationMode DagOnly

For more information on the ‘Set-DatabaseAvailabilityGroup’ you can go here.

EDIT: This feature will be updated in Exchange 2010 SP1.  For more information please read my article “Datacenter Activation Coordinator Changes in Exchange 2010 SP1!”

8 Responses to “Managing Split Brain in Exchange 2010 DAG with Datacenter Activation Coordination Mode”

  1. Craig says:

    Nice blog, looking into a multi-site DAG at the moment so find this rather handy!

  2. [...] Create and configure the Database Availability Group (DAG) You Had Me At EHLO… : Video: High Availability in Exchange Server 2010 – Part 4 High Availability and Storage in Exchange 2010 Deploying High Availability and Site Resilience: Exchange 2010 Help Managing Database Availability Groups: Exchange 2010 Help High Availability Cmdlets: Exchange 2010 Help Managing Split Brain in Exchange 2010 DAG with Datacenter Activation Coordination Mode – Scott… [...]

  3. G says:

    I agree with Craig, a very nice blog. I’m in the process of designing our Exchange 2010 infrastructure across the org. We have several sites across the world and plan to be implementing a multi-site DAG.

    Thanks for sharing

    G

  4. Rob says:

    I think you have an error in your text. I know the article is a little dated, but in case others are looking for information on it.

    You have stated:
    “If another server responds that its bit is set to 0, it means servers are allowed to mount databases, so the server starting up sets its bit to 1 and mounts its databases.”

    The Technet (http://technet.microsoft.com/en-us/library/dd979790.aspx) article states:
    If another server responds that its bit is set to 1, it means servers are allowed to mount databases, so the server starting up sets its bit to 1 and mounts its databases.

    I believe the Technet article has it correct as it would logically make more sense.

    Thanks!

    Rob

  5. Scott says:

    Rob, I will look into this; however, looking at the article, when the server starts up and the bit is set to 0 it means it cannot mount the database, 0 in this case meaning no.

    Later in the article it states that if another server responds with a “1” then it is allowed to mount the database and set its bit to 1, which doesn’t make sense considering that, from my understanding, the 1 means it is mounted.

    I’m going to email a buddy of mine and see how he translates this.

    EDIT:

    Rob, I believe there is a typo. 0 means database not mounted, which is why each database will start with its bit at 0 until it can determine if it can mount. It will perform a lookup in the environment trying to find out if that database is mounted anywhere else. If all servers return a 0 then the requesting AM can set its bit to 1 and mount the database.

  6. Mark says:

    Hi – well written article. I am curious as to how this works with an ACTIVE/ACTIVE scenario and an equal number of servers.
    ie,
    4,000 users at Site A spread across 2 servers, with 1A/1P at Site A and 1P at SiteB.
    4,000 users at Site B spread across 2 servers, with 1A/1P at Site B and 1P at SiteA.

    If the WAN drops between the two sites – what happens?
    Mark

  7. Scott says:

    Hi Mark,

    This is a design question. :) Given your current configuration you have to look at the number of votes in your DAG. Based on your design you have two servers in the primary site and one server in a secondary site. This tells me you have three votes in your DAG to maintain quorum. If the WAN drops, the primary site will remain online while the secondary site will go offline. Users at the secondary site will then no longer have access to mail.

    If you are in a DR situation then you would need to perform a site switchover by stopping the DAG and evicting the two servers at the down site from the DAG.

    Let me know if this answers your question.

    Thanks.
