MongoDB DR Strategies - Part 1

This article discusses Disaster Recovery (DR) strategies for a MongoDB cluster, focusing on data replication and on making sure the DR servers can take over during a disaster. It evaluates several deployment models and their behaviour in failure scenarios, and gives an overview of how to set them up, handle failovers, and keep DR operations smooth.

Approaches we are exploring and testing as part of this POC:

  1. MongoDB multi-DC replica set members - add the DR servers as replica set members of the current production cluster. (Part 1)

  2. MongoShake - an open-source MongoDB sync tool developed by Alibaba Cloud. (Part 2)

  3. Mongosync - a tool developed by MongoDB for syncing data between different MongoDB clusters or databases. It is particularly useful for real-time replication, data migration, and disaster recovery scenarios. (Part 2)

What does a typical MongoDB replica set cluster look like?

For maintaining redundancy and high availability, MongoDB is set up as a 3-node replica set: 1 PRIMARY and 2 SECONDARY nodes. The primary receives all write operations, and the secondaries replicate operations from the primary to maintain an identical data set. All members of the replica set can accept read operations. However, by default, an application directs its read operations to the primary member. A secondary can become a primary if the current primary becomes unavailable. In such cases, the replica set holds an election to choose which of the secondaries becomes the new primary.

Please refer to this page to understand more about how secondaries work - https://docs.mongodb.com/manual/core/replica-set-secondary/

Let's refresh a few essential facts about a MongoDB cluster before we continue the discussion.

Oplog - just like the WAL in PostgreSQL, it keeps a rolling record of all operations that modify the data stored in MongoDB.
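
To make the "rolling" part concrete, here is a toy Python model. It is not MongoDB code: the `ToyOplog` class and its methods are made up for illustration, and the log is sized by entry count, whereas the real oplog is a capped collection sized in bytes. The sketch only shows why a member that falls too far behind can no longer resume from the log.

```python
from collections import deque


class ToyOplog:
    """Toy stand-in for a rolling operation log (hypothetical, not a MongoDB API)."""

    def __init__(self, max_entries):
        # deque(maxlen=...) silently drops the oldest entry once it is full,
        # which mimics the rolling behaviour of a capped log.
        self.entries = deque(maxlen=max_entries)
        self.next_ts = 1

    def append(self, op):
        self.entries.append((self.next_ts, op))
        self.next_ts += 1

    def can_resume_from(self, last_applied_ts):
        """A follower can resume only if its last applied entry is still in the log."""
        if not self.entries:
            return True
        oldest_ts = self.entries[0][0]
        return last_applied_ts >= oldest_ts


oplog = ToyOplog(max_entries=3)
for op in ["insert a", "update a", "insert b", "delete a", "insert c"]:
    oplog.append(op)

held = [ts for ts, _ in oplog.entries]
print(held)                      # timestamps still held in the log
print(oplog.can_resume_from(3))  # follower that applied up to ts=3
print(oplog.can_resume_from(1))  # follower that applied only ts=1
```

In this toy run the follower stuck at ts=1 has fallen off the end of the log, which is the situation where a real secondary would need a full resync.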

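Types of nodes in a MongoDB replica set cluster

A replica set can have a maximum of 50 members [max 7 voting members].

  1. Arbiter nodes - hold no data; they exist only to vote.

  2. Priority 0 nodes [priority range is 0-1000] - can never become primary, but applications can still read from them.

  3. Hidden nodes - never become primary (priority 0) and are invisible to applications; used for backup jobs and reporting.

  4. Delayed nodes - priority 0, can still vote; they hold a deliberately delayed copy of the data, useful as a backup.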
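
The member types above map onto fields of the replica set configuration document. The sketch below writes such a member list as plain Python dicts and checks it against the limits listed here. The hostnames and the `validate_members` helper are hypothetical; the field names (`priority`, `hidden`, `arbiterOnly`, `votes`, `secondaryDelaySecs`) are the ones used by the MongoDB configuration document, with `secondaryDelaySecs` assuming MongoDB 5.0 or later (older versions call it `slaveDelay`).

```python
def validate_members(members):
    """Return a list of rule violations for a replica set member list."""
    problems = []
    if len(members) > 50:
        problems.append("more than 50 members")
    # Members vote unless "votes" is explicitly set to 0.
    voting = [m for m in members if m.get("votes", 1) > 0]
    if len(voting) > 7:
        problems.append("more than 7 voting members")
    for m in members:
        priority = m.get("priority", 1)
        if m.get("hidden", False) and priority != 0:
            problems.append(f"{m['host']}: hidden member must have priority 0")
        if m.get("secondaryDelaySecs", 0) > 0 and priority != 0:
            problems.append(f"{m['host']}: delayed member must have priority 0")
    return problems


# Hypothetical hosts, one entry per member type discussed above.
members = [
    {"_id": 0, "host": "prod-1:27017", "priority": 1},
    {"_id": 1, "host": "prod-2:27017", "priority": 1},
    {"_id": 2, "host": "prod-3:27017", "priority": 1},
    {"_id": 3, "host": "dr-1:27017", "priority": 0},                  # priority 0 node
    {"_id": 4, "host": "dr-2:27017", "priority": 0, "hidden": True},  # hidden node
    {"_id": 5, "host": "dr-3:27017", "priority": 0, "hidden": True,
     "secondaryDelaySecs": 3600},                                     # delayed node (1 hour)
    {"_id": 6, "host": "arb-1:27017", "priority": 0, "arbiterOnly": True},  # arbiter
]

print(validate_members(members))

bad = [{"_id": 0, "host": "dr-9:27017", "hidden": True, "priority": 1}]
print(validate_members(bad))
```

The first call reports no problems; the second flags the hidden member that was left with a non-zero priority.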

Important configurations

Some important facts about a MongoDB replica set

  1. Maximum voting members in one replica set - 7 [maximum members - 50]

  2. If the primary node cannot reach a majority of the cluster's voting nodes, it treats itself as network-partitioned, steps down, and becomes a SECONDARY.

  3. The old primary will not become primary again until the next failover; there is no automatic failback function.

  4. By default, both reads and writes go to the primary only.

  5. You cannot have multiple masters. MongoDB provides master-slave style replication only, where all writes occur on one host and are replicated to the other, read-only secondary hosts.

  6. Fault tolerance table - https://docs.mongodb.com/manual/core/replica-set-architectures/
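
The fault tolerance table linked above follows from simple majority arithmetic, which the rest of this article relies on. A minimal sketch, with function names of my own choosing:

```python
def majority(voting_members):
    """Votes needed to elect (or remain) a primary."""
    return voting_members // 2 + 1


def fault_tolerance(voting_members):
    """How many voting members can be lost while a majority still remains."""
    return voting_members - majority(voting_members)


print("members  majority  fault tolerance")
for n in range(3, 8):
    print(f"{n:^7}  {majority(n):^8}  {fault_tolerance(n):^15}")
```

Note that going from 3 to 4 members, or from 5 to 6, raises the majority without raising the fault tolerance, which is why odd numbers of voting members are usually preferred.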

Approach 1)

MongoDB Multi-DC Replica Set Members

In this method we add the DR MongoDB servers as replica set members of our Production replica set cluster; data replication and automatic failover are taken care of by the MongoDB replica set itself. Below are the different deployment models and our observations and findings in different failure scenarios.

This can be achieved with multiple patterns. Let's try each one and understand how it behaves in case of outages:

Pattern-1)

3 nodes in Production & 2 nodes in DR [2 Secondary]

  • Scenario 1 : Any one of the secondary nodes goes down in DR infra/Prod infra - No impact; the primary will still serve traffic.

  • Scenario 2 : Production master goes down - After waiting about 10s, an election starts and one of the secondary nodes in the Production environment is selected as master.
    (Why not elect from the DR nodes? All DR nodes are given priority 0.5, and the election algorithm prefers higher-priority nodes, i.e. it will choose one of the nodes from the Production infra [priority 1]. This gives us an Active-Passive kind of setup, where a DR node only becomes primary if all Production nodes go down.)

  • Scenario 3 : Production DC goes down - The DR nodes cannot start an election because they cannot achieve quorum (2 of 5 votes) --> [Downtime], i.e. no writable node is available. We can, however, reconfigure the replica set and make one node the master. [Manual failover required]

  • Scenario 4 : DR DC goes down - No impact; the Production servers hold the majority of nodes, so they can maintain quorum.
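
The Pattern-1 scenarios can be sketched with a deliberately simplified election model: a primary is possible only if a majority of members is up, and among those the highest-priority member wins. The hostnames and the `expected_primary` helper are hypothetical, every member is assumed to hold one vote, and real MongoDB elections also take into account how up to date each member is, so treat this as an illustration of the reasoning rather than of the actual algorithm.

```python
# Pattern-1 layout: 3 Production members (priority 1), 2 DR members (priority 0.5).
PATTERN_1 = [
    {"host": "prod-1", "dc": "prod", "priority": 1},
    {"host": "prod-2", "dc": "prod", "priority": 1},
    {"host": "prod-3", "dc": "prod", "priority": 1},
    {"host": "dr-1", "dc": "dr", "priority": 0.5},
    {"host": "dr-2", "dc": "dr", "priority": 0.5},
]


def expected_primary(members, down_hosts):
    """Return the host expected to be primary, or None if there is no quorum."""
    up = [m for m in members if m["host"] not in down_hosts]
    needed = len(members) // 2 + 1
    if len(up) < needed:
        return None  # no majority, so nobody can be elected
    candidates = [m for m in up if m["priority"] > 0]
    if not candidates:
        return None
    # Simplification: the highest-priority reachable member wins.
    return max(candidates, key=lambda m: m["priority"])["host"]


prod_hosts = {m["host"] for m in PATTERN_1 if m["dc"] == "prod"}
dr_hosts = {m["host"] for m in PATTERN_1 if m["dc"] == "dr"}

print("Scenario 2:", expected_primary(PATTERN_1, {"prod-1"}))   # Production master down
print("Scenario 3:", expected_primary(PATTERN_1, prod_hosts))   # Production DC down
print("Scenario 4:", expected_primary(PATTERN_1, dr_hosts))     # DR DC down
```

Under these assumptions the model reproduces the observations above: a Production secondary takes over in Scenario 2, nobody can be elected in Scenario 3, and Scenario 4 leaves the Production primary in place.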

Pattern-2)

3 nodes in Production & 3 nodes in DR + 1 Arbiter in a 3rd DC

  • Scenario 1 : Any one of the secondary nodes goes down in DR infra/Prod infra - No impact; the primary will still serve traffic.

  • Scenario 2 : Production master goes down.
    An election is triggered and one of the secondary nodes in the Production environment is selected as master.

  • Scenario 3 : Production DC goes down.
    The majority of the nodes is still available, i.e. 3 secondaries in the DR infra + 1 arbiter in the 3rd DC (4 of 7 votes), so they start an election and select one server as master.

  • Scenario 4 : Production is back up and live.
    In our tests a new election did not happen automatically, so a manual failover is needed: trigger an election by stopping the master node in DR. [The election will always select the highest-priority nodes, which are in the Prod infra.] (** Health check and automatic failback can be achieved using scripts.)

  • Scenario 5 : Network connectivity issue / tunnel down [Prod DC --> DR] / DR DC goes down - The Prod master is able to maintain quorum with the help of the arbiter in the 3rd DC, so there is no impact here.
    [No data loss] [No manual failover required] [No downtime] [A DR DC failure has no impact on Production]
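
Scenario 4 mentions scripting the health check and failback. Below is a minimal sketch of just the decision step, under stated assumptions: the input is a dict shaped like the output of the `replSetGetStatus` command (using its `name`, `health`, `stateStr`, and `optimeDate` member fields), while the hostnames, the 10 second lag threshold, and the `should_fail_back` helper are hypothetical. MongoDB may also attempt a priority takeover on its own once a higher-priority member has caught up, so a check like this is best treated as a safety net.

```python
from datetime import datetime, timedelta


def should_fail_back(status, prod_hosts, max_lag=timedelta(seconds=10)):
    """Return True if a DR primary can reasonably be asked to step down."""
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None or primary["name"] in prod_hosts:
        return False  # no primary at all, or Production already holds it
    prod = [m for m in members if m["name"] in prod_hosts]
    if len(prod) < len(prod_hosts):
        return False  # some Production member is missing from the status
    for m in prod:
        if m["health"] != 1 or m["stateStr"] != "SECONDARY":
            return False  # Production is not fully healthy yet
        if primary["optimeDate"] - m["optimeDate"] > max_lag:
            return False  # Production has not caught up with the DR primary
    return True


# Hypothetical status snapshot taken after Production came back.
now = datetime(2024, 1, 1, 12, 0, 0)
PROD_HOSTS = {"prod-1:27017", "prod-2:27017", "prod-3:27017"}
sample_status = {
    "members": [
        {"name": "prod-1:27017", "health": 1, "stateStr": "SECONDARY", "optimeDate": now},
        {"name": "prod-2:27017", "health": 1, "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=2)},
        {"name": "prod-3:27017", "health": 1, "stateStr": "SECONDARY", "optimeDate": now},
        {"name": "dr-1:27017", "health": 1, "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "dr-2:27017", "health": 1, "stateStr": "SECONDARY", "optimeDate": now},
        {"name": "dr-3:27017", "health": 1, "stateStr": "SECONDARY", "optimeDate": now},
        {"name": "arb-1:27017", "health": 1, "stateStr": "ARBITER"},
    ]
}

print(should_fail_back(sample_status, PROD_HOSTS))
```

When the function returns True, an operator or a scheduled job could run `rs.stepDown()` on the DR primary so that the higher-priority Production members win the next election.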

Pattern-3)

3 nodes in Production [1 PRI & 2 SEC] & 3 nodes in DR [3 Secondary nodes]

  • Scenario 1 : Any one of the secondary nodes goes down in DR infra/Prod infra - No impact; the primary will still serve traffic.

  • Scenario 2 : Production master goes down.
    After waiting 10s (heartbeat timeout), an election starts and one of the secondary nodes in the Production environment is selected as master.

  • Scenario 3 : Production DC goes down - The DR nodes cannot start an election because they cannot achieve quorum (3 of 6 votes, 4 needed) --> [Downtime], i.e. no writable node is available. The DR DC servers can still serve read requests but cannot become master for write requests. What next? We need to manually reconfigure the replica set and make one node primary.

Pattern-4)

3 nodes in Production & 4 nodes in DR [3 Secondary + 1 Arbiter node]

  1. Add the 3 DR nodes as secondaries with the lowest election priority [0.5 recommended], plus 1 arbiter node.

  2. Oplog replication from Prod to DR should go secondary to secondary, to reduce the workload on the master [this option still needs to be verified***].

  3. If the current primary cannot see a majority of voting members, it will step down and become a secondary.

  • Scenario 1 : Any one of the secondary nodes goes down in DR infra/Prod infra - No impact; the primary will still serve traffic.

  • Scenario 2 : Production master goes down.
    An election is triggered and one of the secondary nodes in the Production environment is selected as master.

  • Scenario 3 : Production DC goes down - The majority of the nodes is in the DR infra, i.e. 3 secondaries + 1 arbiter (4 of 7 votes), so they start an election and select one server as master.

  • Scenario 4 : Production is back up and live - A new election does not happen automatically, so a manual failover is needed: trigger an election by stopping the master node in DR. [The election will always select the highest-priority nodes, which are in the Prod infra.]
    (** Health check and automatic failback can be achieved using scripts, or done through the monit project.)

  • Scenario 5 : Network connectivity issue / tunnel down / DR site goes down
    The Prod master cannot maintain quorum (3 of 7 votes), so it steps down and becomes SECONDARY. If only the link is down, the DR nodes hold the majority of voting members and elect a master among themselves; if the DR site itself is down, no primary exists at all. Either way --> [Downtime], i.e. no writable node is available in Production. [A DR DC failure causes downtime in Production]
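
The four patterns can be compared side by side with the same majority arithmetic. The sketch below only counts votes per data centre (assuming one vote per member, arbiters included) and asks whether the remaining members still form a majority when a whole DC is lost; the `survives_loss_of` helper and the DC labels are my own.

```python
# Votes per data centre for each pattern discussed above.
PATTERNS = {
    "Pattern-1": {"prod": 3, "dr": 2},
    "Pattern-2": {"prod": 3, "dr": 3, "third-dc": 1},  # arbiter in a 3rd DC
    "Pattern-3": {"prod": 3, "dr": 3},
    "Pattern-4": {"prod": 3, "dr": 4},                 # arbiter inside the DR DC
}


def survives_loss_of(votes_per_dc, lost_dc):
    """True if the remaining DCs still hold a majority of all votes."""
    total = sum(votes_per_dc.values())
    remaining = total - votes_per_dc[lost_dc]
    return remaining >= total // 2 + 1


results = {
    name: (survives_loss_of(votes, "prod"), survives_loss_of(votes, "dr"))
    for name, votes in PATTERNS.items()
}

print("pattern    prod DC lost  DR DC lost")
for name, (prod_lost, dr_lost) in results.items():
    print(f"{name}  {'quorum' if prod_lost else 'NO quorum':<12}  "
          f"{'quorum' if dr_lost else 'NO quorum'}")
```

By this arithmetic only Pattern-2, with the arbiter in a third DC, keeps a quorum whichever of the two main sites is lost. It also suggests that Pattern-3 loses quorum when the DR DC goes down (3 of 6 votes), a case not listed among the scenarios above.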

Steps for re-configuring a replica set when only a minority of members are accessible - https://docs.mongodb.com/manual/tutorial/reconfigure-replica-set-with-unavailable-members/ [Scenario 3 & 1]
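
The linked procedure boils down to taking the current configuration, keeping only the members that are still reachable, and forcing that configuration onto a surviving member. Here is a sketch of the "keep only the survivors" step as a pure Python function. The replica set name, hostnames, version number, and the `surviving_config` helper are hypothetical, and the snippet does not talk to MongoDB: the resulting document would still have to be applied on a surviving node, for example with `rs.reconfig(cfg, {force: true})` in the shell.

```python
import copy


def surviving_config(current_config, surviving_hosts):
    """Return a new config that keeps only the members still reachable."""
    new_config = copy.deepcopy(current_config)  # never mutate the input
    new_config["members"] = [
        m for m in new_config["members"] if m["host"] in surviving_hosts
    ]
    # Assumption: bump the version by one; a forced reconfig may raise it further.
    new_config["version"] = current_config["version"] + 1
    return new_config


# Hypothetical Pattern-1 configuration, as it might look before the outage.
current = {
    "_id": "rs0",
    "version": 5,
    "members": [
        {"_id": 0, "host": "prod-1:27017", "priority": 1},
        {"_id": 1, "host": "prod-2:27017", "priority": 1},
        {"_id": 2, "host": "prod-3:27017", "priority": 1},
        {"_id": 3, "host": "dr-1:27017", "priority": 0.5},
        {"_id": 4, "host": "dr-2:27017", "priority": 0.5},
    ],
}

# Production DC is down, only the DR members are reachable.
forced = surviving_config(current, {"dr-1:27017", "dr-2:27017"})
print([m["host"] for m in forced["members"]], forced["version"])
```

Keeping the original `_id` values of the surviving members matters, since the forced configuration is applied to nodes that already know each other under those ids.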

Final Notes:

Since we are adding the DR servers as replica set members of the primary cluster, connectivity failures and network partitions between the sites can cause downtime in the Production DC. This happened in most of the scenarios and patterns tested above, so using replica set members for DR does not look like the best approach.

Please check the Part 2 article, which explores the other options (I'll add a link here once it's live):
1. MongoShake
2. Mongosync

Thanks for reading:
📚 Stay in the know! Subscribe to the blog for the latest insights on tech trends, Site Reliability Engineering, DevOps and Cloud strategies, and more. 📚 #StayTuned
- Site Reliability.in

Did you find this article valuable?

Support Midhun K by becoming a sponsor. Any amount is appreciated!