Lessons learned from Downsizing Mongo 3.6 by removing Shards

Remove 4 out of 12 shards from production sharded mongo cluster without impacting production traffic within reasonable timeline.

  1. ChunkMigration/Draining: Drain the chunks present in the shard to other shards (chunk migration by enabling balancer)
    db.adminCommand( { removeShard : "rs1" } );
  2. MovePrimary : If draining shard is primary for some databases, they should be moved to a different shard. (We choose shards for removal which are not primary for any dbs.)
    db.adminCommand( { movePrimary : "fizz", to: "buzz"} );
  • Starting in MongoDB 4.4 (and 4.2.1), you can have more than one removeShard operation in progress. In MongoDB 4.2.0 and earlier, removeShard returns an error if another removeShard operation is in progress. (ref)
  • Removing one-shard-at-a -time implicates unnecessary chunk movement to unwanted shards (to other to-be removed shards), extending the timeline even more.
  • Another non-trivial point about chunk migration is that there can be at-most n/2 parallel chunks migrating across the cluster. Single node should participate in single chunk migration.

To have a better control over chunk migration and the concurrency, custom tool (in python) is opted.

We have 12 shards (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11).
Out of these we decided to drain (7, 8, 9, 11) as they are not primary for any database.

Each of fromShard (7, 8, 9, 11) is mapped to a couple of toShards as below.

With this approach, we have avoided sharing any shards between any two migrations and achieved as much parallelism as possible (4x). For each fromShard the chunks are load balanced to each of toShards. This would preserve the evenness in the distribution.

Chunk migration script

Issues during Custom Chunk migration :

  1. JumboChunks Issue: (Cannot move chunk Exception)

So, what is a “jumbo chunk”? It is a chunk whose size exceeds the maximum amount specified in the chunk size configuration parameter (which has a default value of 64 MB). When the value is greater than the limit, the balancer won’t move it. — https://www.percona.com/blog/2020/06/11/dealing-with-jumbo-chunks-in-mongodb/
Solution: Split the chunk and migrate.

2. Jumbo chunks that are not splittable: (single value in hash key range).
Solution: Delete and reinsert the documents.

3. Cannot have parallel splits within a collection : (Lock is acquired at collection level for splitting a chunk within the collection.)
Solution: Retry.

4. Due to share removal, expect rise in load on remaining nodes.




Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store