Streaming MongoExport to Blob Store (S3)

Saikumar Chintada · Feb 22, 2022

Use case: Take a one-time mongoexport of a 1 TB sharded dataset from a MongoDB cluster and upload it to Azure Blob Storage / S3.

Challenges:

  • CPU: Heavy CPU consumption
  • Memory: Moderate
  • Disk: Heavy IOPS consumption; requires provisioning a large disk

Solutions:

  1. The simplest solution is to take the mongoexport dump onto disk and then upload it to the blob store.
    Problem: Needs more than 1 TB of disk space and far more IOPS. Slow.
    Improvement: Parallelise the mongoexport process per shard (a driver sketch follows below).
  2. A better approach is to stream the output of mongoexport (from each shard) to the blob store, one stream per shard:
    mongoexport --host=$ip --db=$db --collection=$collection | python tap.py $db.$collection.$shard $location
tap.py: a Python script that reads the stdin stream and writes it to the S3 blob without consuming a significant amount of disk or memory (a sketch follows below). Library used: https://github.com/RaRe-Technologies/smart_open
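
A minimal sketch of what tap.py could look like; this is not the author's original script. It assumes smart_open is installed with its S3 extras (pip install smart_open[s3]) and that boto3 credentials are configured; the 64 MB chunk size and the .json key suffix are illustrative choices.

    # tap.py -- reads the mongoexport stream from stdin and streams it to S3.
    # smart_open performs a multipart upload, so nothing is buffered on disk.
    # Usage: mongoexport ... | python tap.py $db.$collection.$shard $location
    import sys

    from smart_open import open as s_open

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB read buffer (assumed; tune as needed)

    def main() -> None:
        name, location = sys.argv[1], sys.argv[2]  # e.g. mydb.users.shard0  s3://bucket/exports
        with s_open(f"{location}/{name}.json", "wb") as sink:  # parts are flushed to S3 as they fill
            while True:
                chunk = sys.stdin.buffer.read(CHUNK_SIZE)
                if not chunk:  # EOF: mongoexport finished
                    break
                sink.write(chunk)

    if __name__ == "__main__":
        main()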
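
To run one such pipeline per shard in parallel (the improvement from step 1), a small driver can spawn one mongoexport | tap.py pair per shard and wait for all of them. A sketch, assuming each shard's mongod is reachable directly; the host, database, collection, and bucket names are hypothetical.

    # run_exports.py -- hypothetical driver: one mongoexport | tap.py pipeline per shard.
    import subprocess

    DB, COLLECTION = "mydb", "users"                        # assumed names
    LOCATION = "s3://my-bucket/exports"                     # assumed destination
    SHARDS = {"shard0": "10.0.0.1", "shard1": "10.0.0.2"}   # shard -> host, assumed

    pipelines = []
    for shard, ip in SHARDS.items():
        export = subprocess.Popen(
            ["mongoexport", f"--host={ip}", f"--db={DB}", f"--collection={COLLECTION}"],
            stdout=subprocess.PIPE,
        )
        tap = subprocess.Popen(
            ["python", "tap.py", f"{DB}.{COLLECTION}.{shard}", LOCATION],
            stdin=export.stdout,
        )
        export.stdout.close()  # so tap.py sees EOF when mongoexport exits
        pipelines.append((export, tap))

    for export, tap in pipelines:  # wait for every shard's pipeline to finish
        export.wait()
        tap.wait()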

Conclusion:

I hit an issue with this approach: the CPU was not sufficient, and the complete dataset was not ingested into the blob store. Over-provisioning the instance fixed it.

With 12 shards, and each mongoexport (one per shard) spawning 4 processes, a total of 48 processes were spawned. So I used a 48-core machine to stream the 1 TB collection from MongoDB to the blob store.

Instance used: 48 cores, 96 GB RAM
CPU utilised: 30%–33%
RAM utilised: 7%–10%
Total time: 4–5 hrs
