How to achieve fast MongoDB upserts when the upsert field is unique?

Saikumar Chintada

Lately I was trying to ingest millions of documents into MongoDB with an upsert pattern. The dataset has a structure like the one below.

{_id : "<custom unique value>", some_other_field : "<string>"}

One of the key points of this dataset is that each _id appears exactly once during the ingestion. Since the dataset is a jsonlines file (newline-separated JSON objects), the immediate candidate for the ingestion was the mongoimport tool.

The MongoDB (3.6) setup was a sharded cluster with 3 secondaries per shard. But the ingestion seemed to be taking longer than expected.

So, after googling for a few minutes, I ran the following test:

//data set for upsert using mongoimport (1M documents)
{"_id": "0", "data": {"language": "english"}}
{"_id": "1", "data": {"language": "english"}}
{"_id": "2", "data": {"language": "english"}}
{"_id": "3", "data": {"language": "english"}}
.
.
.
mongoimport -d test -c testmongoupsert --type json --file test.json --mode upsert
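For reference, a test.json file in the shape shown above can be generated with a short script like this (a minimal sketch using only the standard library; the field values mirror the sample documents):

```python
import json

def write_dataset(path, n):
    """Write n newline-delimited JSON documents, each with a unique string _id."""
    with open(path, "w") as f:
        for i in range(n):
            doc = {"_id": str(i), "data": {"language": "english"}}
            f.write(json.dumps(doc) + "\n")

write_dataset("test.json", 1_000_000)
```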

Below is a Python script that uses the bulk_write API with unordered upserts, as provided by the PyMongo driver:

# Upsert 1M documents
from pymongo import UpdateOne, MongoClient

client = MongoClient()  # or some remote sharded cluster
collection = client.test.test
operations = []
for i in range(1_000_000):
    operations.append(
        UpdateOne({"_id": str(i)}, {"$set": {"data": {"language": "english"}}}, upsert=True)
    )
    # Send a batch once every 1000 operations
    if len(operations) == 1000:
        collection.bulk_write(operations, ordered=False)
        operations = []
# Flush any remaining operations
if len(operations) > 0:
    collection.bulk_write(operations, ordered=False)

Result:

mongoimport took 30 seconds to upsert just 1% of the data.
The Python script took 45 seconds to 1 minute for the entire dataset.

Digging into why mongoimport performs poorly, I found that it does not support unordered upserts, which makes sense in cases where the order of upserts matters. Since the upserts are executed sequentially, we lose parallelism.

https://github.com/mongodb/mongo-tools/blob/master/mongoimport/mongoimport.go#L238

In my use case, _id is the upsertField and occurs once per ingestion, so the order of upserts doesn't matter, and this constraint of mongoimport becomes a bottleneck. The option we are left with is to use the bulk_write API in supported MongoDB drivers with the ordered parameter set to false, achieving fast unordered upserts with a degree of parallelism.
