How to achieve fast mongo upserts when the upsert field is unique?

Saikumar Chintada
Aug 30, 2020 · 3 min read

Lately I was trying to ingest millions of documents into mongo with an upsert pattern. The dataset has a structure like the one below.

{_id : "<custom unique value>", some_other_field : "<string>"}

One of the key points of this dataset was that each _id appears only once during the ingestion. Since the dataset is a jsonlines file (newline-separated JSON objects), the immediate candidate for the ingestion was the mongoimport tool.

The mongo (3.6) setup was a sharded cluster with 3 secondaries per shard, but the ingestion was taking longer than expected.

So, after googling for a few minutes, I ran the following test:

//data set for upsert using mongoimport (1M documents)
{"_id": "0", "data": {"language": "english"}}
{"_id": "1", "data": {"language": "english"}}
{"_id": "2", "data": {"language": "english"}}
{"_id": "3", "data": {"language": "english"}}
.
.
.
mongoimport -d test -c testmongoupsert --type json --file test.json --mode upsert
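
For completeness, a 1M-line test.json file in the shape shown above can be generated with a few lines of Python. The filename and document shape match the snippet above; everything else here is just a convenience sketch, not part of the original test.

# generate test.json: 1M newline-delimited JSON documents
# matching the shape used in the mongoimport test above
import json

with open("test.json", "w") as f:
    for i in range(1_000_000):
        f.write(json.dumps({"_id": str(i), "data": {"language": "english"}}) + "\n")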

Below is a Python script that uses the bulk_write API with unordered upserts, provided by the PyMongo driver.

# Upsert 1M documents
from pymongo import UpdateOne, MongoClient

client = MongoClient()  # or some remote sharded cluster
collection = client.test.test
operations = []
for i in range(0, 1_000_000):
    # Queue an upsert for every document
    operations.append(
        UpdateOne({"_id": str(i)}, {"$set": {"data": {"language": "english"}}}, upsert=True)
    )
    # Send a batch once every 1000 operations
    if len(operations) == 1000:
        collection.bulk_write(operations, ordered=False)
        operations = []
# Flush any remaining operations
if len(operations) > 0:
    collection.bulk_write(operations, ordered=False)

Result:

mongoimport took 30 sec to upsert just 1% of the data.
The Python script took 45 sec to 1 min for the entire dataset.

Digging into why mongoimport performs poorly, I found that it does not support unordered upserts, which makes sense in cases where the order of upserts matters. Since the upserts are sequential, we lose parallelism.

https://github.com/mongodb/mongo-tools/blob/master/mongoimport/mongoimport.go#L238

In my use case, _id is the upsert field and occurs only once per ingestion, so the order of upserts doesn't matter and this constraint of mongoimport becomes a bottleneck. The option we are left with is to use the bulk_write API in supported mongo drivers with the ordered parameter set to false, which gives us fast unordered upserts with a degree of parallelism.
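
If a single client loop still isn't fast enough, the same batches can also be submitted from multiple threads, since a PyMongo MongoClient is safe to share across threads. Below is a rough sketch under the same document shape as above; the worker count, batch size, and helper function are assumptions for illustration, not something from the original test.

# Parallel unordered upserts: a thread pool issues bulk_write batches concurrently
from concurrent.futures import ThreadPoolExecutor
from pymongo import MongoClient, UpdateOne

client = MongoClient()  # or some remote sharded cluster
collection = client.test.test

def upsert_batch(start, end):
    # Build and send one unordered batch of upserts
    ops = [
        UpdateOne({"_id": str(i)}, {"$set": {"data": {"language": "english"}}}, upsert=True)
        for i in range(start, end)
    ]
    collection.bulk_write(ops, ordered=False)

BATCH = 1000
with ThreadPoolExecutor(max_workers=8) as pool:
    for start in range(0, 1_000_000, BATCH):
        pool.submit(upsert_batch, start, start + BATCH)

Setting ordered=False means the server does not have to apply each operation strictly after the previous one, which is exactly the parallelism that mongoimport's sequential upserts give up.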
