How to achieve fast MongoDB upserts when the upsert field is unique?

Saikumar Chintada

Lately I was trying to ingest millions of documents into MongoDB with an upsert pattern. The dataset has a structure like the one below.

{_id : "<custom unique value>", some_other_field : "<string>"}

One of the key points of this dataset is that each _id appears exactly once during the ingestion. Since the dataset is a jsonlines file (newline-separated JSON objects), the immediate candidate for the ingestion was the mongoimport tool.

The MongoDB (3.6) setup was a sharded cluster with 3 secondaries per shard. But the ingestion seemed to be taking longer than expected.

So, after googling for a few minutes, I ran the following test:

//data set for upsert using mongoimport (1M documents)
{"_id": "0", "data": {"language": "english"}}
{"_id": "1", "data": {"language": "english"}}
{"_id": "2", "data": {"language": "english"}}
{"_id": "3", "data": {"language": "english"}}
.
.
.
mongoimport -d test -c testmongoupsert --type json --file test.json --mode upsert
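For reference, a test.json file in the shape shown above can be generated with a short script like this (a minimal sketch using only the standard library; the field values mirror the sample documents):

```python
import json

def write_dataset(path, n):
    """Write n newline-delimited JSON documents, each with a unique string _id."""
    with open(path, "w") as f:
        for i in range(n):
            doc = {"_id": str(i), "data": {"language": "english"}}
            f.write(json.dumps(doc) + "\n")

write_dataset("test.json", 1_000_000)
```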

Below is a Python script that uses the bulk_write API with unordered upserts, as provided by the PyMongo driver:

# Upsert 1M documents
from pymongo import UpdateOne, MongoClient

client = MongoClient()  # or some remote sharded cluster
collection = client.test.test
operations = []
for i in range(1_000_000):
    operations.append(
        UpdateOne({"_id": str(i)}, {"$set": {"data": {"language": "english"}}}, upsert=True)
    )
    # Send a batch once every 1000 operations
    if len(operations) == 1000:
        collection.bulk_write(operations, ordered=False)
        operations = []
# Flush any remaining operations
if len(operations) > 0:
    collection.bulk_write(operations, ordered=False)

Result:

mongoimport took 30 seconds to upsert just 1% of the data.
The Python script took 45 seconds to 1 minute for the entire dataset.

Digging into why mongoimport performs poorly, I found that it does not support unordered upserts, which makes sense in cases where the order of upserts matters. Since the upserts are executed sequentially, we lose parallelism.

https://github.com/mongodb/mongo-tools/blob/master/mongoimport/mongoimport.go#L238

In my use case, _id is the upsertField and occurs once per ingestion, so the order of upserts doesn't matter, and this constraint of mongoimport becomes a bottleneck. The option we are left with is to use the bulk_write API in supported MongoDB drivers with the ordered parameter set to false, achieving fast unordered upserts with a degree of parallelism.
