How to achieve fast MongoDB upserts when the upsert field is unique?

Lately I was trying to ingest millions of documents into MongoDB with an upsert pattern. The dataset has a structure like the one below.

{"_id": "<custom unique value>", "some_other_field": "<string>"}

One of the key points of this dataset is that each _id appears exactly once during the ingestion. Since the dataset is a JSON Lines file (newline-separated JSON objects), the immediate candidate for the ingestion was the mongoimport tool.

The MongoDB (3.6) setup was a sharded cluster with 3 secondaries per shard, but the ingestion seemed to take longer than expected.

So, after googling for a few minutes, I ran the following test:

//data set for upsert using mongoimport (1M documents)
{"_id": "0", "data": {"language": "english"}}
{"_id": "1", "data": {"language": "english"}}
{"_id": "2", "data": {"language": "english"}}
{"_id": "3", "data": {"language": "english"}}
.
.
.
mongoimport -d test -c testmongoupsert --type json --file test.json --mode upsert
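
The test file itself can be generated with a few lines of Python (a sketch; the filename test.json and the document shape match the command and samples above):

# generate_test_data.py - writes 1M newline-separated JSON documents
import json

with open("test.json", "w") as f:
    for i in range(1_000_000):
        f.write(json.dumps({"_id": str(i), "data": {"language": "english"}}) + "\n")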

For comparison, below is a Python script approach that uses the bulk_write API with unordered upserts, provided in the Python MongoDB driver (pymongo), to upsert the same 1M documents.
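A minimal sketch of such a script (the batch size of 10,000, the default connection, and the ReplaceOne-based upserts keyed on _id are illustrative assumptions, not the exact original):

# bulk_upsert.py - sketch: unordered bulk upserts with pymongo
import json

from pymongo import MongoClient, ReplaceOne

BATCH_SIZE = 10_000  # assumption; tune to document size and memory

client = MongoClient()  # assumes mongos/localhost defaults
coll = client.test.testmongoupsert

ops = []
with open("test.json") as f:
    for line in f:
        doc = json.loads(line)
        # ReplaceOne with upsert=True mirrors mongoimport's upsert mode:
        # replace the document when _id matches, insert it otherwise.
        ops.append(ReplaceOne({"_id": doc["_id"]}, doc, upsert=True))
        if len(ops) >= BATCH_SIZE:
            # ordered=False: the server is free to apply the writes in
            # parallel and keeps going past individual failures.
            coll.bulk_write(ops, ordered=False)
            ops = []
if ops:
    coll.bulk_write(ops, ordered=False)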

Result :

mongoimport took 30 seconds to upsert just 1% of the data.
The Python script took 45 seconds to 1 minute for the entire dataset.

Digging into why mongoimport performs poorly here, I found that it does not support unordered upserts, which makes sense in cases where the order of upserts matters. But because the upserts are applied sequentially, we lose parallelism.

https://github.com/mongodb/mongo-tools/blob/master/mongoimport/mongoimport.go#L238

In my use case, _id is the upsertField and occurs once per ingestion, so the order of upserts doesn't matter, and this constraint of mongoimport becomes a bottleneck. The option we are left with is to use the bulk_write API in the MongoDB drivers that support it, with the ordered parameter set to false, to achieve fast unordered upserts with a degree of parallelism.
