How to achieve fast mongo upserts when the upsert field is unique?

Lately I was trying to ingest millions of documents into mongo with an upsert pattern. The dataset has a structure like the one below.
{_id : "<custom unique value>", some_other_field : "<string>"}
One of the key points of this dataset was that each _id appears only once during the ingestion. Since the dataset was a jsonlines file (newline-separated JSON objects), the immediate candidate for the ingestion was the mongoimport tool.
The mongo (3.6) setup was a sharded cluster with 3 secondaries per shard. But the ingestion was taking longer than expected.
So I ran the following test after googling for a few minutes:
//data set for upsert using mongoimport (1M documents)
{"_id": "0", "data": {"language": "english"}}
{"_id": "1", "data": {"language": "english"}}
{"_id": "2", "data": {"language": "english"}}
{"_id": "3", "data": {"language": "english"}}
.
.
.
mongoimport -d test -c testmongoupsert --type json --file test.json --mode upsert
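For anyone who wants to reproduce the test, a file in the format shown above could be generated with a few lines of Python. This is only a minimal sketch: the function name write_dataset is made up for illustration, and the file name test.json is taken from the mongoimport command above.

```python
import json


def write_dataset(path, count):
    """Write `count` JSON documents, one per line, with _id "0" .. str(count - 1)."""
    with open(path, "w", encoding="utf-8") as handle:
        for i in range(count):
            # Same shape as the sample rows: a string _id and a fixed payload.
            doc = {"_id": str(i), "data": {"language": "english"}}
            handle.write(json.dumps(doc) + "\n")
```

Calling write_dataset("test.json", 1_000_000) should produce the 1M-document file used in the test.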
Below is the Python script that uses the bulk_write API with unordered upserts, provided by the Python MongoDB driver (pymongo):
# Upsert 1M documents
from pymongo import UpdateOne, MongoClient

client = MongoClient()  # or some remote sharded cluster
collection = client.test.test
operations = []
for i in range(0, 1_000_000):
    # Upsert a fixed payload keyed by the unique _id
    operations.append(
        UpdateOne({"_id": str(i)}, {"$set": {"data": {"language": "english"}}}, upsert=True)
    )
    # Send once every 1000 in batch
    if len(operations) == 1000:
        collection.bulk_write(operations, ordered=False)
        operations = []

# Flush the final partial batch, if any
if len(operations) > 0:
    collection.bulk_write(operations, ordered=False)
Result:
mongoimport took 30 sec to upsert 1% of the data.
The python script took 45 sec to 1 min for the entire dataset.
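To put the two numbers on the same scale, the 30 sec for 1% figure can be extrapolated to the full dataset. This is a rough back-of-the-envelope sketch that assumes the ingestion rate stays constant for the whole run, which I did not verify; the helper name extrapolated_seconds is made up for illustration.

```python
def extrapolated_seconds(elapsed_seconds, percent_done):
    """Estimate the total run time, assuming a constant ingestion rate."""
    return elapsed_seconds * 100 / percent_done


# 30 sec for 1% of the data, as measured above.
estimated_total = extrapolated_seconds(30, 1)

# The bulk_write script took between 45 and 60 sec for the full dataset.
slowest_ratio = estimated_total / 60
fastest_ratio = estimated_total / 45

print(estimated_total / 60, "minutes (estimated)")
print(slowest_ratio, "to", round(fastest_ratio, 1), "times slower (estimated)")
```

Under that assumption the first run would need about 50 minutes for the full dataset, somewhere between 50 and 67 times longer than the script.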
Digging into why mongoimport was performing poorly, I found that it does not support unordered upserts, which makes sense in cases where the order of upserts matters. Since the upserts are sequential, we lose parallelism.

In my use case, _id is the upsertField and occurs once per ingestion, so the order of upserts doesn't matter and this constraint of mongoimport is a bottleneck. So the option we are left with is to use the bulk_write API in a supported mongo driver with the ordered parameter set to false, which achieves fast unordered upserts with a degree of parallelism.
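For the real use case the documents come from a jsonlines file rather than a loop, so the same idea could be sketched as below. This is a hedged sketch, not a drop-in replacement for mongoimport: the names iter_batches and upsert_file are made up for illustration, and it assumes every line carries an _id plus at least one other field (an empty $set is rejected by the server).

```python
import json
from itertools import islice


def iter_batches(lines, batch_size=1000):
    """Yield lists of (filter, update) pairs, at most batch_size pairs per list."""
    docs = (json.loads(line) for line in lines if line.strip())
    specs = (
        # Match on the unique _id and $set every other field of the document.
        ({"_id": doc["_id"]}, {"$set": {k: v for k, v in doc.items() if k != "_id"}})
        for doc in docs
    )
    while True:
        batch = list(islice(specs, batch_size))
        if not batch:
            return
        yield batch


def upsert_file(path, collection, batch_size=1000):
    """Upsert every document of a jsonlines file with unordered bulk writes."""
    # Imported here so that iter_batches can be used without pymongo installed.
    from pymongo import UpdateOne

    with open(path, encoding="utf-8") as handle:
        for batch in iter_batches(handle, batch_size):
            operations = [UpdateOne(flt, upd, upsert=True) for flt, upd in batch]
            collection.bulk_write(operations, ordered=False)
```

With a pymongo collection object in hand, a call like upsert_file("test.json", collection) would stream the file in batches of 1000. The batch size is the same one used in the script above and is not tuned.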