Multi-Document Transactions on MongoDB 4.0
Let's look at multi-document transactions in MongoDB 4.0 and examine their ACID properties and their interaction with replication.
At the end of last month, the release of MongoDB 4.0, which supports multi-document transactions on replica sets, was announced at MongoDB World. Alibaba Cloud ApsaraDB database R&D engineers analyzed the source code of the transaction functionality and worked out its implementation mechanism. In this article, we will discuss the implementation of multi-document transactions in MongoDB 4.0.
What's New in MongoDB 4.0
The transaction functionality introduced in MongoDB 4.0 supports multi-document operations with ACID properties, for example through the mongo shell:
> s = db.getMongo().startSession()
session { "id" : UUID("3bf55e90-5e88-44aa-a59e-a30f777f1d89") }
> s.startTransaction()
> db.coll01.insert({x: 1, y: 1})
WriteResult({ "nInserted" : 1 })
> db.coll02.insert({x: 1, y: 1})
WriteResult({ "nInserted" : 1 })
> s.commitTransaction() // or s.abortTransaction() to roll back the transaction
Other language drivers that support MongoDB 4.0 also encapsulate transaction-related APIs: the user creates a Session, then starts and commits the transaction on that Session. For example:
Python Version
with client.start_session() as s:
    s.start_transaction()
    collection_one.insert_one(doc_one, session=s)
    collection_two.insert_one(doc_two, session=s)
    s.commit_transaction()
Java Version
try (ClientSession clientSession = client.startSession()) {
    clientSession.startTransaction();
    collection.insertOne(clientSession, docOne);
    collection.insertOne(clientSession, docTwo);
    clientSession.commitTransaction();
}
Session
Session is a concept introduced in MongoDB 3.6, mainly to support multi-document transactions. A Session is essentially a "context."
In previous versions, MongoDB only managed the context of a single operation: the mongod service process received a request, created a context for it (an OperationContext in the source code), and then used that context throughout the request. The context includes time-consumption statistics for the request, the lock resources it holds, the storage snapshot it uses, and so on.
With a Session, multiple requests can share a single context and be correlated with one another, which is what makes multi-document transactions possible.
Each Session carries a unique identifier, the lsid. In version 4.0, each user request can carry additional extension fields, including:
- lsid: The ID of the Session the request belongs to, also known as the logical session ID.
- txnNumber: The transaction number of the request, which must increase monotonically within a Session.
- stmtIds: The IDs of the individual operations within the request (in the case of insert, a single insert command can insert multiple documents).
In practice, users do not need to understand these details when using transactions, because the MongoDB Driver handles them automatically. The Driver allocates the lsid when the Session is created and automatically attaches it to every subsequent operation in that Session. For transaction operations, the txnNumber is added automatically as well.
It is worth mentioning that the Session's lsid can be allocated by the server via the startSession command, or by the client itself, which saves a network round trip. For identifying transactions, MongoDB does not provide a separate startTransaction command; the txnNumber is assigned directly by the Driver, which only needs to guarantee that txnNumbers increase monotonically within a Session. When the server receives a request with a new transaction number, it actively starts a new transaction.
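This driver-side bookkeeping can be pictured with a small sketch (pure Python; this is an illustrative model, not pymongo's actual internals, and all class and field names here are made up):

```python
import itertools
import uuid

class Session:
    """Illustrative model of a driver-side logical session."""
    def __init__(self):
        # The driver may allocate the lsid itself, saving a round trip
        # to the server's startSession command.
        self.lsid = {"id": uuid.uuid4()}
        self._txn_counter = itertools.count(0)
        self.txn_number = None

    def start_transaction(self):
        # No startTransaction command is sent; the driver simply picks
        # the next monotonically increasing txnNumber. The server starts
        # a new transaction when it first sees this number.
        self.txn_number = next(self._txn_counter)

    def decorate(self, command):
        """Attach session/transaction fields to an outgoing command."""
        command = dict(command)
        command["lsid"] = self.lsid  # added to every operation in the session
        if self.txn_number is not None:
            command["txnNumber"] = self.txn_number
        return command

s = Session()
s.start_transaction()
cmd = s.decorate({"insert": "coll01", "documents": [{"x": 1, "y": 1}]})
print(cmd["txnNumber"])  # 0; increments with each new transaction
```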
When calling startSession, a series of options can be specified to control the Session's access behavior, including:
- causalConsistency: Whether to provide causal consistency semantics. If set to true, MongoDB guarantees "read your own write" semantics no matter which node is read. See causal consistency.
- writeConcern: MongoDB allows clients to flexibly configure the write policy (writeConcern) to meet the needs of different scenarios.
- readConcern: Introduced in version 3.2 as the read-side counterpart to writeConcern, readConcern lets clients flexibly customize the read policy.
- readPreference: The rules for selecting which node to read from; see read preference.
- retryWrites: If set to true, MongoDB will automatically retry writes in a replica set when a primary re-election occurs; see retryable writes.
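The causal consistency option is worth a closer look. A driver can provide "read your own write" by tracking the highest operation time it has seen and asking the next node it reads from to wait until that time has been applied. The sketch below (pure Python, illustrative names; not pymongo's code) models that bookkeeping, assuming the server-side mechanism is the afterClusterTime field of readConcern:

```python
class CausalSession:
    """Sketch of causal-consistency bookkeeping on the driver side."""
    def __init__(self):
        self.operation_time = 0  # highest timestamp seen in any server reply

    def note_reply(self, reply_ts):
        # Track the latest operation time returned by the server
        # (e.g. after a write is acknowledged).
        self.operation_time = max(self.operation_time, reply_ts)

    def read_command(self, command):
        # "Read your own write": ask the chosen node to wait until it has
        # applied at least operation_time before answering the read.
        command = dict(command)
        command["readConcern"] = {"afterClusterTime": self.operation_time}
        return command

s = CausalSession()
s.note_reply(17)  # a write acknowledged at timestamp 17
cmd = s.read_command({"find": "coll01"})
# Even on a lagging secondary, this read blocks until ts >= 17 is applied.
print(cmd["readConcern"])  # {'afterClusterTime': 17}
```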
ACID
Atomicity, Consistency, Isolation, and Durability (ACID) is a set of properties for database transactions aimed at ensuring the validity of transactions under all circumstances. ACID plays an essential role in multi-document transactions.
Atomicity
For multi-document transaction operations, MongoDB provides an "all or nothing" atomicity guarantee: data changes become visible outside the transaction only if it commits successfully. If the transaction fails, all of its data changes are discarded.
Consistency
Consistency is straightforward: only valid transactions are allowed on the database, which prevents an illegal transaction from corrupting it.
Isolation
MongoDB provides snapshot isolation: it creates a WiredTiger snapshot at the beginning of the transaction and uses that snapshot for all reads throughout the transaction.
Durability
When a transaction uses WriteConcern {j: true}, MongoDB guarantees that it returns only after the transaction log has been persisted, so even if a crash occurs, MongoDB can recover from the transaction log. If the {j: true} level is not specified, then even a successfully committed transaction may be rolled back after crash recovery.
Transactions and Replication
In a replica set configuration, MongoDB records a single oplog entry when the entire transaction is committed, containing all the operations in the transaction. (The oplog entry is a normal document, so in the current version a transaction's modifications cannot exceed MongoDB's 16 MB document size limit.) Slave nodes pull the oplog and replay the transaction operations locally.
Here is an example transaction oplog entry, containing the lsid and txnNumber of the transaction, and all operation logs within the transaction (the applyOps field):
"ts" : Timestamp(1530696933, 1), "t" : NumberLong(1), "h" : NumberLong("4217817601701821530"), "v" : 2, "op" : "c", "ns" : "admin.$cmd", "wall" : ISODate("2018-07-04T09:35:33.549Z"), "lsid" : { "id" : UUID("e675c046-d70b-44c2-ad8d-3f34f2019a7e"), "uid" : BinData(0,"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=") }, "txnNumber" : NumberLong(0), "stmtId" : 0, "prevOpTime" : { "ts" : Timestamp(0, 0), "t" : NumberLong(-1) }, "o" : { "applyOps" : [ { "op" : "i", "ns" : "test.coll2", "ui" : UUID("a49ccd80-6cfc-4896-9740-c5bff41e7cce"), "o" : { "_id" : ObjectId("5b3c94d4624d615ede6097ae"), "x" : 20000 } }, { "op" : "i", "ns" : "test.coll3", "ui" : UUID("31d7ae62-fe78-44f5-ba06-595ae3b871fc"), "o" : { "_id" : ObjectId("5b3c94d9624d615ede6097af"), "x" : 20000 } } ] } }
The entire replay process is as follows:
- Get the current batch (the back end continuously pulls oplog entries into the batch).
- Set the OplogTruncateAfterPoint timestamp to the timestamp of the first oplog in the batch (stored in the collection local.replset.oplogTruncateAfterPoint).
- Write all the oplogs in the batch to the collection local.oplog.rs; if the number of oplogs is large, the writes are accelerated by parallelizing them.
- Clear OplogTruncateAfterPoint, marking the oplog as completely and successfully written. If a crash occurs before this step completes, recovery after restart will find oplogTruncateAfterPoint set and truncate the oplog back to that timestamp to restore a consistent state.
- Divide the oplog among multiple threads for concurrent replay. To improve concurrency efficiency, the oplog generated by a transaction contains all of its modifications and, like the oplog of a normal single operation, is divided among threads by document ID.
- Update the ApplyThrough timestamp to the timestamp of the last oplog in the batch, marking that the next restart should resynchronize from that location. If a failure occurs before this step, the oplog will be pulled again starting from the last value of ApplyThrough (the last oplog of the previous batch).
- Update the oplog visible timestamp, so that if other nodes synchronize from this slave node, the newly written oplog entries can be read.
- Update the local snapshot (timestamp), making the new writes visible to users.
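The crash-safety of these steps can be sketched with a toy model (pure Python; names like SlaveApplier are illustrative, not MongoDB's code, and the multi-threaded replay is collapsed into a simple loop):

```python
class SlaveApplier:
    """Toy model of the batch-replay steps above."""
    def __init__(self):
        self.oplog = []                   # stands in for local.oplog.rs
        self.truncate_after_point = None  # local.replset.oplogTruncateAfterPoint
        self.apply_through = None
        self.data = {}

    def apply_batch(self, batch):
        # 1. Record where a crash-recovery truncation would start.
        self.truncate_after_point = batch[0]["ts"]
        # 2. Persist the whole batch into the local oplog.
        self.oplog.extend(batch)
        # 3. Batch fully written: clear the truncate point.
        self.truncate_after_point = None
        # 4. Replay entries (the real code shards them across threads by _id).
        for entry in batch:
            self.data[entry["_id"]] = entry["value"]
        # 5. Remember the resume position for the next restart.
        self.apply_through = batch[-1]["ts"]

    def recover(self):
        # If a crash left the truncate point set, the batch write was
        # incomplete: drop oplog entries from that timestamp onward.
        if self.truncate_after_point is not None:
            self.oplog = [e for e in self.oplog
                          if e["ts"] < self.truncate_after_point]
            self.truncate_after_point = None

a = SlaveApplier()
a.apply_batch([{"ts": 1, "_id": "x", "value": 1},
               {"ts": 2, "_id": "y", "value": 2}])
print(a.apply_through)  # 2
```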
Transaction and Storage Engine
Unified Transaction Timing
WiredTiger has supported transactions for a long time, and in the 3.x versions MongoDB already used WiredTiger transactions to guarantee the atomicity of modifications to data, indexes, and the oplog. Yet MongoDB only exposed a transaction API after iterating across multiple versions. The core difficulty is timing, i.e. ordering.
MongoDB uses oplog timestamps to identify global order, while WiredTiger uses internal transaction IDs. There was no association between the two in the implementation, so the transaction commit order MongoDB sees could be inconsistent with the commit order WiredTiger sees.
To solve this problem, WiredTiger 3.0 introduced a transaction timestamp mechanism. The application can explicitly assign a commit timestamp to a WiredTiger transaction through the WT_SESSION::timestamp_transaction API, and can then read at a specified timestamp (read "as of" a timestamp). With the read "as of" a timestamp feature, reads on a slave node no longer conflict with oplog replay, so read requests are not blocked while the oplog is being replayed. This is a significant improvement in version 4.0.
/*
 * __wt_txn_visible --
 *     Can the current transaction see the given ID / timestamp?
 */
static inline bool
__wt_txn_visible(
    WT_SESSION_IMPL *session, uint64_t id, const wt_timestamp_t *timestamp)
{
    if (!__txn_visible_id(session, id))
        return (false);

    /* Transactions read their writes, regardless of timestamps. */
    if (F_ISSET(&session->txn, WT_TXN_HAS_ID) && id == session->txn.id)
        return (true);

#ifdef HAVE_TIMESTAMPS
    {
        WT_TXN *txn = &session->txn;

        /* Timestamp check. */
        if (!F_ISSET(txn, WT_TXN_HAS_TS_READ) || timestamp == NULL)
            return (true);

        return (__wt_timestamp_cmp(timestamp, &txn->read_timestamp) <= 0);
    }
#else
    WT_UNUSED(timestamp);
    return (true);
#endif
}
As the code above shows, once transaction timestamps are introduced, the timestamp is checked in addition to the transaction ID when determining visibility. When an upper-layer read specifies a read timestamp, only data committed at or before that timestamp is visible. MongoDB associates the oplog timestamp with the transaction at commit time, so that the ordering seen by the MongoDB Server layer is consistent with that of the WiredTiger layer.
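The resulting behavior can be modeled with a minimal MVCC sketch (pure Python, illustrative only; TimestampedStore is a made-up name, not a MongoDB or WiredTiger structure):

```python
class TimestampedStore:
    """Minimal sketch of timestamped commits and 'read as of' a timestamp."""
    def __init__(self):
        self.versions = {}  # key -> ascending list of (commit_ts, value)

    def commit(self, key, value, commit_ts):
        # MongoDB assigns the oplog timestamp as the WiredTiger commit
        # timestamp, so both layers agree on transaction order.
        self.versions.setdefault(key, []).append((commit_ts, value))
        self.versions[key].sort()

    def read_as_of(self, key, read_ts):
        # Only versions committed at or before read_ts are visible,
        # mirroring the timestamp check in __wt_txn_visible.
        visible = [v for ts, v in self.versions.get(key, []) if ts <= read_ts]
        return visible[-1] if visible else None

store = TimestampedStore()
store.commit("x", "v1", commit_ts=10)
store.commit("x", "v2", commit_ts=20)
print(store.read_as_of("x", 15))  # v1: the commit at ts=20 is not yet visible
```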
Impact of Transaction on the Cache
A WiredTiger (WT) transaction opens a snapshot, and the presence of snapshots affects WiredTiger cache eviction. A WT page may hold N modified versions; if these modifications are not globally visible (__wt_txn_visible_all), the page cannot be evicted (__wt_page_can_evict).
In the 3.x versions, the modifications a write request made to data, indexes, and the oplog were placed in a single WT transaction whose commit was controlled by MongoDB, and MongoDB committed the transaction as soon as possible to complete the write request. With the transactions introduced in version 4.0, however, the commit is controlled by the application: a transaction may contain many modifications and may not be committed for a long time. This puts great pressure on WT cache eviction; if a large amount of memory cannot be evicted, the cache eventually becomes stuck.
To minimize pressure on the WT cache, MongoDB 4.0 places some restrictions on transactions, and when a transaction's resource usage exceeds a certain threshold, it is automatically aborted to release resources. The rules include:
- The lifetime of a transaction cannot exceed transactionLifetimeLimitSeconds (60 seconds by default), which can be modified online.
- The number of documents modified by a transaction cannot exceed 1,000; this limit cannot be modified.
- The oplog generated by a transaction's modifications cannot exceed 16 MB. This is MongoDB's document size limit; the transaction's oplog entry is itself a normal document and must comply with the constraint.
Read as of a Timestamp and Oldest Timestamp
Read as of a timestamp relies on WiredTiger maintaining multiple versions in memory, each associated with a timestamp. As long as the MongoDB layer may still need to read a version, the engine layer must retain the resources for it; if too many versions are retained, great pressure is put on the WT cache.
WiredTiger therefore lets the application set an oldest timestamp. By setting it, MongoDB promises that no read as of a timestamp will use a timestamp smaller than the oldest timestamp, so WiredTiger does not need to maintain any history versions older than the oldest timestamp. The MongoDB layer must advance the oldest timestamp frequently (in a timely manner) to avoid putting too much pressure on the WT cache.
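The pruning that the oldest timestamp enables can be sketched as follows (pure Python, illustrative only; this is not WiredTiger's actual eviction logic):

```python
def prune_versions(versions, oldest_ts):
    """Discard history no reader can need, given the oldest timestamp.

    versions: ascending list of (commit_ts, value). All versions strictly
    older than the newest version at or before oldest_ts can be dropped,
    because no "read as of" will ever use a timestamp below oldest_ts.
    """
    keep_from = 0
    for i, (ts, _) in enumerate(versions):
        if ts <= oldest_ts:
            keep_from = i  # newest version still needed at oldest_ts
        else:
            break
    return versions[keep_from:]

history = [(5, "a"), (10, "b"), (20, "c")]
# A read can never happen before ts=12 anymore, so (5, "a") is droppable.
print(prune_versions(history, oldest_ts=12))  # [(10, 'b'), (20, 'c')]
```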
Engine Layer Rollback and Stable Timestamp
In the 3.x versions, the rollback operation of a MongoDB replica set was done at the Server layer: when a node needed to roll back, it continuously applied reverse operations according to the oplog entries being rolled back, or read the latest versions from the rollback source node. The entire rollback operation was inefficient.
Version 4.0 implements rollback at the storage engine layer. When a replica set node needs to roll back, it directly calls a WiredTiger API to roll the data back to a stable version (a checkpoint). This stable version depends on the stable timestamp: WiredTiger ensures that data with timestamps after the stable timestamp is not written to the checkpoint. Based on the synchronization state of the replica set, MongoDB advances the stable timestamp when data has been synchronized to most nodes (majority committed); since majority-committed data will not be rolled back, data before this timestamp can safely be written to the checkpoint.
MongoDB must advance the stable timestamp frequently (in a timely manner); otherwise WT checkpoint behavior is affected and a lot of memory cannot be released.
Distributed Transactions
MongoDB 4.0 supports multi-document transactions on replica sets and plans to support transactions on sharded clusters in version 4.2. [Figure: functional iteration from MongoDB 3.0, which introduced WiredTiger, to 4.0, which supports multi-document transactions.]
Published at DZone with permission of Leona Zhang. See the original article here.