Introduction:
Many a time when working on an application we want to make multiple co-related reads and writes to database to complete a single unit of work. These operations are typically triggered by a single action from an application user. For example:
A simple debit-credit transaction in a banking application is triggered by a fund transfer from one account to another.
Borrowing a book from a library involves checking if the book is already borrowed and, if not, creating an entry for the user who is borrowing the book while updating the book's entry to indicate it is not available until a certain date.
Since these operations are interrelated, we need a way to ensure they are executed atomically, in an all-or-nothing manner, in a safe environment free from errors, concurrency issues, and durability concerns. To address these issues, databases provide an abstraction called transactions, allowing us to group several reads and writes into a single logical unit.
Transactions provide safety guarantees described by the acronym ACID.
Atomicity - Ensures the whole transaction either completes (commits) or aborts (rolls back). If an error occurs halfway through a transaction, any previously made writes are discarded.
Consistency - Ensures the data remains consistent before and after the transaction.
Isolation - ensure that concurrently executing transactions are isolated from one another.
Durability - Guarantees that once a transaction is completed, any data written is not lost, even in case of failure.
Now going further we will explore in detail how databases support isolation to provide guarantees around concurrent transactions and the kinds of problems it helps us solve.
Isolation levels and read/write phenomenon:
Isolation in databases means that concurrently executing transactions operating on the same data are isolated from one other, i.e. they don't interfere with each other and any committed transaction would result as if they had run serially( also known as serializable isolation). If we were to take this definition to its word all would be great for an application developer, but reality is much different from this.
In reality isolation guarantees provided by databases are weak levels of isolation, because serializable isolation have a performance penalties and is seldom used by default. A weak isolation level causes several concurrency problems which the developer needs to be aware of to make informed decisions, so let's look into different isolation levels provided by databases and what kind of problems they can and cannot solve.
Dirty reads, Dirty writes and Read committed isolation:
DIRTY READ:
If a transaction T1 has written some data to a database but has not yet committed, can another transaction T2 read the same data? If T2 can see this uncommitted data, it is considered a dirty read because it is reading a value that is not guaranteed to be committed later.
Example:
Here T2 can read the value set by T1 even before T1 transaction commits. So later when T1 transaction aborts T2 is left with a stale value.
DIRTY WRITE:
Similarly, consider two separate transactions that are trying to update the same data in a database. Say the first transaction T1 has updated the data but not committed yet, in this scenario can the second transaction T2 overwrite this uncommitted value? If yes then it is considered a dirty write because the uncommitted value by T1 is overwritten by T2.
Example :
Here first T2 updates a data to an initial value, but then concurrently T1 updates the same data to a different value, so the value written by T2 which is not committed yet has been overwritten by a different value altogether.
In both the above cases we can see that an uncommitted operation of a transaction steps foot on another concurrently running transaction operating on the same data. The isolation level which protects against this is READ COMMITTED ISOLATION.
Read committed isolation ensures two things:
When reading, a transaction will only see a committed value.
When writing, a transaction will only overwrite a committed value.
So by choosing read committed isolation dirty reads would be avoided as a transaction won't see an uncommitted stale value anymore and only see a committed value.
And dirty writes would be also avoided as an uncommitted write of a transaction won't be overwritten by another concurrent transaction.
Example:
The updatedData is only visible to user2 once user1 commits his transaction.
Read committed isolation seems to solve the problem until we dig a little deeper and figure out that there are still some corners that need to be patched.
Repeatable read and snapshot isolation:
Imagine two concurrent transactions, the 1st transaction T1 reads some data and right after that a 2nd transaction T2 updates the same data and commits, now if T1 again reads the same data it will get the new updated data. So now T1 got two different values for the same data within the same transaction and this is not desirable some times. This anomaly is also known as non-repeatable read and read committed isolation cannot protect from this.
The isolation level which can help us here is Snapshot isolation. By using snapshot isolation transaction reads data from a consistent snapshot of the database i.e once a long running transaction begins using snapshot isolation level on reading the same data multiple times it is guaranteed that the value read would be the same even if it is concurrently modified and updated by another transaction. This turns out to be really helpful for long running read-only queries for backups or analytics.
This is achieved by keeping several different committed version of an object and hence this technique is also known as multi-version concurrency control.
Lost updates:
With Snapshot isolation we have protected long running transaction from read anomalies but we still need to consider issues of transactions writing concurrently particularly the read-modify-write-cycle.
Consider two users with two concurrent transactions T1 and T2. T1 and T2, both read a=100. T1 updates the value to 120(100+20) and commits, then T2 updates it to 80(100-20)and commits. When T1 rereads the data, the update they made is lost, known as a lost update. This happens because T2 is unaware of T1's modification.
To tackle this problem databases provide few solutions:
Compare and set operations:
In some databases in the update request we can add a condition to check whether the data that is being updated has the same value as the one that was read, i.e ensuring that it is unmodified.
In our above example T2 can add a condition along with the update statement of (100-20) to only execute it if the value of 'a' is still 100 else abort the update.
Atomic write:
Databases that provided atomic write remove the need for the read step by allowing the user to send an expression on how to update the value.
In our case T2 instead of reading 'a' then sending the update of (100-20) would just send a single expression like "SET a = a -20" which is a complete expression that would subtract 20 from whatever value 'a' stored.
Explicit locking:
In this approach we lock the object that is going to be updated, such that any subsequent read request for that object are made to wait until the first read-modify-write cycle is complete. In our example this means that when T1 reads the value to be modified it prevents T2 from reading it concurrently and T2 has to wait until T1 completes its update.
Serializable isolation:
Serializable isolation is the strictest form of isolation in databases.It guarantees that in case of concurrently running transactions the end result creates the effect as if the transactions had been run one at a time in a serial order. This ensures data correctness by preventing all anomaly which we previously saw by weaker isolation levels.
Now a question arises as to why don't we use serializable isolation by default? To answer this let's explore the trade-offs and performance penalties.
Serial execution:
In his approach concurrency is entirely removed by actually running the transaction one at a time serially on a single thread. This completely removes the need for handling conflicts which would have happened otherwise.
Cons:
Every transaction must be fast, as if its long running it might slow down/block other transactions.
Write throughput shouldn't be too high as a single CPU core needs to handle it.
Since the transactions are run serially this has a penalty on performance which are critical in high performance systems.
Two-Phase locking (2PL):
Two-phase locking serializes the transactions by having a strong lock requirements. The approach which it uses is also known as pessimistic concurrency control as it assumes that conflicts between concurrent transactions are likely to happen.
Say two transactions T1 and T2:
If T1 has read an object then T2 cannot write to that object it has to wait for T1 to commit or abort before it can write.
Also if T1 has written an object and T2 wants to read it, then T2 has to wait unit T1 commits or aborts before it can read it.
In 2PL transaction that are writing an object don't just block other writers but also readers on the same object. And this blocking is implemented by having locks on object. Locks can either be in shared mode or exclusive mode.
Examples: (All operations in the example are on the same object)
Readers acquire the lock in shared mode and at the same time several readers can hold the lock in shared mode.
If a transaction wants to write to an object then it has to acquire the lock in exclusive mode and to do this no other transaction must have any lock on the same object. And once exclusive lock is acquired no other transaction can acquire shared or exclusive lock on the same object.
Cons:
Deadlock are common as it can happen that T1 is waiting for T2 to release a lock and vice versa.
Another big downside of 2PL is its performance, along with the overhead of maintaining locks,writers block other writers and readers as well and if the write transaction takes a long time to commit or abort it would that means that other transactions have to wait for the lock to be released. These factors make the throughput and performance significantly worse under 2PL.
Conclusion:
Ultimately, choosing the right isolation level depends on the application's use case and access patterns. By thoroughly understanding the application's requirements and expected behaviour, developers can select the isolation level that provides the optimal balance between performance and data consistency.