Wednesday, January 31, 2007

Understanding Transaction Isolation

自从开始作为一个developer, 就可以是接触和了解事务隔离级别(transaction isolation level) , 但是坦白说我还没有真正的理解过这个概念, 总是停留再一种似懂非懂的境界. 期间也去网上找过很多资料, 但是都是搜索中文网页, 老实说, 国内的网站技术文章太少了 ,而且都是一篇文章所有网站抄来抄去, 很悲哀到了现在还是似懂非懂, 难道要糊涂的生, 然后糊涂的死么?
昨天, 当我现在在这里写这些文字的时候,我一直在回想, 昨天是什么促使我突然又开始关注事务隔离级别这个话题. 想不起来了, 昨天晚上公司年夜饭, 喝了不少红酒, 还喝了一杯白酒, 回家之后倒头就睡,过去的记忆仿佛都在昨天被抛弃了. 就是说想不起来了, 想不起来也无所谓, 这件事情还是要做的, 所以今天又google了一把, 关键字是transaction isolation level, 嘿嘿,找到了一些资源, 先看看. 写这些文字的时候, 我还根本没看那些找到的文章, 有点颠倒是非. 先写道这里, 看了之后, 然后再说我是否已经真正理解了事务隔离级别这个概念. 理解了我便要在这里写下我的理解.
下面这篇文章不错, 我决定直接copy过来.
--------------------------

Understanding Transaction Isolation

http://www.expresscomputeronline.com/20040426/techspace01.shtml

Article summary

This is a topic related to SQL Server. However, the concept is applicable to any RDBMS. While auditing many applications, I have found that incomplete or no understanding of ‘Isolation Level’ leads to a lot of real life problems. These include performance degradation, blocking, locking as well as major deadlocks.

This article provides an easy-to-understand view of what ‘Isolation Levels’ really mean and when to use which level.

‘Isolation Level’ is a setting that decides how data which is a part of an ongoing transaction is made visible to other transactions.

Transactions are well understood. So what is the problem?

As all of us know, it is a unit of work. The work may contain many steps but either all steps happen or all steps don’t happen. This is fairly well known. Nothing new here. But consider how transactions occur in real life.

  • I am updating one row in a Begin Tran and Commit section. I have issued Begin Tran as well as the Update statement. Commit is not yet issued – because I want to perform some more actions in the same transactions on some other tables. If some other user wants to read this row that I have updated but not yet committed, what should happen?
  • Consider another scenario. I start a transaction. I calculate a total of a field based upon all rows in a table. Now, I need to add a new record in another table which contains this total. Now, can the original table be changed by some other user after I calculate the total? In which case, there could be a mismatch. Do you want to take such chances?
  • Another scenario. I am working on some transaction table between a range of keys within a transaction—say 10 and 20. There were only 5 records when I read the range – 10, 12, 14, 16, 20. Now I am working on other things in the transaction. Before I could commit the transaction, someone added another row with a key value of 11. Now, my base assumption about what records I read between 10 and 20 and further work upon them itself is wrong. Problem!!! Is it not?
  • I start reading a long table. It is not a transaction at all. But other users want to refer to that table for updating some fields in specific rows. The query takes 20 minutes to read all the rows. What happens to other users who are trying to update the rows? Do they wait for 20 minutes? Or they are allowed to update the rows even when the rows are being read in a large query? What if the query was used to generate a summary report containing grand totals? The total would be wrong because after the summation started, some rows have changed. Some of these rows could have changed after the summation occurred. What’s to be done now?

As you can see all these situations are confusing and prone to inaccuracies. To avoid such problems we have a feature called “Isolation Levels”.

Isolation Levels are applicable to transactions. These decide the visibility of information which is a part of an ongoing transaction.

In very simple terms Isolation Levels decide “What happens when some data being referred to (for reading or writing) within an incomplete transaction is also being referred to (for reading or writing) from another connection (or user – actually it is called another ‘transaction lock space’)?”

To make things sound technical, all these problems have been given nice and complex sounding names.

Data visibility problems that can occur during a transaction

Let us understand some jargon.

  • Transaction Lock Space:

Each transaction performs certain operations (select / insert / update / delete) on one or more rows of one or more tables. Transaction starts with Begin Tran command and ends with Commit. If unsuccessful, it ends with Rollback command. Now, after Begin Tran is issues and before Commit is issued, multiple tables may participate in the transaction related commands. Specific parts of these tables need to be locked during this phase to ensure that other users do not interfere with this transaction. This is called Transaction Lock Space.

  • Uncommitted (dirty) data:

Consider this code snippet (Blue text is code; green indicates comments):

1. Begin Transaction

—we want to change the customer status from active = “yes” to active = “no”

2. Update customer Set active = “No”

where CustomerID = 2324

—some more commandss

3. Select * from customer where CustomerID = 2324

4. Commit Transaction

Now, if the transaction commits, the value will be “No”. If it does not commit (for whatever reason), the value will remain “Yes”.

Consider that the value has already been changed to “No”. But there are more commands to be executed before the entire transaction commits. These commands are time consuming and take, say, 3 minutes. During these three minutes, if some user outside the transaction lock space reads the value of Active field in the Customer table for ID 2324, what should they see? “Yes” or “No”? The answer is simple. The value is changed but not committed. Therefore, external queries should still show “Yes”.

Now what would happen if a Select active from Customer where CustomerID – 2324 returns “No” to another user? What happens if the transaction rolls back? This is called Uncommitted Data. Ideally this should not be visible outside the transaction.

However, consider the same command executed within the transaction (line 4). It should – and it will return “No”.

Now let us consider various problems that can occur. The problems can be of three types:

1. Dirty Read

2. Non-repeatable read

3. Phantom rows

- Dirty reads

This is when connections outside the transaction space can read the uncommitted data. You don’t want this to happen in most cases. The only reason when you may want to allow this is when you are creating a report which assumes a certain amount of inaccuracy. You think this situation is rare? Not really. Suppose the report compares the current month sale with last month sale and returns a percentage difference. Consider that the report takes 5 minutes to generate. During these 5 minutes, more transactions may be getting added to the sales data. Typically 5 transactions would get added to the sales table. The average transaction value is Rs. 1000. The total sale for the month is typically 30-40 lacs. In such cases, you really don’t need to worry about absolute accuracy. You can intentionally allow dirty reads because the impact is really not significant from a business perspective.

- Non-repeatable read

Consider the above code. We change the value of Active to “No” in line 2. Now in line 4 you expect the value to be “No” because we are querying the row ‘within’ the transaction.

Now what would happen if some other transaction was allowed to change the same value to, say, “Maybe”? In line 4, we would expect value of “No” but actually get value of “Maybe”. This problem is called non-repeatable read. The idea is that within a transaction, the same data read any number of times should yield consistent results. If it is not same then the reads are ‘non-repeatable’.

- Phantom Rows

This was explained in the introduction (range 10 to 20 example). The concept is simple. If you have already read a range of data based upon a key value, another transaction inserts a new row which happens to have a value between this range. The original range which was referred to would now become invalid. This is because a “Phantom row” was added. This is also a major problem.

Important learning

Although the above three issues are listed as problems, they may not always be considered as problems! Paradoxical? Not really. As explained for Dirty Read, whether it is a problem or not depends entirely upon the business context. Technology exists to allow these problems to occurs as well as prevent them. The choice is yours. I know this complicates matters. But if these matters were not complicated, what are IT professionals paid for!

Now let us see how to prevent these problems? The answer is “using Isolation Levels”.

Types of Isolation Levels

To solve the problem of … The Isolation Level should be…
Dirty Read Read Committed (Default of SQL Server)

Dirty Read and Non-Repeatable Read Repeatable Read Non-Repeatable Read

Dirty Read and Non-Repeatable

Read and Phantom Rows

Serializable
To retain all three problems Read Uncommitted

Before we discuss each isolation level, let us understand how to set or change it in T-SQL.

Syntax

SET TRANSACTION ISOLATION LEVEL
{ READ COMMITTED
| READ UNCOMMITTED
| REPEATABLE READ
| SERIALIZABLE
}

Remember:

- This setting applies to all transactions which are happening within a single connection.

- Multiple connections can have different Isolation Levels.

- Within a single connection, you can change the Isolation Level for specific transactions.

- You must know the kind of transaction you are performing and its business requirements before you can decide the right isolation level.

- Isolation level does not just affect the current transactions in a given connection. It also affects how other transactions issued within other connections will behave.

- Therefore, you should have a good idea of what kind of concurrent transactions occur in your application when the system is live and many users are active.

This is the single biggest problem which leads to low performance, locking and blocking, as well as deadlocks

- Most developers think of individual transactions as though they are occurring in a single user system.

- It is difficult to envisage all permutations of possible transactions which can occur together. But it is very much possible to plan for it in a structured and informed manner. This requires additional effort, monitoring live systems and tweaking of specific transactions and so on.

- Unfortunately, this additional effort is rarely a part of system deployment! Such problems surface only after the system load increases in such a way that small, unnoticed issues become amplified due to large volume of data and / or large number of concurrent users.

- Only one of the options can be set at a time.

- It remains in effect till you explicitly change the option by issuing another command. This is an important aspect to consider.

Best Practice : When you change the Isolation Level for a transaction, remember to set it back to the original level after the transaction is completed.

Isolation levels

  • Read Uncommitted

This is as good (or bad) as not having any isolation. All data which is uncommitted is readable from any connection. This should not be used unless you have a very good reason to do so.

  • Read Committed

This prevents dirty reads. This does not prevent phantoms or non-repeatable reads. This is the default. Although it is the default, it does not mean that this isolation level is ideal for all transactions. One of the regular tendencies amongst techies is to simply use default values without giving it a second thought! I cannot imagine the number of phantom and non-repeatable reads that must be occurring in the world because someone simply used the ‘default’ value. It is a scary thought to say the least.

This level is obviously more restrictive than the ‘Read Uncommitted’ level.

  • Repeatable read

This prevents dirty reads as well as non-repeatable reads. It does not prevent phantom rows. This is more restrictive than Read Committed level. When I say restrictive what does it mean? It means that the chances of other transactions having to wait for this one to finish are INCREASED. Another way of saying this is – Repeatable Read reduces concurrency compared to Read Committed level.

  • Serializable

This is the most restrictive of the options. This should never be used as the default level. If you use this one as the default, it will most probably create a single user system!

This prevents all three problems.

How do Isolation Levels work?

You will notice that we have not yet addressed this issue at all in this article. The answer to this question is another Pandora’s box. The answer is simple. It uses the appropriate types of locks to achieve the desired effects. We will see what locks are and how they are implemented using Isolation Levels in the next article.

In the meantime, please go through your code and see if you are simply using the default level or are there some transactions which merit a different isolation level. If you feel like changing a level, never do it in production system. It can create havoc. Try it out in a test environment first, satisfy yourself that it has no side effects and then implement it in production system.

Don’t stop there. Monitor the production system. Some problems are never detected in test systems. They may occur in production. Handle them as required.

If you have a packaged product which handles large amount of data and many concurrent users, you must analyse the appropriate usage of Isolation Levels.