Tuesday, May 29, 2007

Asynchronous calls and remote callbacks using Lingo Spring Remoting

A very good technical article; at the very least it can help clarify remote object references (pass-by-value vs. pass-by-reference), along with some nice programming tricks.
What I find coolest is that Lingo can expose some methods of a single interface as synchronous and others as asynchronous. But in the article's sample code, I couldn't figure out how solve is defined as asynchronous while cancel and registerXX are synchronous.

See:
http://jroller.com/page/sjivan?entry=asynchronous_calls_and_callbacks_using

As mentioned in my previous blog entry, Lingo is the only Spring Remoting implementation that supports asynchronous calls and remote callbacks. Today I'll cover all the nitty gritty details of the async/callback related functionality along with the limitations and gotchas.

Asynchronous method invocation and callback support in Lingo is an awesome feature, and there are several use cases where it is an absolute must. Let's consider a simple and rather common one: you have a server-side application (say an optimizer) for which you want to write a remote client API. The API has methods like solve(), which are long running, and methods like cancel(), which stops an ongoing optimizer solve.

A synchronous API under such circumstances is not really suitable, since the solve() method could take a really long time to complete. The client code could spawn its own thread and do its own exception management, but this becomes really kludgy. Plus, you have to worry about network timeout issues. You might be thinking "I'll just use JMS if I need an asynchronous programming model". You could use JMS, but think about the API you're exposing: it's going to be a generic JMS API where the client registers JMS listeners and sends messages to JMS destinations using the JMS API. Compare this to a remote API where the client is actually working with the service interface itself.
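To see why the do-it-yourself approach gets kludgy, here is a minimal sketch (plain java.util.concurrent, no remoting; all names such as SyncOptimizer are hypothetical) of the thread and exception plumbing every client would have to write around a synchronous solve():

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ManualAsyncClient {

    // Hypothetical synchronous stand-in for the remote service stub
    interface SyncOptimizer {
        float solve();
    }

    public static void main(String[] args) throws InterruptedException {
        SyncOptimizer optimizer = () -> 42.0f;  // pretend this is a slow remote call
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // The client owns the thread, the Future, and the exception handling
        Future<Float> result = pool.submit(optimizer::solve);
        try {
            System.out.println("solution = " + result.get());
        } catch (ExecutionException e) {
            // remote failures arrive wrapped; the client must unwrap them itself
            System.err.println("solve failed : " + e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Every caller repeats this boilerplate, and progress reporting or cancellation would need yet more hand-rolled machinery, which is exactly what Lingo provides for free.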

Lingo combines the elegance of Spring Remoting with the ability to make asynchronous calls. Let's continue with our Optimizer example and implement a solution using Lingo and Spring.

OptimizerService interface

public interface OptimizerService {

    void registerCallback(OptimizerCallback callback) throws OptimizerException;

    void solve();

    void cancel() throws OptimizerException;
}

The solve() method is asynchronous while the cancel() and registerCallback(..) methods are not. By convention, asynchronous methods must not have a return value and must not throw exceptions. The registerCallback(..) method registers a client callback with the Optimizer. For an argument to act as a remote callback, it must implement java.util.EventListener or java.rmi.Remote. In this example the OptimizerCallback interface extends java.util.EventListener. If the argument implements neither of these interfaces, it must implement java.io.Serializable, and it will then be passed by value.

OptimizerCallback interface

public interface OptimizerCallback extends EventListener {

void setPercentageComplete(int pct);

void error(OptimizerException ex);

void solveComplete(float solution);
}

The callback API has a method for the Optimizer to set the percentage complete, a method to report an error during the solve() process (remember that solve() is asynchronous, so it cannot throw an exception directly), and finally the solveComplete(..) callback to inform the client that the solve is complete, along with the solution.

OptimizerService implementation

public class OptimizerServiceImpl implements OptimizerService {

    private static final Log LOG = LogFactory.getLog(OptimizerServiceImpl.class);

    private volatile OptimizerCallback callback;
    private volatile boolean cancelled = false;

    public void registerCallback(OptimizerCallback callback) {
        LOG.info("registerCallback() called ...");
        this.callback = callback;
    }

    public void solve() {
        LOG.info("solve() called ...");
        float currentSolution = 0;

        // simulate a long-running solve process
        for (int i = 1; i <= 100; i++) {
            try {
                currentSolution += i;
                Thread.sleep(1000);
                if (callback != null) {
                    callback.setPercentageComplete(i);
                }
                if (cancelled) {
                    break;
                }
            } catch (InterruptedException e) {
                System.err.println(e.getMessage());
            }
        }
        // guard against a solve() invoked before any callback was registered
        if (callback != null) {
            callback.solveComplete(currentSolution);
        }
    }

    public void cancel() throws OptimizerException {
        LOG.info("cancel() called ...");
        cancelled = true;
    }
}

The solve() method sleeps for a while and invokes setPercentageComplete(..) on the callback registered by the client. The code is pretty self-explanatory.

Optimizer Application context - optimizerContext.xml

We now need to export this service using Lingo Spring Remoting. The typical Lingo Spring configuration, as described in the Lingo docs and samples, is:

<?xml version="1.0" encoding="UTF-8"?>

<beans>
    <bean id="optimizerServiceImpl" class="org.sanjiv.lingo.server.OptimizerServiceImpl" singleton="true"/>

    <bean id="optimizerServer" class="org.logicblaze.lingo.jms.JmsServiceExporter" singleton="true">
        <property name="destination" ref="optimizerDestination"/>
        <property name="service" ref="optimizerServiceImpl"/>
        <property name="serviceInterface" value="org.sanjiv.lingo.common.OptimizerService"/>
        <property name="connectionFactory" ref="jmsFactory"/>
    </bean>

    <bean id="jmsFactory" class="org.activemq.ActiveMQConnectionFactory">
        <property name="brokerURL" value="tcp://localhost:61616"/>
        <property name="useEmbeddedBroker">
            <value>true</value>
        </property>
    </bean>

    <bean id="optimizerDestination" class="org.activemq.message.ActiveMQQueue">
        <constructor-arg index="0" value="optimizerDestinationQ"/>
    </bean>
</beans>

In this example, I'm embedding a JMS broker in the Optimizer process. However you are free to use an external JMS broker and change the JMS Connection Factory configuration appropriately.

Note : The above optimizerContext.xml is the typical configuration found in the Lingo docs/examples,
but it is not the ideal configuration. It has some serious limitations which I'll cover in a bit,
along with the preferred "server" configuration.

OptimizerServer

The "main" class that exports the OptimizerService simply needs to instantiate the "optimizerServer" bean in the optimizerContext.xml file.

public class OptimizerServer {

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("Usage : java org.sanjiv.lingo.server.OptimizerServer ");
            System.exit(-1);
        }
        String applicationContext = args[0];

        System.out.println("Starting Optimizer ...");
        FileSystemXmlApplicationContext ctx = new FileSystemXmlApplicationContext(applicationContext);

        ctx.getBean("optimizerServer");

        System.out.println("Optimizer Started.");

        ctx.registerShutdownHook();
    }
}

The Client

In order for the client to look up the remote OptimizerService, we need to configure the client side Spring application context as follows:

Client Application Context - clientContext.xml

<?xml version="1.0" encoding="UTF-8"?>

<beans>
    <bean id="optimizerService" class="org.logicblaze.lingo.jms.JmsProxyFactoryBean">
        <property name="serviceInterface" value="org.sanjiv.lingo.common.OptimizerService"/>
        <property name="connectionFactory" ref="jmsFactory"/>
        <property name="destination" ref="optimizerDestination"/>
        <property name="remoteInvocationFactory" ref="invocationFactory"/>
    </bean>

    <bean id="jmsFactory" class="org.activemq.ActiveMQConnectionFactory">
        <property name="brokerURL" value="tcp://localhost:61616"/>
    </bean>

    <bean id="optimizerDestination" class="org.activemq.message.ActiveMQQueue">
        <constructor-arg index="0" value="optimizerDestinationQ"/>
    </bean>

    <bean id="invocationFactory" class="org.logicblaze.lingo.LingoRemoteInvocationFactory">
        <constructor-arg>
            <bean class="org.logicblaze.lingo.SimpleMetadataStrategy">
                <constructor-arg value="true"/>
            </bean>
        </constructor-arg>
    </bean>
</beans>

Now all a client needs to do is obtain a handle on the remote OptimizerService by looking up the bean "optimizerService" configured in clientContext.xml.

OptimizerCallback implementation

Before going over the sample Optimizer client code, let's first write a sample implementation of the OptimizerCallback interface - one which the client will register with the remote Optimizer by invoking the registerCallback(..) method.

public class OptimizerCallbackImpl implements OptimizerCallback {

    private boolean solveComplete = false;
    private OptimizerException callbackError;
    private final Object mutex = new Object();

    public void setPercentageComplete(int pct) {
        System.out.println("+++ OptimizerCallback :: " + pct + "% complete..");
    }

    public void error(OptimizerException ex) {
        System.out.println("+++ OptimizerCallback :: Error occurred during solve : " + ex.getMessage());
        synchronized (mutex) {
            callbackError = ex;
            solveComplete = true;
            mutex.notifyAll();
        }
    }

    public void solveComplete(float solution) {
        System.out.println("+++ OptimizerCallback :: Solve complete with answer : " + solution);
        synchronized (mutex) {
            solveComplete = true;
            mutex.notifyAll();
        }
    }

    public void waitForSolveComplete() throws OptimizerException {
        // Check the flag and wait while holding the lock; checking it outside
        // the synchronized block would allow a notifyAll() to slip in between
        // the check and the wait(), leaving this thread waiting forever.
        synchronized (mutex) {
            while (!solveComplete) {
                try {
                    mutex.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            if (callbackError != null) {
                throw callbackError;
            }
        }
    }
}

OptimizerClient

public class OptimizerClient {

    public static void main(String[] args) throws InterruptedException {
        if (args.length == 0) {
            System.err.println("Usage : java org.sanjiv.lingo.client.OptimizerClient ");
            System.exit(-1);
        }

        String applicationContext = args[0];
        FileSystemXmlApplicationContext ctx = new FileSystemXmlApplicationContext(applicationContext);

        OptimizerService optimizerService = (OptimizerService) ctx.getBean("optimizerService");
        OptimizerCallbackImpl callback = new OptimizerCallbackImpl();

        try {
            optimizerService.registerCallback(callback);
            System.out.println("Client :: Callback Registered.");

            optimizerService.solve();
            System.out.println("Client :: Solve invoked.");

            Thread.sleep(8 * 1000);
            System.out.println("Client :: Calling cancel after 8 seconds.");

            optimizerService.cancel();
            System.out.println("Client :: Cancel finished.");
            //callback.waitForSolveComplete();

        } catch (OptimizerException e) {
            System.err.println("An error was encountered : " + e.getMessage());
        }
    }
}

The test client registers a callback and calls the asynchronous method solve(). Note that the solve method in our sample OptimizerService implementation takes ~100 seconds to complete. The client then prints out the message "Client :: Solve invoked.". If the solve() call is indeed invoked asynchronously by Lingo under the hood, this message should be printed to the console immediately and not after 100 seconds. The client then calls cancel() after 8 seconds have elapsed.

Here's the output when we run the Optimizer Server and Client

Notice that the solve method has been called asynchronously, and after 8 seconds the client makes the cancel() call; however, the server does not seem to receive this call and continues with its setPercentageComplete(..) callbacks.

I asked this question on the Lingo mailing list but did not get a response. This misbehaviour was pretty serious: while an asynchronous call like solve() returned immediately on the client, the client was not able to make another call like cancel() until the solve() method completed execution on the server... which defeats the purpose of a method like cancel().

Lingo and ActiveMQ are open source, so I rolled up my sleeves and ran the whole thing through a debugger. Debugging multithreaded applications can get tricky, but after spending several hours I was able to get to the bottom of this issue.

Recollect that we exported the OptimizerService using the class org.logicblaze.lingo.jms.JmsServiceExporter in optimizerContext.xml. On examining the source, I found that this class creates a single JMS Session which listens for messages on the configured destination ("optimizerDestinationQ" in our example); when messages are received, it invokes a Lingo listener which translates the inbound message into a method invocation on the exported OptimizerServiceImpl service object.

The JMS spec clearly states

A Session object is a single-threaded context for producing and consuming messages.
...
It serializes execution of message listeners registered with its message consumers.

Basically a single JMS Session is not suitable for receiving concurrent messages. I understood why the cancel() method wasn't being invoked until the solve() method completed. But this behavior still didn't make sense from an API usage perspective.
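The serialized-delivery effect can be illustrated without JMS at all. In this sketch (plain java.util.concurrent; the class and method names are hypothetical), a single-threaded executor stands in for the lone session: a quick "cancel" task queues up behind a long "solve" task, while a pool of threads, standing in for multiple sessions, dispatches the two concurrently:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SessionConcurrencyDemo {

    // Submit a long "solve" followed by a quick "cancel" and report whether
    // cancel managed to run while solve was still executing.
    static boolean cancelRunsConcurrently(ExecutorService sessions) throws InterruptedException {
        CountDownLatch cancelRan = new CountDownLatch(1);
        sessions.submit(() -> {                  // "solve" invocation
            try { Thread.sleep(500); } catch (InterruptedException ignored) { }
        });
        sessions.submit(cancelRan::countDown);   // "cancel" invocation
        boolean ran = cancelRan.await(100, TimeUnit.MILLISECONDS);
        sessions.shutdownNow();
        return ran;
    }

    public static void main(String[] args) throws InterruptedException {
        // One "session": cancel is stuck in the queue behind solve
        System.out.println("single session : "
                + cancelRunsConcurrently(Executors.newSingleThreadExecutor()));
        // A pool of "sessions": cancel is dispatched on another thread immediately
        System.out.println("session pool   : "
                + cancelRunsConcurrently(Executors.newFixedThreadPool(2)));
    }
}
```

The single-executor case prints false (cancel never ran while solve was active), the pooled case prints true, which is precisely the behaviour gap between the exporter's one session and the pooled listener containers discussed next.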

Fortunately, Spring 2.0 added support classes for receiving concurrent messages, which is exactly what we need (yep, Spring rocks!). There are a few different support classes, such as DefaultMessageListenerContainer, SimpleMessageListenerContainer, and ServerSessionMessageListenerContainer.

The ServerSessionMessageListenerContainer "dynamically manages JMS Sessions, potentially using a pool of Sessions that receive messages in parallel". This class "builds on the JMS ServerSessionPool SPI, creating JMS ServerSessions through a pluggable ServerSessionFactory".

I tried altering optimizerContext.xml to use this class.

optimizerContextPooledSS.xml

<?xml version="1.0" encoding="UTF-8"?>

<beans>
    <bean id="optimizerServiceImpl" class="org.sanjiv.lingo.server.OptimizerServiceImpl" singleton="true"/>

    <bean id="optimizerServerListener" class="org.logicblaze.lingo.jms.JmsServiceExporterMessageListener">
        <property name="service" ref="optimizerServiceImpl"/>
        <property name="serviceInterface" value="org.sanjiv.lingo.common.OptimizerService"/>
        <property name="connectionFactory" ref="jmsFactory"/>
    </bean>

    <bean id="optimizerServer" class="org.springframework.jms.listener.serversession.ServerSessionMessageListenerContainer">
        <property name="destination" ref="optimizerDestination"/>
        <property name="messageListener" ref="optimizerServerListener"/>
        <property name="connectionFactory" ref="jmsFactory"/>
    </bean>

    <bean id="jmsFactory" class="org.activemq.ActiveMQConnectionFactory">
        <property name="brokerURL" value="tcp://localhost:61616"/>
        <property name="useEmbeddedBroker">
            <value>true</value>
        </property>
    </bean>

    <bean id="optimizerDestination" class="org.activemq.message.ActiveMQQueue">
        <constructor-arg index="0" value="optimizerDestinationQ"/>
    </bean>
</beans>

Unfortunately the behavior was still the same - cancel() was not executing on the server until solve() completed. I posted this question on the Spring User list but did not get a response. This class uses the ServerSessionPool SPI so I'm not sure if there is a problem with the Spring class, the ActiveMQ implementation of this SPI or something that I've done wrong.

Anyway, I was able to successfully configure the DefaultMessageListenerContainer class and observed the desired behavior. In contrast to ServerSessionMessageListenerContainer, DefaultMessageListenerContainer "creates a fixed number of JMS Sessions to invoke the listener, not allowing for dynamic adaptation to runtime demands". While ServerSessionMessageListenerContainer would have been ideal, DefaultMessageListenerContainer is good enough for most use cases, as you'd typically want some sort of thread-pooled execution on the server anyway.

optimizerContextPooled.xml

<?xml version="1.0" encoding="UTF-8"?>

<beans>
    <bean id="optimizerServiceImpl" class="org.sanjiv.lingo.server.OptimizerServiceImpl" singleton="true"/>

    <bean id="optimizerServerListener" class="org.logicblaze.lingo.jms.JmsServiceExporterMessageListener">
        <property name="service" ref="optimizerServiceImpl"/>
        <property name="serviceInterface" value="org.sanjiv.lingo.common.OptimizerService"/>
        <property name="connectionFactory" ref="jmsFactory"/>
    </bean>

    <bean id="optimizerServer" class="org.springframework.jms.listener.DefaultMessageListenerContainer">
        <property name="concurrentConsumers" value="20"/>
        <property name="destination" ref="optimizerDestination"/>
        <property name="messageListener" ref="optimizerServerListener"/>
        <property name="connectionFactory" ref="jmsFactory"/>
    </bean>

    <bean id="jmsFactory" class="org.activemq.ActiveMQConnectionFactory">
        <property name="brokerURL" value="tcp://localhost:61616"/>
        <property name="useEmbeddedBroker">
            <value>true</value>
        </property>
    </bean>

    <bean id="optimizerDestination" class="org.activemq.message.ActiveMQQueue">
        <constructor-arg index="0" value="optimizerDestinationQ"/>
    </bean>
</beans>
Note : Although some Lingo examples create the destination as a Topic (ActiveMQTopic)
with the org.logicblaze.lingo.jms.JmsServiceExporter class, you must use a Queue when
using multiple JMS sessions for concurrent message retrieval, since a message on a Topic
is received by all listeners, which is not what we want.

Here's the result when using optimizerContextPooled.xml

You can download the complete source for this here and run the sample server and client. JRoller doesn't allow uploading .zip files, so I've uploaded the sample as a .jar file instead. The source distribution has a Maven 1.x project file; to build, simply run "maven". To run the optimizer server without pooled JMS listeners, run startOptimizer.bat under dist/bin/. To run with pooled JMS listeners, run startOptimizerPooled.bat, and to run the test client, run startClient.bat.

I am using this architecture to provide a remote API for our C++ optimizer. The C++ optimizer has a thin JNI layer which loads the Spring application context file, and the OptimizerServiceImpl has a bunch of native methods which are tied to the underlying C++ optimizer functionality using the JNI function RegisterNatives(). Do you Lingo? I'd like to hear how others are using Lingo/Spring Remoting.

Monday, March 12, 2007

Configuring vsftp on Ubuntu

Ugh, this little thing is really annoying. I found a post online, followed it, and it worked, so I'm just pasting it here.
http://linux.hiweed.com/node/1080

Wednesday, February 28, 2007

Version management during the software integration testing phase

The "software integration testing phase" here refers to version management after the development team releases the first testable build. A few points need clarifying: 1) This does not mean that software testing starts only after the first release; testing accompanies the entire software development life cycle, and it does not necessarily mean testing code, it can also include reviewing documents. 2) Why do I say "integration testing"? Because in "The Art of Software Testing", system testing refers to non-functional testing, so I use "integration testing" to mean functional testing of the system (as opposed to unit testing).
This reminds me of the testing I organized a while back for the Anhui Tongcai project. It was chaotic: the test team and the development team lacked effective communication, largely because version management itself was chaotic. When we talked about a bug, the bug was never pinned to a well-defined software version; we always said "the latest version", but obviously many bugs were not in the latest version. If we always say "the latest version", it really means we have no version management at all. Also, although the test team adopted an iterative testing approach, the development team had no clearly defined development process. In fact, the testing process a test team adopts should depend on the development team's process: if development is iterative, each iteration produces a testable build, and naturally the test team should iterate as well. On that project I defined each test iteration as three days, within which all test cases had to be re-executed and the test environment rebuilt, and the software version was effectively defined by the test team... All I can say now is that at least the test team followed some testing process, and having one is better than having none.
Yesterday I read a document, "Revision Control with Subversion". The section on branches defines a version management process for the release phase of software, in which branches play an important role (the document defines several branch patterns, including release branches and feature branches). This kind of branch is called a release branch; I'll quote the process directly.
Here's where version control can help. The typical procedure looks like this:
• Developers commit all new work to the trunk. Day-to-day changes are committed to /trunk: new features, bugfixes, and so on.
• The trunk is copied to a “release” branch. When the team thinks the software is ready for release (say, a 1.0 release), then /trunk might be copied to /branches/1.0.
• Teams continue to work in parallel. One team begins rigorous testing of the release branch, while another team continues new work (say, for version 2.0) on /trunk. If bugs are discovered in either location, fixes are ported back and forth as necessary (whether a bug is found on the trunk or the branch, its fix should be merged to the other side). At some point, however, even that process stops. The branch is "frozen" for final testing right before a release.
• The branch is tagged and released. When testing is complete, /branches/1.0 is copied to /tags/1.0.0 as a reference snapshot. The tag is packaged and released to customers (releasing a version should be decided jointly by the development and test teams; for example, once the test team declares that all tests have passed, the development team can publish a new release).
• The branch is maintained over time. While work continues on /trunk for version 2.0, bugfixes continue to be ported from /trunk to /branches/1.0. When enough bugfixes have accumulated, management may decide to do a 1.0.1 release: /branches/1.0 is copied to /tags/1.0.1, and the tag is packaged and released (even after the software has officially shipped, bugs may still be found; once they are fixed, a bugfix release is published).

Sunday, February 25, 2007

About performance testing

As things stand, I think performance testing is a highly specialized topic, not something just anyone can do by picking up a testing tool, because results obtained that way can be very far from the real ones, ultimately misleading developers and causing them to spend enormous effort in the wrong direction. For me, two things about performance testing need to be made clear: 1) What is performance testing, and what forms does it take? 2) Which metrics should a performance test measure, and how should the results be analyzed?
A few days ago I found a forum dedicated to software testing, www.sqaforum.com, which has a section on performance testing. It seems many people are as unclear about this as I am, especially tangled up in the relationship between performance test, load test, and stress test. I'm reposting a reply there that met with general agreement:
BASIC DEFINITIONS

This is an excerpt from my forthcoming book on performance and load testing.

While there is no universal consistency in how people use terms like performance test and robustness test, I can say that the definitions provided here are as much in the mainstream as any others.

The Definition of Performance Testing

The purpose of performance testing is to measure a system’s performance under load. As Humpty Dumpty said, a word can mean whatever one chooses it to mean, so it is worth our time to examine what we mean by the words “measure”, “performance” and “load”.

Performance testing is a measurement of performance characteristics, although sometimes the use of the word “testing” confuses people. Some performance professionals feel strongly that it is important to not use the term “performance testing”, but to call it performance measurement instead. They are concerned that this measurement will get confused with feature testing and debugging, which it is not. They point out that measurement is only testing if the collected measurements are checked against pre-established goals for performance, and that measurement is often done without preconceptions of required performance.

These people have a good point: clarity of terminology is important. But since most people use the term “performance testing” we will go with the majority and use it too.

The term performance can mean response time, throughput, availability, error rate, resource utilization, or another system characteristic (or group of them), which we are interested in measuring. “All promise outruns performance.” Ralph Waldo Emerson

Performance testing simulates the typical user experience under normal working conditions. The load is a typical, representative mix of demands on the system. (And, of course, there can be several different representative loads -- the work load at 2 p.m., at 2 a.m., etc.) Another name sometimes used for a performance test is a capacity test, though there is a minor difference in these terms as we will see later.

First, the performance testers need to define what the term performance means in a specific test situation -- that is, what the objectives are and what we need to measure in the test. The answer to this question is that we measure performance usually as a weighted mix of three characteristics of a system: throughput, response time and availability. In real-time systems, for example, the users need a guarantee that a task will always be completed within a fixed time limit. Performing a task correctly but a millisecond too late could literally be fatal.

The term load simply means the mix of demands placed on a system while we measure its performance and robustness characteristics. In practice, most loads vary continually, so later we will address the challenge of determining the most appropriate load(s) for testing. The terms work load and benchmark are sometimes used as synonyms for load. A benchmark usually means a standard load, one used to compare the performance of systems, system versions, or hardware environments, but the benchmark is not necessarily the actual mix of demands at any one user installation. The term work load is a synonym for a load, and you see both of the terms in this book: they are interchangeable.

Definition of Load Testing

In contrast to a performance test, a load test is a measurement of performance under heavy load: the peak or worst-case conditions. Because loads can have various sizes, more precise terms for this type of testing are peak-load testing or worst-case-load testing.

A performance test usually is done with a typical, representative load, but this measurement may not tell us much about the system’s behavior under heavy load. For example, let’s assume that the peak load on a system is only 15% more than the average load. The system performance may degrade gracefully – the system runs 15% slower at peak load. Often, though, the performance under load is non-linear: as the load increases by a moderate amount (in this case, 15%), the response time does not increase by a comparable percentage but instead becomes infinite because the system fails under the increased load.

Definition of Stress Testing

A stress test is one which deliberately stresses a system by pushing it beyond its specified limits. The idea is to impose an unreasonable load on the system, an overload, without providing the resources which the system needs to process that load.

In a stress test, one or more of the system resources, such as the processor, memory, or database I/O access channel, often “maxes out” and reaches saturation. (Practically, saturation can happen at less than 100% of the theoretical usable amount of the resource, for many reasons.)

This means that the testware (the test environment, test tools, etc.) must be sufficiently robust to support the stress test. We do not want the testware to fail before we have been able to adequately stress the system.

Many bugs found in stress testing are feature bugs which we cannot see with normal loads but are triggered under stress. This can lead to confusion about the difference between a feature bug and a stress bug. We will address this issue in the upcoming section entitled: “Testing Performance and Robustness versus Features”.

Some testers prize stress testing because it is so fruitful in finding bugs. Others think it is dangerous because it misdirects projects to fix irrelevant bugs. Stress testing often finds many bugs, and fixing these bugs leads to significant delays in the system delivery, which in turn leads to resistance to fixing the bugs. If we find a bug with a test case or in a test environment which we can’t connect to actual use, people are likely to dismiss it with comments like: "The users couldn’t do that.", “.. wouldn’t do that” or “... shouldn’t do that.”

Stress, Robustness and Reliability

Although stress, robustness and reliability are similar, the differences among them mean that we test them in related but different ways.

We stress a system when we place a load on it which exceeds its planned capacity. This overload may cause the system to fail, and it is the focus of stress testing.

Systems can fail in many ways, not just from overloading. We define the robustness of a system by its ability to recover from problems; its survivability. Robustness testing tries to make a system fail, so we can observe what happens and whether it recovers. Robustness testing includes stress testing but is broader, since there are many ways in which a system can fail as well as from overloading.

Reliability is most commonly defined as the mean time between failure (MTBF) of a system in operation, and as such it is closely related to availability. Reliability testing measures MTBF in test mode and predicts what the system reliability will be in live operation.

Robustness and reliability testing are discussed in the companion volume to this book, entitled “System Robustness Testing”.

Ross

---------------

Basically, this says that load tests and stress tests are both special cases of performance tests. More information can be found at http://www.qaforums.com/Forum2/HTML/000724.html, and another resource with real practical value is http://www.codeplex.com/PerfTesting.

Wednesday, January 31, 2007

Understanding Transaction Isolation

Ever since I started as a developer I have been in contact with transaction isolation levels, but frankly I have never truly understood the concept; I have always hovered in a state of half-understanding. I have looked for material online, but only searched Chinese pages, and honestly, domestic sites carry too few technical articles, with the same piece copied from site to site. Sadly, I still only half understand it. Must I live muddled and die muddled?
As I write these words, I keep trying to recall what it was yesterday that suddenly got me interested in transaction isolation levels again. I cannot remember. Last night was the company's year-end dinner; I drank a fair amount of red wine and a glass of baijiu, and fell asleep the moment I got home, as if all of yesterday's memories had been thrown away. Well, if I can't remember, so be it; the thing still needs doing, so today I googled "transaction isolation level" again and found some resources to read. As I write this I have not yet read them, which is a bit backwards. I'll stop here for now; once I have read them, I'll say whether I have truly understood transaction isolation levels, and if I have, I'll write my understanding down here.
The article below is quite good, so I decided to copy it here directly.
--------------------------

Understanding Transaction Isolation

http://www.expresscomputeronline.com/20040426/techspace01.shtml

Article summary

This is a topic related to SQL Server. However, the concept is applicable to any RDBMS. While auditing many applications, I have found that incomplete or no understanding of ‘Isolation Level’ leads to a lot of real life problems. These include performance degradation, blocking, locking as well as major deadlocks.

This article provides an easy-to-understand view of what ‘Isolation Levels’ really mean and when to use which level.

‘Isolation Level’ is a setting that decides how data which is a part of an ongoing transaction is made visible to other transactions.

Transactions are well understood. So what is the problem?

As all of us know, it is a unit of work. The work may contain many steps but either all steps happen or all steps don’t happen. This is fairly well known. Nothing new here. But consider how transactions occur in real life.

  • I am updating one row in a Begin Tran and Commit section. I have issued Begin Tran as well as the Update statement. Commit is not yet issued – because I want to perform some more actions in the same transactions on some other tables. If some other user wants to read this row that I have updated but not yet committed, what should happen?
  • Consider another scenario. I start a transaction. I calculate a total of a field based upon all rows in a table. Now, I need to add a new record in another table which contains this total. Now, can the original table be changed by some other user after I calculate the total? In which case, there could be a mismatch. Do you want to take such chances?
  • Another scenario. I am working on some transaction table between a range of keys within a transaction—say 10 and 20. There were only 5 records when I read the range – 10, 12, 14, 16, 20. Now I am working on other things in the transaction. Before I could commit the transaction, someone added another row with a key value of 11. Now, my base assumption about what records I read between 10 and 20 and further work upon them itself is wrong. Problem!!! Is it not?
  • I start reading a long table. It is not a transaction at all. But other users want to refer to that table for updating some fields in specific rows. The query takes 20 minutes to read all the rows. What happens to other users who are trying to update the rows? Do they wait for 20 minutes? Or they are allowed to update the rows even when the rows are being read in a large query? What if the query was used to generate a summary report containing grand totals? The total would be wrong because after the summation started, some rows have changed. Some of these rows could have changed after the summation occurred. What’s to be done now?

As you can see all these situations are confusing and prone to inaccuracies. To avoid such problems we have a feature called “Isolation Levels”.

Isolation Levels are applicable to transactions. These decide the visibility of information which is a part of an ongoing transaction.

In very simple terms Isolation Levels decide “What happens when some data being referred to (for reading or writing) within an incomplete transaction is also being referred to (for reading or writing) from another connection (or user – actually it is called another ‘transaction lock space’)?”

To make things sound technical, all these problems have been given nice and complex sounding names.

Data visibility problems that can occur during a transaction

Let us understand some jargon.

  • Transaction Lock Space:

Each transaction performs certain operations (select / insert / update / delete) on one or more rows of one or more tables. A transaction starts with the Begin Tran command and ends with Commit; if unsuccessful, it ends with the Rollback command. Now, after Begin Tran is issued and before Commit is issued, multiple tables may participate in the transaction-related commands. Specific parts of these tables need to be locked during this phase to ensure that other users do not interfere with the transaction. This is called the Transaction Lock Space.

  • Uncommitted (dirty) data:

Consider this code snippet (lines starting with -- are comments):

1. Begin Transaction
   -- we want to change the customer status from active = "Yes" to active = "No"
2. Update Customer Set active = 'No' where CustomerID = 2324
   -- some more commands
3. Select * from Customer where CustomerID = 2324
4. Commit Transaction

Now, if the transaction commits, the value will be “No”. If it does not commit (for whatever reason), the value will remain “Yes”.

Consider that the value has already been changed to “No”. But there are more commands to be executed before the entire transaction commits. These commands are time consuming and take, say, 3 minutes. During these three minutes, if some user outside the transaction lock space reads the value of Active field in the Customer table for ID 2324, what should they see? “Yes” or “No”? The answer is simple. The value is changed but not committed. Therefore, external queries should still show “Yes”.

Now what would happen if a Select active from Customer where CustomerID = 2324 returned “No” to another user, and the transaction then rolled back? This is what Uncommitted Data means. Ideally it should not be visible outside the transaction.

However, consider the same Select executed within the transaction (line 3). It should, and will, return “No”.
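To make the inside/outside distinction concrete, here is a sketch of the same situation as two sessions. This assumes the Customer table and CustomerID from the example above; “Session 1” and “Session 2” are just labels for two separate connections:

```sql
-- Session 1: the transaction is still open; 'No' is not yet committed
BEGIN TRANSACTION;
UPDATE Customer SET active = 'No' WHERE CustomerID = 2324;
-- ... long-running work, no COMMIT yet ...

-- Session 2, default isolation (READ COMMITTED):
-- this SELECT does not see the uncommitted 'No'
SELECT active FROM Customer WHERE CustomerID = 2324;

-- Session 2, after lowering the isolation level:
-- this SELECT can return the dirty, uncommitted value 'No'
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT active FROM Customer WHERE CustomerID = 2324;
```

If Session 1 later rolls back, the “No” that the second query saw never actually existed as committed data.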

Now let us consider various problems that can occur. The problems can be of three types:

1. Dirty Read

2. Non-repeatable read

3. Phantom rows

- Dirty reads

This is when connections outside the transaction lock space can read uncommitted data. You don’t want this to happen in most cases. The only reason you may want to allow it is when you are creating a report which tolerates a certain amount of inaccuracy. You think this situation is rare? Not really. Suppose the report compares the current month’s sales with last month’s and returns a percentage difference, and takes 5 minutes to generate. During those 5 minutes, more transactions may get added to the sales data: typically 5 transactions, with an average value of Rs. 1000, against a monthly total of 30-40 lacs. In such cases you really don’t need to worry about absolute accuracy. You can intentionally allow dirty reads because the impact is really not significant from a business perspective.

- Non-repeatable read

Consider the above code. We change the value of Active to “No” in line 2. In line 3 we expect the value to be “No” because we are querying the row ‘within’ the transaction.

Now what would happen if some other transaction were allowed to change the same value to, say, “Maybe”? In line 3 we would expect “No” but actually get “Maybe”. This problem is called a non-repeatable read. The idea is that within a transaction, the same data read any number of times should yield consistent results. If it is not the same, the reads are ‘non-repeatable’.
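As a sketch, here is how the Repeatable Read level (covered below) prevents this, again using the hypothetical Customer example and two separate connections:

```sql
-- Session 1
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;
SELECT active FROM Customer WHERE CustomerID = 2324;  -- first read

-- Session 2: under Session 1's REPEATABLE READ, this UPDATE is blocked
-- until Session 1 commits; under READ COMMITTED it would go through
UPDATE Customer SET active = 'Maybe' WHERE CustomerID = 2324;

-- Session 1: the second read is guaranteed to match the first
SELECT active FROM Customer WHERE CustomerID = 2324;
COMMIT TRANSACTION;
```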

- Phantom Rows

This was explained in the introduction (the range 10 to 20 example). The concept is simple. You have already read a range of data based upon a key value; another transaction then inserts a new row whose key happens to fall within that range. The range you originally read is now invalid, because a “phantom row” was added. This is also a major problem.
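A sketch of the introduction’s example, assuming the Serializable level (covered below) and hypothetical table and column names TransTable and KeyCol:

```sql
-- Session 1
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
SELECT KeyCol FROM TransTable
WHERE KeyCol BETWEEN 10 AND 20;   -- reads 10, 12, 14, 16, 20

-- Session 2: under Session 1's SERIALIZABLE level, this INSERT into
-- the locked key range is blocked until Session 1 finishes
INSERT INTO TransTable (KeyCol) VALUES (11);

-- Session 1: re-running the range query returns the same five rows;
-- no phantom row can appear mid-transaction
SELECT KeyCol FROM TransTable
WHERE KeyCol BETWEEN 10 AND 20;
COMMIT TRANSACTION;
```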

Important learning

Although the above three issues are listed as problems, they may not always be considered problems! Paradoxical? Not really. As explained for Dirty Read, whether something is a problem or not depends entirely upon the business context. Technology exists to allow these problems to occur as well as to prevent them. The choice is yours. I know this complicates matters. But if these matters were not complicated, what would IT professionals be paid for!

Now let us see how to prevent these problems. The answer is: by using Isolation Levels.

Types of Isolation Levels

To solve the problem of …                          The Isolation Level should be …
Dirty Read                                         Read Committed (default in SQL Server)
Dirty Read and Non-Repeatable Read                 Repeatable Read
Dirty Read, Non-Repeatable Read and Phantom Rows   Serializable
To retain all three problems                       Read Uncommitted

Before we discuss each isolation level, let us understand how to set or change it in T-SQL.

Syntax

SET TRANSACTION ISOLATION LEVEL
{ READ COMMITTED
| READ UNCOMMITTED
| REPEATABLE READ
| SERIALIZABLE
}

Remember:

- This setting applies to all transactions which are happening within a single connection.

- Multiple connections can have different Isolation Levels.

- Within a single connection, you can change the Isolation Level for specific transactions.

- You must know the kind of transaction you are performing and its business requirements before you can decide the right isolation level.

- Isolation level does not just affect the current transactions in a given connection. It also affects how other transactions issued within other connections will behave.

- Therefore, you should have a good idea of what kind of concurrent transactions occur in your application when the system is live and many users are active. Getting this wrong is the single biggest cause of low performance, locking and blocking, as well as deadlocks.

- Most developers think of individual transactions as though they were occurring in a single-user system.

- It is difficult to envisage all permutations of possible transactions which can occur together. But it is very much possible to plan for it in a structured and informed manner. This requires additional effort, monitoring live systems and tweaking of specific transactions and so on.

- Unfortunately, this additional effort is rarely a part of system deployment! Such problems surface only after the system load increases in such a way that small, unnoticed issues become amplified due to large volume of data and / or large number of concurrent users.

- Only one of the options can be set at a time.

- It remains in effect till you explicitly change the option by issuing another command. This is an important aspect to consider.

Best Practice : When you change the Isolation Level for a transaction, remember to set it back to the original level after the transaction is completed.
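The best practice above can be sketched as follows. This assumes the connection’s original level was the SQL Server default, Read Committed; the transaction body is just the earlier Customer example:

```sql
-- switch to the stricter level this particular transaction needs
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

BEGIN TRANSACTION;
    UPDATE Customer SET active = 'No' WHERE CustomerID = 2324;
COMMIT TRANSACTION;

-- restore the original level, so later transactions on this
-- connection do not silently run under SERIALIZABLE
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
```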

Isolation levels

  • Read Uncommitted

This is as good (or bad) as not having any isolation. All data which is uncommitted is readable from any connection. This should not be used unless you have a very good reason to do so.

  • Read Committed

This prevents dirty reads but does not prevent non-repeatable reads or phantom rows. This is the default. Although it is the default, that does not mean this isolation level is ideal for all transactions. One of the regular tendencies amongst techies is to simply use default values without giving it a second thought! I cannot imagine the number of phantom and non-repeatable reads that must be occurring in the world because someone simply used the ‘default’ value. It is a scary thought, to say the least.

This level is obviously more restrictive than the ‘Read Uncommitted’ level.

  • Repeatable read

This prevents dirty reads as well as non-repeatable reads. It does not prevent phantom rows. This is more restrictive than the Read Committed level. When I say restrictive, what does that mean? It means that the chances of other transactions having to wait for this one to finish are INCREASED. Another way of saying this: Repeatable Read reduces concurrency compared to the Read Committed level.

  • Serializable

This is the most restrictive of the options. It should never be used as the default level. If you do use it as the default, it will most probably turn your application into a single-user system!

This prevents all three problems.

How do Isolation Levels work?

You will notice that we have not yet addressed this question at all in this article. The full answer is another Pandora’s box, but in short: SQL Server uses the appropriate types of locks to achieve the desired effects. We will see what locks are and how Isolation Levels are implemented using them in the next article.

In the meantime, please go through your code and see whether you are simply using the default level, or whether there are some transactions which merit a different isolation level. If you feel like changing a level, never do it in a production system first. It can create havoc. Try it out in a test environment, satisfy yourself that it has no side effects, and only then implement it in the production system.

Don’t stop there. Monitor the production system. Some problems are never detected in test systems. They may occur in production. Handle them as required.

If you have a packaged product which handles large amounts of data and many concurrent users, you must analyse the appropriate usage of Isolation Levels.