Performance testing is one of the most important practices associated
with applying software performance engineering principles to acceptance testing
and other quality assurance processes. This blog entry discusses the key role scalability
models play in performance testing. A scalability model hypothesizes a
relationship between a performance-oriented response time or throughput goal
for a scenario and one or more scalability dimensions
associated with that specific scenario. One such scalability dimension, the arrival rate of customer requests, is a
staple of performance stress tests, for example.
One of the main assertions here is the value of making
explicit the scalability assumptions that are implicit in defining performance
acceptance testing criteria. Assumptions about the dimensions of an
application’s scalability inform the definition of an appropriate performance test matrix to ensure the
testing process is both rigorous and effective. Furthermore, feedback from empirical
measurement data derived from performance tests should be used to validate the
scalability assumptions that were formulated. This is especially important
because the scalability assumptions made at the outset of a
project are often incomplete. Validating the model against the measurements
generated by the performance tests can even show it is not accurate enough for
its intended purpose, which is essentially to assess the quality of the
application.
Performance testing
Performance tests are frequently used to verify that new
software releases meet quality assurance goals, including the so-called non-functional business requirements
associated with application responsiveness. In building responsive web sites
and creating scalable applications of all sorts, these requirements are
understood to be crucial in attaining higher levels of customer satisfaction.
In acceptance testing, elaborate performance benchmarks are
often constructed. These benchmarks are executed to determine the hardware
configuration needed to meet capacity requirements for a new or revised system,
based, for example, on the expected number of concurrent users of those new or
enhanced program features. Experienced performance engineering specialists
learn to use a variety of commercial tools to generate synthetic performance
benchmarks and execute them, which involves defining the test scenarios to execute,
providing facilities for ramping up the number of simulated user interactions, and so on.
In short, in order to ensure the quality of a new release of
a software application, it is a common practice in performance engineering to
develop and execute synthetic performance tests. The goal of this effort is to predict
reliably the performance and scalability of that application prior to deploying
it in production. To that end, many software development organizations expend
considerable time and expense developing and executing performance tests to try
to ascertain whether a new software release meets its quality objectives.
I am not, however, primarily concerned here with stress
testing designed to aid in capacity planning, although scalability models should
and often do play an important role in decisions about which test scenarios to automate
and exercise. Instead, the focus here is on performance timing tests, which are
designed to execute and measure scenarios to ensure their performance meets their
response time objectives. Performance timing tests are also associated with
regression testing, ensuring that subsequent changes to the application do not
undermine the service levels obtained in a prior release.
The process of developing performance tests includes determining
what specific aspects of an
application to test. This is challenging for any complex application with a
myriad of access paths and interaction scenarios. Here is where a scalability model
is most helpful. Consider a database query and reporting scenario that returns
a large number of results that must be sorted. Understanding that sorting is an
operation whose cost is sensitive to the number of items in the result set suggests
that timing tests be developed for the scenario that test the assumption that
response time scales appropriately with the size of the result set the query
returns. More generally, a scalability model explicitly identifies the major factors
impacting application performance, and timing tests need to be formulated along
the same dimensions as the model.
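To make that scalability assumption concrete, here is a minimal sketch, in Python, of a timing test parameterized by result-set size. The run_query_and_report entry point and the specific sizes are hypothetical placeholders for the application's actual scenario driver.

```python
import time

# Hypothetical entry point for the query-and-report scenario under test;
# substitute the application's actual API or load driver call here.
def run_query_and_report(result_set_size):
    ...

# Result-set sizes chosen to span the scalability dimension in the model.
RESULT_SET_SIZES = [1_000, 10_000, 100_000, 1_000_000]

def time_scenario(sizes, repetitions=5):
    """Measure the median elapsed time for the scenario at each result-set size."""
    timings = {}
    for n in sizes:
        samples = []
        for _ in range(repetitions):
            start = time.perf_counter()
            run_query_and_report(n)
            samples.append(time.perf_counter() - start)
        samples.sort()
        timings[n] = samples[len(samples) // 2]  # median of the repetitions
    return timings
```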
After deciding what application-oriented scenarios need
testing, the next step is to formulate a response time objective or goal for
these scenarios. The scalability model is also very useful in this context because
the response time goals need to be realistic in light of the processing
resources required to execute the scenario. Finally, the performance test itself
needs to be developed – in effect a timing test for the scenario – instrumented,
and executed. For example, a series of timing tests showing
response times for the database query and reporting scenario increasing
exponentially with the size of the query result set raises an alarm. The
application as delivered is not performing as expected. Perhaps the sorting
utility that was implemented needs revisiting.
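Whether that alarm should sound can be checked programmatically. The sketch below compares measured timings against the n log n growth expected of a comparison sort, on the assumption (itself part of the scalability model) that the sort dominates the scenario's cost; the timings dictionary is carried over from the sketch above, and the tolerance factor is an arbitrary illustrative choice.

```python
import math

def check_sort_scaling(timings, tolerance=2.0):
    """Flag result-set sizes whose measured time grows faster than the
    n log n scaling expected of a comparison sort, relative to the smallest size."""
    sizes = sorted(timings)
    base_n, base_t = sizes[0], timings[sizes[0]]
    alarms = []
    for n in sizes[1:]:
        expected = base_t * (n * math.log(n)) / (base_n * math.log(base_n))
        if timings[n] > tolerance * expected:
            alarms.append((n, timings[n], expected))
    return alarms
```

Each tuple returned identifies a result-set size whose response time lies well beyond the n log n projection, which is the cue to go back and look at the sort implementation.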
Frequently, empirical measurements from timing tests executed
against the current system establish a baseline for the level
of performance customers will tolerate in the new release. In the absence of
any other predictive criteria, these current performance levels set the
expectations that a new release must meet. It is also important to verify
that new features and functions in a future release do not inadvertently undermine
the performance levels that customers are accustomed to experiencing in the
current release. Identical timing tests run against the old and the new system can
verify that subsequent changes or additions have not produced significant
performance regressions in legacy features.
The baseline expectations associated with the current system
are relevant because studies in Human Factors engineering show that people
accustomed to using the current system do not perceive response time improvements
unless they are of sufficient magnitude. Steve Seow’s “Engineering Time” is an excellent
summary of the Human Factors research in this area. Seow’s guidance for
software developers is that customers usually do not perceive response time
improvements in the application that are less than 20% better than the old
system. (Similarly, response time degradation less than 20% is not liable to be
noticed either.)
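A minimal sketch of such a baseline comparison, using the roughly 20% perceptibility threshold Seow cites, might look like the following. The dictionary inputs, keyed by scenario name, are an assumption of this sketch rather than any particular tool's output format.

```python
# Per Seow's guidance, changes smaller than roughly 20% are unlikely to be
# perceived by users; the threshold is configurable for stricter policies.
PERCEPTIBLE_CHANGE = 0.20

def compare_to_baseline(baseline, new_release, threshold=PERCEPTIBLE_CHANGE):
    """Compare median response times (seconds) for identical timing tests run
    against the old and new systems, keyed by scenario name."""
    report = {}
    for scenario, old_time in baseline.items():
        new_time = new_release.get(scenario)
        if new_time is None:
            continue  # scenario not measured in the new release
        change = (new_time - old_time) / old_time
        if change > threshold:
            report[scenario] = ("regression", change)
        elif change < -threshold:
            report[scenario] = ("perceptible improvement", change)
        else:
            report[scenario] = ("no perceptible change", change)
    return report
```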
In summary, a critical success factor in developing
performance tests, one that is unfortunately quite often under-emphasized in
practice, is the development and validation of a suitable scalability model for the application under test. The scalability
model dictates what specific aspects of the application’s performance need to
be investigated during testing. It is also important to regard the scalability
model assumptions as contingent until validated by specific test results. The
absence of a suitable scalability model to guide performance testing can spell
doom for that effort.
What is a scalability model?
A scalability model is conceptual in nature, providing a working hypothesis that describes the factors that contribute to the response times of the scenarios exercised in the performance tests. The scalability model implicit in stress testing a web-based application, for example, by successively increasing the arrival rate of customer requests assumes there is a relationship between the response time for web requests and the number of concurrent requests, namely
RT = f(N)
where N is the
number of concurrent requests. This is a scalability model that is derived from
mathematical queuing theory, where the relationship between response time and
the number of customers is modeled by a non-linear
function, such as the one that characterizes a simple M/M/1 open queuing model:
RT = ST + QT = ST + (ST * u / (1 - u))
where RT, ST, and QT are the Response Time, the Service Time, and the Queue Time, respectively, and u is the utilization of some bottlenecked resource the web request requires. As the utilization of this resource approaches its capacity limit, u → 100%, the response time for servicing those requests at the bottlenecked resource spikes characteristically, as illustrated in Figure 1. (Note that at 100% utilization, the function has a singularity and the result is undefined.)
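Since the formula is so simple, it can be evaluated directly to reproduce the characteristic spike as u approaches 100%, the shape Figure 1 illustrates. The 0.1-second service time in the sketch below is purely illustrative.

```python
def mm1_response_time(service_time, utilization):
    """RT = ST + QT = ST + (ST * u / (1 - u)) for a simple M/M/1 queue.
    Undefined at u = 1.0, where the resource is saturated."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time + service_time * utilization / (1.0 - utilization)

# Illustrative service time of 0.1 seconds; the spike near saturation is evident.
for u in (0.10, 0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"u = {u:.2f}  RT = {mm1_response_time(0.1, u):.2f} s")
```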
Wherever the relationship hypothesized in an M/M/1 queuing model holds, you should be able to execute a series of performance tests, steadily increasing the arrival rate of customer requests until a response time spike is evident, revealing a bottleneck that constrains performance.
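A sketch of that load-stepping procedure is shown below. The measure_response_time driver is a hypothetical stand-in for whatever load-generation tool is in use, and the spike criterion of three times the lightly loaded baseline is an arbitrary choice.

```python
def find_knee(measure_response_time, arrival_rates, spike_factor=3.0):
    """Step through increasing arrival rates until the measured response time
    spikes relative to the lightly loaded baseline, indicating a saturated resource."""
    baseline = measure_response_time(arrival_rates[0])
    for rate in arrival_rates[1:]:
        rt = measure_response_time(rate)
        if rt > spike_factor * baseline:
            return rate, rt  # arrival rate at which the bottleneck becomes evident
    return None  # no spike observed within the tested range
```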
The value of even simple queuing model equations, such as the one illustrated in Figure 1, is that they analytically derive behavior that you are apt to observe in real-world applications. They “explain” why it is unrealistic to expect the performance of complex applications to scale linearly with the number of concurrent customers, for example, and why a configuration where utilization at each resource is “load-balanced” avoids queuing bottlenecks and produces optimal performance. Fundamentally, this correspondence between what a model predicts and the actual behavior observed is why model building matters in performance analysis.
However, for complex, real-world applications, there is a risk that a simple model is too simple to be completely trustworthy. After all, the arrival rate of customer requests is only one likely dimension of scalability. Consider an electronic mail or messaging application, for example, like Hotmail or Outlook. The size of the message being sent, the number of recipients, the size and number of attachments, the network bandwidth available, as well as the distance of the various recipients from the e-mail server are all factors likely to impact message delivery time, the key performance-oriented indicator of service quality. In addition, some less obvious, ancillary factors, such as the number of recipient mailboxes that are configured, may need to be accounted for before the behavior the model predicts resembles the behavior actually observed.
The same considerations apply in performance testing. During
acceptance testing, measurements must be taken along each relevant axis of
scalability to ensure that application response goals are being met. Two dimensions of scalability – the number of customers (n) and the complexity of their database queries (C), for example – define a performance testing matrix that is n * C in size. If a third scalability dimension is required, such as m, the number of search criteria used in the database query, the number of test cases necessary increases combinatorially to n * m * C. The effectiveness of any sequence of performance tests that is executed is gauged by the coverage of that sequence of tests against the underlying model of the application’s scalability. If the tests do not assess performance against each of the relevant scalability factors, there is no reason to trust that they will accurately predict actual performance levels when the application is ultimately deployed.
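The combinatorial growth of the test matrix is easy to see in code. The sketch below enumerates every combination of illustrative levels along the three scalability dimensions discussed above; the specific levels chosen for each dimension are placeholders.

```python
from itertools import product

# Illustrative levels along each scalability dimension of the model:
# concurrent customers (n), query complexity (C), and search criteria (m).
CONCURRENT_CUSTOMERS = [1, 10, 100]
QUERY_COMPLEXITY = ["simple", "join", "aggregate"]
SEARCH_CRITERIA = [1, 3, 5]

# Every combination is a candidate test case; the matrix grows multiplicatively
# with each dimension added to the scalability model.
test_matrix = list(product(CONCURRENT_CUSTOMERS, QUERY_COMPLEXITY, SEARCH_CRITERIA))
print(f"{len(test_matrix)} test cases")  # 3 * 3 * 3 = 27 combinations
```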
The desirability of a model that is simple, but not too simple, is encapsulated in Occam’s razor: in effect, the simplest explanation among those that conform to the observational measurements is preferred. This balancing act inherent in distilling complex behavior into a simplified conceptual model is often discussed in mathematical circles with reference to an old joke, which goes something like the following:
One evening a man
leaves a bar and sees a drunk on his hands and knees in the bright arc of a nearby
streetlight. “Can I help you?” he volunteers. “Sure. I am looking for my car
keys,” the drunk slurs in response.
After a few minutes of
futile groping on the ground, the first man asks, “Are you sure this is where
you dropped them?” The drunk replies, “No, actually I dropped them over there.
But I am looking over here because the light’s better.”
Scientists involved in building conceptual models of complex
real-world behavior are reconciled to the fact that they are limited to
searching in areas where there is ample empirical measurement data to
illuminate the phenomenon requiring an explanation. Consequently, conceptual models that are abstracted from empirical data are always regarded as contingent, subject to revision whenever observational data is acquired that contradicts their assumptions.
Similarly, in performance investigations, we are often
constrained to searching only in the areas illuminated by the measurement data we manage
to gather. There are frequently aspects of performance that we do not have sufficient
measurement data to understand. We are simply too much in the dark about these
areas to render a reasonable judgment about them. Moreover, implicit scalability
assumptions are frequently embedded in the measurement tools that are employed,
determining what measurements are available and what measurements are not. Unfortunately,
performance engineers are often reduced to looking for a solution in aspects of
the system where there is sufficient instrumentation.
More on the contingency of models and the way implicit scalability assumptions are often embedded in our measurement tools in the next post in this series.