Skip to main content

Complications facing the YSlow scalability model. (Why is this web app running slowly? Part 5)

This is a continuation of a series of blog entries on this topic. The series starts here.

With the emergence of the Web 2.0 standards that enabled building dynamic HTML pages, web applications grew considerably more complicated, compared to the static view of the page composition process that the YSlow scalability model embodies. Let’s review some of the more important factors that complicate the (deliberately) simple YSlow model of Page Load Time. These complications include the following:

  • YSLow does not attempt to assess JavaScript execution time, something that is becoming more important as more and more complicated JavaScript is developed and deployed. JavaScript execution time, of course, can vary based on which path through the code is actually executed for a given request, so the delay can vary, depending on the scenario requested. The processing capacity of the underlying hardware where the web client that executes the script resides is also a factor. This is especially important when the web browser client is running inside a cell phone, since apps on cells phones are often subject to significant capacity constraints, including battery life, RAM capacity (which impacts cache effectiveness), and lack of external disk storage.
In addition to the variability in script execution time that is due to differences in the underlying hardware platform, the execution time of your web page’s JavaScript code can potentially vary based on the specific path a scenario takes through the code. Both of these factors suggest adapting conventional performance profiling tools for use with JavaScript executing inside the web browser. Using the tools at, it is also possible to measure when the web page is "visually complete," which is computed by taking successive video captures until the video image stabilizes. 
  • Web browser provide multiple sessions such that static content, where possible, can be downloaded in parallel. Current web browsers can create 4-8 TCP sessions per host for parallel processing, depending on the browser. Souders’ blog entry here compares the number of parallel sessions that are created in each of the major web browsers, but, unfortunately, that information is liable to be dated. These sessions persist while the browser is in communication with the web server, so there is also a savings in TCP connection processing whenever connections are re-used for multiple requests. (TCP connections require a designated sequence of SN, SYN-ACK and ACK packets to be exchanged by the Sender and Receiver machines prior to any application-oriented HTTP requests being transmitted over the link.) Clearly, parallel TCP sessions play havoc with the simple serial model for page load time expressed in Equations 3 & 4. 

Parallel browser sessions potentially so improve download performance that web performance experts like Souders have reassessed the importance of the YSlow rule recommending having fewer static HTTP objects to load. Another browser-dependent complication is that some of the browsers block parallel sessions in order to download JavaScript files. Since script code, when it executes, is likely to modify the DOM in a variety of ways, for the sake of integrity it is better to perform this operation serially. This behavior of the browser underlies the recommendation to reference external JavaScript files towards the end of the page’s HTML markup, when that is feasible, of course, which maximizes the benefit of parallel downloads for most pages.

The fact that the amount of JavaScript code embedded in web pages is on the rise, plus the potential for JavaScript downloads to block parallel downloading of other types of HTTP objects, has Souders and others recommending asynchronous loading of JavaScript modules. For example, Souders writes, “Increasing parallel downloads makes pages load faster, which is why [serial] downloading external scripts (.js files) is so painful.”[1] As a result, Souders recommends using asynchronous techniques for downloading external JavaScript files, including using the ControlJS library functions to control how your scripts are loaded by the browser. (See also for native JavaScript code examples.)

The basic technique defers loading external JavaScript until after the DOM’s window.onload event has fired by adding some code to the event handler to perform the script file downloads at that point in time. The firing of the window.onload event also marks the end of Page Load Time measurements. Using this or similar techniques, measurements of Page Load time can improve without actually improving the user experience, especially when the web page is not fully functional until after these scripts files are both downloaded and executed. It is not uncommon for one JavaScript code snippet to conditionally request the load of another script, something that also causes the browser to proceed serially. Moreover, the JavaScript execution time can result in considerable delay. Suffice to say that optimizing the performance of the JavaScript associated with the page is a topic for an entire book.
  • Equation #3 showing the YSlow scalability model provides a single term for the round trip time (RTT) for HTTP requests when round trip time is actually more accurately represented as an average RTT where the underlying distribution of round trip times is often non-uniform. Some of the factors that cause variability in round trip time include:
    •  content is often fetched from multiple web servers residing at different physical locations. Locating different web servers initially requires separate DNS look-ups to acquire the IP address of each server and establishing a TCP connection with that server, 
    • content may be cached at the local machine via the browser cache or may be resident at a node physically closer to the web client if the web site uses a caching engine or Content Delivery Network (CDN) like Akamai.
In general, caching leads to a highly non-uniform distribution of round trip times. Those Http objects that can be cached effectively exhibit one set of round trip times, based on where the cache hit is resolved (local disk or CDN), while objects that require network access yield a different distribution. A pattern that is encountered frequently is an HTML reference to a third party Ad server, which seems like it is often the last and slowest HTML reference to be resolved. The Ad servers from Google, Amazon and others not only know who you are and where you are (in the case of a phone that is equipped with GPS), they also have access to your recent web browser activity so they are likely to have some knowledge of what advertising content you might be interested in. Figuring out just what is the best ad to serve up to you at any point time may necessitate a substantial amount of processing back at that 3rd party ad server site, all of which delays the generation of the Response message the web page is waiting on.

  • Many web sites rely on asynchronous operations, including AJAX techniques, to download content without blocking the UI. AJAX is an acronym that stands for Asynchronous JavaScript and XML, and it is a collection of techniques for making asynchronous calls to web services, instead of synchronous HTTP GET and POST Requests. Since the tool does not attempt to execute any of the web page’s JavaScript, any use of AJAX techniques on the page is opaque to YSlow.

AJAX refers to the capability of JavaScript code executing in the browser to issue asynchronous XMLHttpRequest methods calls to web services, acquire content, and update the DOM, all while the web page remains in its loaded state. AJAX techniques are designed to try to hide network latency and are sometimes used to give the web page interactive features that mimic desktop applications. A popular example of AJAX techniques in action are textboxes with an auto-completion capability, familiar from Search engines and elsewhere. As you start to type in the textbox, the web application prompts you to complete the phrase you are typing automatically, which says you some keystrokes.

A typical autocompletion textbox works as follows. As the customer starts to type into a textbox, a snippet of JavaScript code grabs the first few keyboard input characters and passes that partial string of data to a web service. The web service processes the string, performs a table or database look-up to find the best matches against the entry data, which it returns to the web client. In a JavaScript callback routine that is executed when the web service sends the reply, another snippet of script code executes to display to modify the DOM and display the results returned by the web service. These AJAX callback routines typically do update the DOM, but the asynchronous aspect of the XMLHttpRequest means the state of the page remains loaded and the all the input controls on the page remain in the ready state. To be effective, both the client script and the web service processing the asynchronous call must handle the work very quickly to preserve the illusion of rapid interactive response. To ensure that the asynchronous process is executed smoothly, both the XMLHttpRequest and the web service response messages should be small enough to fit into single network packets, for example. 

Nothing on this short list of its major deficiencies diminishes the benefits of using the YSlow tool to learn about why your web page might be taking too long to load. These limitations of the YSlow approach to improve web page responsiveness do provide motivation for tools to augment YSlow by actually measuring and reporting page load time. I will take a look at those kinds of web application performance tools next.


Popular posts from this blog

How Windows performance counters are affected by running under VMware ESX

This post is a prequel to a recent one on correcting the Process(*)\% Processor Time counters on a Windows guest machine.

To assess the overall impact of the VMware virtualization environment on the accuracy of the performance measurements available for Windows guest machines, it is necessary to first understand how VMware affects the clocks and timers that are available on the guest machine. Basically, VMware virtualizes all calls made from the guest OS to hardware-based clock and timer services on the VMware Host. A VMware white paper entitled “Timekeeping in VMware Virtual Machines” contains an extended discussion of the clock and timer distortions that occur in Windows guest machines when there are virtual machine scheduling delays. These clock and timer services distortions, in turn, cause distortion among a considerably large set of Windows performance counters, depending on the specific type of performance counter. (The different types of performance counters are described here

“There’s a lot more to running a starship than answering a lot of fool questions.”

Continuing a series of blog posts on “expert” computer Performance rules, I am reminded of something Captain James T. Kirk, commander of the starship Enterprise, once said in an old Star Trek episode: “There’s a lot more to running a starship than answering a lot of fool questions.” Star Trek, The Original Series. Episode: The Deadly Years. Season 2, Episode 12. See For some reason, the idea that the rote application of some set of rules derived by a domain “expert” can suffice in computer performance analysis has great sway. At the risk of beating a dead horse, I want to highlight another example of a performance Rule you are likely to face, and, in the process, discuss why there is a whole lot more to applying it than might be obvious at first glance. There happens to be a lot more to computer performance analysis than the rote evaluation of some set of well-formed performance rules. It ought to be apparent by now that I …

Virtual memory management in VMware: memory ballooning

This is a continuation of a series of blog posts on VMware memory management. The previous post in the series is here.

Ballooning is a complicated topic, so bear with me if this post is much longer than the previous ones in this series.

As described earlier, VMware installs a balloon driver inside the guest OS and signals the driver to begin to “inflate” when it begins to encounter contention for machine memory, defined as the amount of free machine memory available for new guest machine allocation requests dropping below 6%. In the benchmark example I am discussing here, the Memory Usage counter rose to 98% allocation levels and remained there for duration of the test while all four virtual guest machines were active.

Figure 7, which shows the guest machine Memory Granted counter for each guest, with an overlay showing the value of the Memory State counter reported at the end of each one-minute measurement interval, should help to clarify the state of VMware memory-managemen…