
Analyzing HTTP network traffic: Why is this web app running slowly, Part 7.

This is a continuation of a series of blog entries on this topic. The series starts here.

Since HTTP is a wire protocol built on top of TCP/IP, the network packet sniffer technology that is widely used in network diagnostics and performance optimization is readily adapted to measuring web browser Page Load time. Network sniffers like Wireshark can intercept and capture all the HTTP traffic and are typically configured to gather related network events, such as DNS look-ups. It is easy to get overwhelmed with all the information that these network diagnostic tools provide. Often software developers prefer network tools that are more focused on the HTTP protocol and the page composition process associated with assembling the DOM and rendering it in the browser. The Developer Tools that ship with the major web browsers include performance tools that measure Page Load time and help you diagnose why your page is slow to load. These tools work by analyzing the network packets that the web client sends and the packets it receives in reply.

Many web application developers prefer the developer tools found in Google’s Chrome for diagnosing performance problems in web applications, even though many authorities think recent versions of Microsoft’s Internet Explorer are keeping pace. Now that Steve Souders is working on the team that develops the Chrome developer tools, Chrome has also added a tool called PageSpeed, which is functionally very similar to the original YSlow application. If you are using Chrome as a browser, you can navigate to the Developer Tools by clicking the “Customize and control Chrome” button on the right-hand edge of the Chrome menu bar. Then select the “Tools” menu and click “Developer Tools.” PageSpeed is one of the menu options on Chrome's Developer Tools menu.

Let’s take a look at both PageSpeed and the network-oriented performance tool from the suite of Developer Tools that come with Chrome, using the US Amazon home page as an example. We will also get to see the degree to which Amazon, reportedly the most popular e-commerce site in the US, embraces the YSlow performance rules.

Figure 6 shows a view of the Chrome PageSpeed tool after I requested an analysis of the Amazon home page. The first observation is that PageSpeed does not espouse the simple YSlow prime directive to “Make fewer HTTP requests.” This is a huge change in philosophy since Souders originally developed YSlow, of course, reflecting some of the concerns I mentioned earlier with how well the original YSlow scalability model reflects the reality of modern web applications.

Figure 6. Using Chrome’s PageSpeed tool to understand why the Amazon home page takes so long to load.

While the rule set varies a bit, reflecting Souders’ wider and deeper experience in web application performance, PageSpeed is otherwise identical in operation to YSlow. The Chrome version of the tool re-loads your page and inventories the DOM after the page has been re-built. In the example PageSpeed screenshot, I focused on one of the important tuning rules in PageSpeed that has an identical counterpart in YSlow, namely “minimize request size.” PageSpeed improves on the usability of YSlow by clearly identifying the HTTP GET requests that triggered the rule violation. Here PageSpeed reports the 4 requests that generated Response messages exceeding 1500 bytes, and thus required more than one network packet from the web site in response.
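The 1500-byte threshold corresponds roughly to the Ethernet MTU: once IP and TCP headers are subtracted, a response larger than one MTU-sized payload must be split across multiple packets. A minimal sketch of that packet arithmetic (the URLs and sizes below are illustrative, not Amazon's actual responses):

```python
import math

# Typical TCP payload per Ethernet frame: 1500-byte MTU minus ~40 bytes
# of IP and TCP headers.
MTU_PAYLOAD = 1460

def packets_needed(response_bytes):
    """Number of full-size TCP packets needed to carry a response."""
    return math.ceil(response_bytes / MTU_PAYLOAD)

def flag_multi_packet(responses):
    """Return the (url, packet_count) pairs that need more than one packet."""
    return [(url, packets_needed(size))
            for url, size in responses
            if packets_needed(size) > 1]

# Illustrative sizes only
responses = [("/", 78_000), ("/logo.png", 1_200), ("/app.js", 23_600)]
print(flag_multi_packet(responses))  # the 78 KB and 23.6 KB responses span many packets
```

Note that a 78 KB response works out to more than 50 packets by this arithmetic, which matches the packet counts discussed for the initial Amazon Response message below.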

You may also notice that the four Requests that violate the “minimize request size” rule are directed at three different web sites: one for a resource from the main Amazon site, two from what is likely an internal Amazon CDN responsible for serving files to customers in North America, and one for an advertisement served up by DoubleClick, a company that serves display ads and is owned by Google. It is common practice for web pages to be composed from content that needs to be assembled from multiple sources, including the advertising that is served up by third-party businesses like DoubleClick. While it is fashionable to assert that “information wants to be free,” the harsh reality is that developing and maintaining the enormously complex hardware and software environments that power Internet-based web applications is extremely expensive, and advertising revenue is the fuel that sustains those operations. Advertising revenue from web page viewing (and clicking) has also made Google one of the two or three most profitable Tech industry companies in the world.

For a glimpse at a performance tool that does report Page Load time measurements directly, click on the “Network” tab on the Chrome Developer Tools menu bar, which is shown in Figure 7. The Network view of Page Load time contains an entry for each GET Request that was performed in order to render the full home page and shows the time to complete each operation. You will notice at the bottom of the window that Chrome shows a total of 315 GET Requests were issued for various image files, style sheets, and JavaScript code files in order to render the Amazon Home page. In this instance, with effective use of the cache to render the Amazon Home page, the browser only took about 4.3 seconds to complete the operation. The overall Page Load time is displayed at the lower left of the window border. (When the browser cache is “cold,” loading the Amazon Home page can easily take one minute or more.)
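It is worth noting that the overall Page Load time the Network tab reports is not the sum of the 315 individual request times: because requests overlap in parallel sessions, the page load span runs from the first request's start to the last response's completion. A small sketch with made-up timings illustrates the difference:

```python
def page_load_span(requests):
    """requests: list of (start_ms, end_ms) pairs, one per GET Request.
    Overlapping (parallel) requests are counted once: the page load span
    is first start to last finish, not the sum of the durations."""
    starts = [start for start, _ in requests]
    ends = [end for _, end in requests]
    return max(ends) - min(starts)

# Three overlapping downloads: the serial sum of durations would be
# 1200 ms, but the wall-clock span is only 600 ms.
reqs = [(0, 400), (100, 500), (200, 600)]
print(page_load_span(reqs))  # 600
```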

The Timeline column at far right presents the web page composition process in time sequence, a view that has become known as a waterfall diagram. The Chrome waterfall diagram for Page Load time features a pop-up window that breaks out the time it took to load each individual component of the page. We can see that the initial GET Request to the Amazon home page returns a Response message that is about 78 KB, a payload that has to be broken into more than 50 individual packets. In the pop-up window, we see that the browser waited for 142 milliseconds before the first packet of the HTTP Response message appeared. It then took 1.55 seconds for the remaining 50 or so packets associated with that one Response message to be received. These are measurements derived from monitoring the network traffic that HTTP GET Requests and Response messages initiate.

Figure 7. A waterfall diagram in Chrome PageSpeed showing the page composition process in time sequence for the Amazon home page. The browser issued a total of 315 GET Requests to build the page. In this instance, the page composition process, aided by a warm cache and parallel download sessions, took just 2.35 seconds.

The initial HTTP Response message from the Amazon Home page serves as a kind of exoskeleton that the browser gradually fills in from the Response messages generated by the remaining 314 subsequent GET Requests that are referenced. The HTML standard permits page loading to proceed in parallel, and, as noted above, browsers can generate parallel TCP sessions for loading static content concurrently. In this instance, of course, many of the HTTP objects are available from the cache because I had just loaded the same page immediately prior to running the PageSpeed tool. 

About ten lines down in the example, there is a GET Request for a 23.6 KB image file that required 164 ms to complete. The Timeline column pop-up that breaks out the component load time indicates a separate DNS lookup that took 36 ms and a TCP session connection sequence that took 25 ms. This indicates an embedded Request for a URL that was directed to a different Amazon site, requiring a fresh DNS lookup and TCP connection. The browser then waited 32 ms for the initial Response message. Finally, it shows a 27.5 ms delay spent in the Receiving state, since the 23.6 KB Response message would require multiple packets. Because the browser supports parallel TCP sessions, however, this Request does not prevent the browser from initiating other Requests concurrently.

Page composition that requires content stored on multiple web sites, parallel browser sessions, JavaScript blocking, and the potential for extensive DOM manipulation when the JavaScript executes are features of web applications that complicate the simple YSlow scalability model derived earlier. Incorporating these additional features into the model yields a more realistic formula:

Page Load time = Browser Render time +
Script execution time +
((Round trips * RTT)/Sessions)
[equation 5]
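Expressed as code, equation 5 reads as follows; this is a sketch of the model with illustrative inputs, not a measurement tool:

```python
def page_load_time(render_time, script_time, round_trips, rtt, sessions):
    """Equation 5: Page Load time = Browser Render time
       + Script execution time + (Round trips * RTT) / Sessions.
    Times are in seconds; sessions is the number of parallel TCP sessions."""
    return render_time + script_time + (round_trips * rtt) / sessions

# Illustrative values: 0.3 s of rendering, 0.5 s of script execution,
# 40 round trips at 50 ms RTT spread over 6 parallel sessions.
print(page_load_time(0.3, 0.5, 40, 0.05, 6))
```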

While equation 5 models the web page composition better, it is still not definitive. We have already discussed some of the added complications, including:
  • RTT is not a constant factor when GET Requests to multiple web sites are issued to compose the complete page and when the impact of the browser cache and a CDN are factored into the equation, 
  • JavaScript download requests cannot proceed in parallel, and
  • JavaScript code executing inside a handler attached to the DOM’s window.load event may further modify the DOM and effectively defer the usability of the page until the script completes, but none of that script execution time is included in the Page Load Time measurements.

Before wrapping up this discussion of the limitations of the YSlow scalability model, let’s return full circle to the original web application (illustrated back in Figure 1 in the very first post in this series) that was running slowly, motivating me to explore using a tool like YSlow to figure out why. Figure 8 shows a Chrome PageSpeed waterfall diagram for the page composition process for the data-rich ASP.NET web application in the case study. (Figure 9 tells a similar story using the comparable Internet Explorer performance tool.)

Figure 8. Waterfall diagram from Chrome PageSpeed showing individual GET Requests and their delays, each of which contributes to the Page composition time. In this example for the case study app, the delay associated with resolving the initial ASP.NET GET Request accounts for about 75% of the overall page load time.

Chrome’s PageSpeed tool indicates that the DailyCharts.aspx web page took about 1.5 seconds to load, requiring a total of 20 GET Requests. These measurements were captured in a single-system test environment where the web browser, the IIS web server, and the SQL Server back-end database were all on the same machine, so network latency was minimal. Crucially, generating the 1.5 MB Response message for the original GET Request on the web server and then transferring it to the web client accounted, by itself, for about ¾ of the web application’s overall Page Load Time. In addition, GET Requests for the two high-resolution charts, which are rendered on the server as .jpg files and then referenced in the original Response message, yielded Response messages of about 275 KB and 100 KB in size.

Moreover, since Chrome PageSpeed re-executed the web page, the database queries that were executed on the web server benefitted substantially from a warm start in the SQL Server cache. When I subsequently investigated how long these database queries could take, I noted that a query like the one illustrated, which interrogates voluminous Process-level counter data stored in the repository, accessed several hundred thousand rows of data that were then sorted to select and return a result set containing the five busiest processes during a 24-hour period. Without the benefit of a SQL Server cache warm start, database query execution time alone could be on the order of 15-30 seconds.

Figure 9. A similar view of Page Load time using Internet Explorer’s developer tools. In this instance, I also installed an ASP.NET-oriented performance tool called Glimpse on the web site in order to help diagnose the performance issues. The additional requests associated with Glimpse delay page load time by another 150 ms.

Figure 9 represents another view of the page composition process, this time using Internet Explorer's version of the waterfall timing diagram, which again shows a 1.44 MB Response message generated by the ASP.NET app in response to the initial ASP.NET GET Request. Internet Explorer reports that it required 1.38 seconds to generate this Response message and transmit it to the web client. (Note that in this test environment, the web client, the IIS server, and the SQL Server back-end database all reside on the same machine, so network latency is minimal – I measured it at less than 10 microseconds.)
The initial GET Request's Response message contains href links to the high-resolution charts that were rendered on the web server as .jpg files. Resolving these links for a 235 KB main chart and an 84 KB secondary chart also impacts page load time, but these file requests are at least able to proceed in parallel.

In both Figures 8 and 9, resolving the initial ASP.NET GET Request clearly dominates the page composition process. These waterfall views of the web page composition process for this web application place the YSlow recommendations for improving page load time performance, illustrated back in Figure 2, in a radically different perspective. Instead of worrying about making fewer HTTP GET Requests, I needed to focus on why the server-side processing to generate a Response message was taking so long. In addition, I also wanted to understand why the Response message associated with the GET Request was so large, requiring almost 1.5 MB of content to be transferred from the web server to the web client, and what steps could be taken to trim it down in size. Unfortunately, web application performance tools like YSlow are basically silent on the subject of the scalability of any of your server-side components. These need to be investigated using performance tools that run on the web server.

Ultimately, for the case study, I instrumented the ASP.NET application using Scenario.Begin() and Scenario.End() method calls, which allowed me to measure how much time was being spent calling the back-end database to resolve the GET Request query. I wound up re-writing the SQL to generate the same result set in a fraction of the time. Since the database access logic was isolated in a Business Objects layer, that was a relatively straightforward fix that I was able to slip into the next maintenance release. But that quick fix still left me wondering why the initial Response messages were so large, which I investigated by using the ASP.NET page trace diagnostics facility to examine the amount of ViewState data being passed to build the various Machine and Chart selection menus. One of the menu items referenced “All Machines” for every Windows machine defined in the repository, which was a red flag right there. Addressing those aspects of the server-side application required a significant re-engineering of the application, however, which was work that I completed last year.
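The Scenario.Begin()/Scenario.End() pattern simply brackets a suspect code region with timestamps. An analogous sketch in Python as a context manager (this Scenario class is my own illustration, not the .NET Scenario instrumentation library):

```python
import time

class Scenario:
    """Minimal begin/end timer, analogous in spirit to bracketing a
    code region with Scenario.Begin() and Scenario.End() calls."""
    def __init__(self, name):
        self.name = name
        self.elapsed = None

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.name}: {self.elapsed * 1000:.1f} ms")
        return False  # never swallow exceptions

# Bracket the suspect back-end call to see how much of the GET
# Request's server-side processing time it accounts for.
with Scenario("database query"):
    time.sleep(0.05)  # stand-in for the back-end database call
```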

To conclude my discussion of the case study app, I found that the YSlow-oriented performance tools did highlight the need for me to understand why the server-side processing associated with generating the initial Response message was taking so long and also spurred me to investigate why such large Response messages were being generated. The specific grades received from applying the YSlow performance rules to the DOM were not particularly helpful, however. To resolve the performance issues that I found required using traditional server-side application performance tools, including the SQL Explain facility for understanding the database queries and the ASP.NET diagnostic trace that showed me the bloated contents of the ViewState data that is transmitted with the page to persist HTML controls between post back requests. (A post back request is a subsequent GET Request to the web server for the same dynamic HTML web page.) It turned out that much of the ViewState data embedded in the initial Response message generated by the ASP.NET app was supporting the page's menu controls.
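Flagging bloated ViewState does not strictly require the full ASP.NET page trace: the hidden __VIEWSTATE field can be pulled out of the Response body and measured directly. A rough sketch (the regex assumes the standard hidden-input markup that ASP.NET emits, and the sample page below is fabricated for illustration):

```python
import re

def viewstate_size(html):
    """Return the size in characters of the base64 __VIEWSTATE payload
    embedded in an ASP.NET Response body, or 0 if none is present."""
    match = re.search(r'id="__VIEWSTATE"\s+value="([^"]*)"', html)
    return len(match.group(1)) if match else 0

# Fabricated example: a page carrying ~120 KB of ViewState
page = '<input type="hidden" id="__VIEWSTATE" value="' + "A" * 120_000 + '" />'
print(f"ViewState is {viewstate_size(page) / 1024:.0f} KB")  # a six-figure ViewState is a red flag
```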

In practice, any web application that generates dynamic HTML that requires ample amounts of server-side resources to build those Response messages faces scalability issues with its server-side components – the database back-end, the business logic layer, or the web server front end. It should be evident from the discussion so far that performance tools like YSlow that execute on the web client and focus on the page composition process associated with the DOM are silent on any scalability concerns that may arise on the web server.
