

Caching Strategies for High Performance


In the fall of 2018 I had the opportunity for the first time to teach a seminar in Performance Engineering to graduate students at the University of Washington's Allen School of Computer Science and Engineering (CSE) in Seattle, near where I live. Conceptually, I built the class around the investigation of web application performance that I originally published on this blog beginning here. In structuring the content of the class, I was also fortunate to be able to take advantage of Andre Bondi's very readable textbook, Foundations of Software and System Performance Engineering. Dr. Bondi's excellent book is informed by his academic background in analytic queuing modeling, along with extensive professional experience with performance stress testing, areas of expertise that complement my own background and experience, with its emphasis on instrumentation, measurement tools, and empirical studies of hardware and application performance. If you are interested in seeing more about that specific class, which I hope to be able to continue to teach, you can access the UW public web site where the slide presentations I used in that edition of CSE 590 are available. The in-class lectures were live-streamed to another location across town at the Microsoft campus. For those who might be interested in sitting through them, videos are also posted, if you follow the above link.

A factor that considerably complicates the performance and scalability of web-based applications is the extensive use of caching technology -- in web applications, caching is used by the web client and the web servers, and is further amplified by the content caching engines employed at the edge of the IP network connecting the clients and web servers. Dealing with multiple levels of caching in web applications is actually quite complicated, yet it is something that most web application performance tooling does not help with as much as it should. When I began to think about lecturing on the use of caching in web applications, I discovered that the available documentation and training materials discuss the use of these different caches in a very ad hoc fashion, whereas the approach I try to instill is thoroughly empirical and analytic. This situation was disappointing. There is a wealth of academic literature on the subject, of course, but it tends to concentrate on a specific type of cache implementation -- e.g., the caches inside the processor, or virtual memory management and paging -- while avoiding a more general treatment of the subject. Over the course of my professional career, when I worked on problems related to disk caching, memory management in various guises, or hierarchically-managed storage systems, that work was informed by principles that were originally uncovered when virtual memory and paging were introduced, or when processors began to incorporate micro-programming and pipelined instruction execution, which begat caching.

After finishing up with the Performance Engineering class last December, I began to assemble some of my notes on the subject into what I hope is a deep-dive into cache management for developers, performance analysts, CS students, and like-minded researchers. Following past practice, I thought I would publish some of the written material here first to make it more broadly available.

Accordingly, this is the first in a series of posts on the use of caching technology to accelerate hardware and application performance, with a focus on general principles and how they apply to processor caches, virtual memory, disk and other digital storage, databases, and multi-tiered web applications.

.   .   .

Introduction.

Caches are frequently employed to accelerate the performance of computer hardware, software, and applications. For example, CPU hardware makes extensive use of caching, and cache is fundamental to the performance of the multi-level memory hierarchy that most processor hardware employs. Caching facilities are just as important inside database management systems (DBMS) where they play a critical role in accelerating the delivery of the data those systems manage. Caches are also used to improve the performance of both file systems and the magnetic disk subsystems where data files are stored. The facilities used in modern e-mail servers that automatically archive older, inactive messages, migrating them to less expensive media for long term storage, are another example of cache techniques in action.

Despite caches being so common, guidance about how to use them effectively is often not available. For example, caches are critical to the performance of web applications, where they hide the latency associated with long-distance network communication. Given the pervasiveness of caching in web-based applications and its influence on web application performance, it is disconcerting to find that a systematic approach to analyzing cache effectiveness is missing from the practices and procedures emphasized by many experts in web application performance.

Based on experience with cache technology in a variety of situations, some general principles emerge that apply to most situations where caching strategies are used. We focus here on understanding when and under what circumstances caches can be used effectively, as well as how to measure cache effectiveness, topics that should be especially helpful to developers responsible for building their own caching technology. We will also encounter the cache coherence problem that arises whenever the underlying data being cached is subject to changes or updates. In addition, I will discuss many of the unique issues that arise with specific cache implementations, ranging across processor caches, virtual memory management, database management systems, and the web, emphasizing the solutions developed to address these problem areas.

What is a cache?


The dictionary defines the word cache as a storage location, derived from the French verb cacher, which means “to store” or, more specifically, “to hide.” The word is most frequently used in contexts where something of value is being hidden for use at a later time. When the term is used in the world of computer technology, “cache” retains the connotation of a hidden storage location that is desirable to use because it is faster than the memory locations where the data resides permanently. In computer-speak, a cache refers to an architectural feature of either hardware or software that functions as a small, fast storage facility whose operation is transparent to the program or programs that access it. This means the application requesting data does not and cannot address the cache directly; instead, behind the scenes, the storage subsystem automatically diverts the request to use the cached copy, if it exists. Bound together as a single entity, the back store and any cache that manages a subset of the information resident there form a seamless storage hierarchy. This series of blog posts explores such memory and storage hierarchies – their design and, crucially, their performance.

There are a number of fundamental cache management functions, usually performed by the facility responsible for managing the main storage component. These include the following (a minimal code sketch of these functions follows the list):


  • determining the subset of backing store data that will reside in the cache,
  • searching the cache, i.e., finding the information that resides in the cache and retrieving it directly without having to access the back store,
  • implementing a replacement policy that is used to manage the contents of the cache dynamically based on the frequency of access, and
  • maintaining the integrity of the information stored in the back store when data subject to caching is written or updated.
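To make these four responsibilities concrete, here is a minimal interface sketch. It is purely illustrative: the class and method names (CacheManager, admit, lookup, evict, write) are my own shorthand for the functions listed above, not the API of any particular cache implementation.

```python
from abc import ABC, abstractmethod

class CacheManager(ABC):
    """A minimal sketch of the four cache management functions listed above.
    The class and method names are illustrative, not taken from any real system."""

    @abstractmethod
    def admit(self, key, value):
        """Decide which backing-store objects are placed (staged) into the cache."""

    @abstractmethod
    def lookup(self, key):
        """Search: return the cached copy if it is resident, or None on a miss."""

    @abstractmethod
    def evict(self):
        """Replacement policy: choose a resident object to age out when space is needed."""

    @abstractmethod
    def write(self, key, value):
        """Preserve back-store integrity when cached data is written or updated."""
```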


In keeping with the design goal that operation of the cache remain transparent to the primary application accessing the information subject to caching, these cache management functions are not normally visible directly. Cache management tasks are usually performed in the background by the same facility that is responsible for managing the back store. The management of the different caches employed in CPU hardware is a good example of this practice. Managing the caches designed to accelerate access to information stored in the computer’s Random Access Memory (RAM) is a CPU hardware function.

Storage or memory hierarchies incorporate two (or more) kinds of storage hardware that have marked differences in their relative cost and performance into a single storage subsystem, accessed via a single, uniform addressing scheme. Latency is the term engineers use when they refer to the access speeds of different memory and storage components. The smaller, faster cache component is usually a more expensive storage component, from the standpoint of the cost per bit of digital data, than the more expansive backing store. In general, caching is designed to allow the hierarchical memory subsystem to achieve (1) performance near the level of the smaller, faster component at (2) an overall cost closer to that of the more expansive, but less expensive backing storage.
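A quick back-of-the-envelope calculation illustrates the cost side of this design goal. The capacities and prices below are hypothetical, chosen only to show the arithmetic:

```python
# Hypothetical capacities and prices, for illustration only.
cache_gb, cache_cost_per_gb = 32, 5.00        # small, fast, expensive component
store_gb, store_cost_per_gb = 16_384, 0.05    # large, slow, inexpensive backing store

total_cost = cache_gb * cache_cost_per_gb + store_gb * store_cost_per_gb
blended_cost_per_gb = total_cost / store_gb   # the cache only mirrors a subset of the store

print(f"blended cost: ${blended_cost_per_gb:.3f}/GB, "
      f"back store alone: ${store_cost_per_gb:.2f}/GB, cache alone: ${cache_cost_per_gb:.2f}/GB")
# => the hierarchy's cost per GB lands much closer to the backing store's than to the cache's.
```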

Figure 1 is a block diagram illustrating the architecture of a simple, two-level memory hierarchy. It shows a small, high-speed cache component associated with a larger, primary Object back store:

Figure 1. The architecture of a dynamically-managed, memory hierarchy, showing the primary Object storage, the smaller, faster access cache component, and a directory that maintains an inventory of the current contents of the cache.

The drawing illustrates several key aspects of a memory hierarchy that utilizes a cache. Because it is smaller, the cache contains only a proper subset of the information that resides in the back store. To support a look-up or Search function, it is necessary to understand what particular subset of the primary Object store can currently be found in the cache – the drawing shows a portion of the cache that is set aside as a directory containing the information required to inventory the current cache contents. This is a common approach, but by no means the only way to support a Search function. Another key point is that for an object that is currently stored in the cache, the cached version is identical bit for bit to the same object that resides in the primary Object store. The cache coherence problem, mentioned above, is associated with the need to maintain this equivalence in the face of requests to add or change the information stored in the hierarchy. Maintaining the data integrity of the memory hierarchy becomes more complicated when there are multiple levels of cache and/or data is distributed across multiple caches.
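In its simplest form, the directory is just a map from an object's address in the back store to the cache slot currently holding a copy. A dictionary is enough to convey the idea; the addresses and slot numbers below are made up:

```python
# Hypothetical directory: back-store address -> cache slot holding a copy of that object.
directory = {0x2000: 3, 0x2040: 17, 0x9F80: 4}

def search(address):
    """Return the cache slot for `address`, or None if the object is not cached."""
    return directory.get(address)

print(search(0x2040))   # 17   -> hit: the copy in slot 17 is bit-for-bit identical to the back store
print(search(0x5000))   # None -> miss: the request must be satisfied from the back store
```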

Cache operations. 

The operation of the cache being transparent means that the program or application accessing the back store only understands how to address the data that resides in the memory hierarchy logically without any regard for whether or not it currently physically resides in the cache. The application simply requests the data using its logical address. When the memory hierarchy can serve up the object requested directly from the cache, the operation is a cache hit. If the object requested is not found in the cache, but must be retrieved from the slower, backing store, the operation is a cache miss. The reduction in latency associated with an operation that results in a cache hit is the principal performance benefit of a cached architecture.
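The performance benefit of a hit can be quantified as an average access time weighted by the hit ratio. The latencies used below are hypothetical, but the shape of the result is general: the hierarchy only approaches cache speed when the hit ratio is very high, which is why measuring cache effectiveness is a recurring theme in this series.

```python
def average_access_time(hit_ratio, hit_time_ns, miss_time_ns):
    """Expected latency per request: hits are served at cache speed,
    misses pay the full trip to the backing store."""
    return hit_ratio * hit_time_ns + (1.0 - hit_ratio) * miss_time_ns

# Hypothetical latencies: 100 ns for a cache hit, 10,000 ns for a miss.
for hit_ratio in (0.50, 0.90, 0.99):
    t = average_access_time(hit_ratio, 100, 10_000)
    print(f"hit ratio {hit_ratio:.0%}: average latency {t:,.0f} ns")
# hit ratio 50%: average latency 5,050 ns
# hit ratio 90%: average latency 1,090 ns
# hit ratio 99%: average latency 199 ns
```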

It is important to distinguish cache operations for Reads and for Writes, which require different handling. Writes, depending on how they are handled, also can have different service times than Reads. 

Reads. A Read request that is a cache hit references data that was previously staged to the cache, providing a clear performance benefit. Read hits are the operations a cache performs fastest: a Search operation of some kind determines that the data requested resides in the cache; then, once the data is located, it is immediately accessed and returned.

Compare the timing of a Read miss, where there is a similar Search operation, but this time the cache manager fails to find the data requested in the cache. So, on a cache miss, a Read operation to the back-store is initiated to acquire the data and stage it to cache. Read misses take longer than Read hits, and Read miss processing may even take longer than reading directly from the back-store. This occurs when the unit of data that is transferred from the back-store to the cache – let’s denote this as the staging unit – is significantly larger than the amount of data actually requested. When the staging unit is larger than the unit of information specifically requested, the staging operation needs more time to transfer the larger amount of data.

Frequently, the unit of staging used to transfer information between the back store and the cache is larger than the size of a typical request. Consider, for example, a Load instruction being executed by the processor that requests 4 bytes of data beginning at a specified memory location, where the bits stored there will be interpreted as an integer value. When the processor determines that these 4 bytes are not currently resident in the cache, it fetches the entire cache line containing those specific memory locations. The processor may even need to fetch the next cache line from memory if the 4-byte field crosses a cache line boundary. The processor must wait until the staging operation for the entire cache line completes before it is able to complete the execution of the original instruction. During the staging operation, memory is accessed and the data those memory locations contain is transferred across the memory bus to the processor cache. The memory bus is a resource that can be subject to contention delays if there are too many concurrent requests. Similarly, with virtual memory, when a Load instruction references an operand in a virtual memory page that does not currently reside in physical memory – the virtual memory equivalent of a cache Read miss – the resulting page fault causes the operating system to fetch a full page of data, normally 4K bytes, from the paging file on disk, not just the 4 bytes actually referenced.
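A short calculation shows how the staging unit relates to the request. Assuming a 64-byte cache line (a common size, but an assumption here), the sketch below reports which line or lines a 4-byte Load touches:

```python
LINE_SIZE = 64   # bytes per cache line; assumed for illustration

def lines_touched(address, length=4):
    """Return the cache line numbers that a `length`-byte request starting at `address` touches."""
    first_line = address // LINE_SIZE
    last_line = (address + length - 1) // LINE_SIZE
    return list(range(first_line, last_line + 1))

print(lines_touched(0x1000))   # [64]     -> the 4-byte field sits inside a single line;
                               #             one 64-byte staging operation for a 4-byte request
print(lines_touched(0x103E))   # [64, 65] -> the field straddles a line boundary,
                               #             so two full lines must be staged
```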

Writes. Writes that are cache misses can be handled in one of two ways. The simplest approach is to apply the update directly to the back-store, ignoring the cache. This is also known as a Write-Around policy. The second method, which is called Write-Through for reasons that become apparent in a moment, accesses the back store to bring the memory location requested into the cache first, where the update is then applied. The effect of a Write-Through policy for misses is to transform Write misses into Write hits, which are then handled in one of two different ways: Write-Through or deferred Write-Back.

Essentially, for Write hits, the back-store can be updated immediately (Write-Through), or, instead, a lazy write policy can be implemented that defers the back-store update until a later point in time (which corresponds to the deferred Write-Back policy). Under the deferred Write-Back policy it is OK to defer the update, but the update must ultimately be applied; otherwise, the data on the back-store is inconsistent with the Requestor’s view of that data. Note that the Requestor receives status back from the storage subsystem immediately, indicating that the Write request was handled successfully, even though the update of the back-store has not yet taken place. In effect, the notification serves as a guarantee that the update will be applied to the back-store eventually. I will have lots more to say about this cache coherence problem, which arises whenever it is possible for multiple Requestors to request access to a common back-store entity, in a later section. In order for the cache to be truly transparent, it is necessary to maintain a single, consistent (i.e., coherent) view of the data stored in the back-store.
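The Write paths described above, and summarized in Table 1 below, can be captured in a few lines of Python. This is only a sketch of the control flow under stated assumptions: the cache and back store are modeled as plain dictionaries, and the eventual flushing of deferred (dirty) updates is omitted.

```python
def handle_write(cache, back_store, key, value,
                 miss_policy="write-around", hit_policy="write-through"):
    """Control-flow sketch of the Write policies described in the text.
    `cache` and `back_store` are modeled as plain dicts; deferred flushing is omitted."""
    if key not in cache:                        # --- Write miss ---
        if miss_policy == "write-around":
            back_store[key] = value             # update the back store directly; skip the cache
            return
        cache[key] = back_store.get(key)        # stage the block, turning the miss into a hit
    # --- Write hit (or a miss just converted into one) ---
    if hit_policy == "write-around":
        back_store[key] = value
        cache.pop(key, None)                    # invalidate the now-stale cached copy
    elif hit_policy == "write-through":
        cache[key] = value
        back_store[key] = value                 # cache and back store updated together
    else:                                       # deferred write-back ("lazy write")
        cache[key] = value                      # back-store update deferred; the block is now dirty
```

Note how pairing a Write-Around hit policy with a miss policy that stages blocks would stage data only to invalidate it immediately, which is why Table 1 pairs Write-Around for hits with Write-Around for misses.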

Figure 2 illustrates the Write-Through policy for Write hits; the deferred Write-Back variant is discussed, and illustrated in Figure 4, below. In addition, Table 1 shows how the policy used for Write misses determines the options available to use for Write hit operations. Each policy leads to a different set of physical operations, all of which have performance implications, a topic explored further below.

Figure 2. The Write-Through cache policy applies an update to both the copy kept in cache and the back store concurrently. When the Write-Through operation completes, the contents of the cache match the contents of the back-store.

When the Write-Through policy for cache hits is used, as illustrated above in Figure 2, the cache and the back store are updated in tandem. One possible performance optimization is to perform the cache and back-store updates in parallel, assuming parallel paths exist that can support concurrent operations. The operation can only be considered complete when the back-store update finishes, which means the service times for Write hits under this policy are similar to those for Read misses, unless Writes to the back-store have considerably longer latency, which does occur with some storage media. Using Write-Through, once the back-store operation completes, the contents of the cache continue to mirror the original data kept in the back-store. The Write-Through policy maintains the integrity of the data resident on the back-store, but it does sacrifice performance. Because the item in the cache is updated during the Write operation, any subsequent Reads or Writes against the same block can continue to be handled as cache hits.

Write-Around applies the update directly to the back-store, but does not change the contents of the cache that are subject to the update. Instead, the cache entry that points to the slot in the cache where the data resides is flagged as invalid; if there is a subsequent Read request for the same location, this Read will miss the cache and require another staging operation to cache the item again. See Figure 3. As indicated in Table 1, Write-Around on a Write hit is usually paired with the Write-Around policy for cache misses, where, likewise, a Write that misses the cache does not trigger a staging operation.

Figure 3. The Write-Around cache policy. If the Request is a Write hit, that item in cache is flagged as invalid.

Finally, there is the deferred Write-Back approach. With deferred Write-Back, the cache entry is updated immediately, and it is flagged as a “dirty” block, indicating that the back store contains data that is no longer current. However, the back store is not updated at this time. Waiting to apply the change or update to the back-store allows the Write request to be handled as a cache hit, which results in a clear performance benefit. The “dirty” status is an indicator that tells the cache manager that before it is safe to remove the cache entry for that “dirty” block, the back-store requires an update. This update is pending until a Write operation occurs that updates the corresponding item in the back store, at which point the “dirty” status of a block in cache is cleared.

In Write-Back, the cache manager makes a guarantee that this pending update will occur at some point; otherwise, there is an unacceptable loss of integrity for the information stored in the storage hierarchy. For instance, in a shared memory multiprocessor – by far, the most common computing architecture in use today – the entire contents of memory are visible to each processor core and its dedicated set of caches. In this situation, deferring memory updates for the sake of performance must not jeopardize the integrity and consistency of the data stored in RAM. This is known in hardware circles as the cache coherence problem, which was alluded to above. Whenever a cache uses a deferred Write-Back policy, data integrity is compromised if the pending update is not properly applied.

The deferred Write-Back policy is sometimes referred to as “lazy write” or “lazy evaluation.” The major benefit of the lazy-evaluation approach is performance under a very specific scenario, and how effective the Write-Back policy is depends on how frequently this scenario occurs. The scenario that benefits from deferred Write-Back is that a subsequent Read or Write of the same item in the cache can be handled as a hit because the data that resides in cache is current. The “lazy” aspect of lazy write provides another performance benefit when the same item (or block) is updated repeatedly. A subsequent Write request to the same item or block in the cache supersedes the original deferred update, rendering it unnecessary. This can be a significant savings when the access pattern is one that repeatedly Reads and Updates the same item.

In fact, this pattern of Write-then-Read memory access is not uncommon. For instance, it is used with locks and latches, data structures that secure shared memory locations in order to synchronize the actions of concurrent processing threads on shared memory multiprocessors. Moreover, in file systems and databases, checkpoint/restart and other status records that are used to guarantee data integrity for update requests use a similar access pattern. Deferred Write-Back caches outperform other approaches for both of these specific scenarios. Because these use cases are associated with data consistency and integrity, these scenarios are also ones where the guarantee that the cache provides to preserve data consistency is singularly important.

Figure 4. Deferred Write-back cache, also referred to as "lazy write."
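A simple counting model makes the benefit concrete. For an access pattern that repeatedly updates the same block – the lock-and-latch pattern described above – deferred Write-Back collapses the stream of updates into a single back-store write when the dirty block is eventually flushed. The model below is illustrative only; it ignores capacity, eviction, and staging costs.

```python
def back_store_writes(accesses, hit_policy):
    """Count back-store writes for a stream of (op, block) accesses.
    A counting model only: capacity, eviction, and staging are ignored."""
    dirty, writes = set(), 0
    for op, block in accesses:
        if op == "write":
            if hit_policy == "write-through":
                writes += 1          # every update goes straight through to the back store
            else:                    # deferred write-back: just mark the block dirty
                dirty.add(block)
    return writes + len(dirty)       # dirty blocks are flushed to the back store once, at the end

# A lock-like pattern: the same block is read and then updated, over and over.
accesses = [("read", "lock"), ("write", "lock")] * 1000
print(back_store_writes(accesses, "write-through"))   # 1000 back-store writes
print(back_store_writes(accesses, "write-back"))      # 1 back-store write
```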


Ensuring that a deferred update is ultimately applied is a critical aspect of the Write-Back policy. A common approach is to defer the write operation that updates the back-store as long as possible. Deferring updates as long as possible, however, creates additional risk. For example, if write operations are pending when the command to shut down the storage subsystem is issued, it is necessary to delay the execution of the shutdown command until the cache manager has finished pushing all pending changes from the cache to the back-store. A refinement that mitigates some of the risk associated with deferring writes indefinitely establishes a threshold for the number of pending write operations that are allowed – once this threshold is reached, the cache manager immediately initiates back-end write operations to limit the exposure to data loss.
An alternative approach is to delay the Write to the back-store only until the cache manager finds what it considers to be an opportune time to perform the operation, for instance, when the communication channel to the back-store is idle and there are no other back-store requests pending or in-flight.

Another approach, used in database systems, is to defer the Write no longer than a specified interval of time in order to ensure, for example, that a distributed database will eventually become consistent enough to support a repeatable Read operation. In a distributed database system where data is dispersed among nodes residing in multiple data centers, potentially distributed around the globe, it is common practice for the service provider to issue a service level guarantee that the database will reach a consistent state within, for example, 15 minutes after an update Request has been processed at any one of the nodes. In practice, making the guarantee in contractual language is one thing; making it work under every conceivable situation and circumstance, including loss of connectivity and other failures that can result in data loss, is quite difficult. (In distributed database circles, this problem is discussed as the CAP Theorem, where CAP is an acronym for Consistency, Availability and Partition-tolerance. See Gilbert and Lynch's article entitled "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services," first published in ACM SIGACT News in June 2002, to dig further into this topic.)
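Both refinements – the pending-write threshold described above and the time-bounded deferral used by some database systems – amount to conditions that force a flush of the deferred updates. The sketch below models that idea; the threshold of 64 pending writes and the 15-minute maximum age are arbitrary illustrative values, not parameters of any real product.

```python
import time

class DeferredWriteBuffer:
    """Toy model of a deferred Write-Back queue: flush either when too many writes
    are pending (the threshold refinement) or when the oldest deferred write exceeds
    a maximum age (the time-bounded guarantee). All parameter values are illustrative."""

    def __init__(self, back_store, max_pending=64, max_age_seconds=15 * 60):
        self.back_store = back_store
        self.pending = {}              # block -> (latest value, time the first update was deferred)
        self.max_pending = max_pending
        self.max_age = max_age_seconds

    def write(self, block, value):
        first_deferred = self.pending.get(block, (None, time.monotonic()))[1]
        self.pending[block] = (value, first_deferred)   # a later write supersedes the earlier one
        self.maybe_flush()

    def maybe_flush(self):
        too_many = len(self.pending) >= self.max_pending
        oldest = min((t for _, t in self.pending.values()), default=time.monotonic())
        too_old = (time.monotonic() - oldest) >= self.max_age
        if too_many or too_old:
            self.flush()

    def flush(self):
        for block, (value, _) in self.pending.items():
            self.back_store[block] = value   # apply every deferred update to the back store
        self.pending.clear()
```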

Table 1. How the Write miss policy pairs with the Write hit policy options (✓ marks the combinations used).

                           Cache Miss Policy
  Cache Hit Policy      Write-Around    Write-Through
  Write-Around               ✓
  Write-Through                               ✓
  Write-Back                                  ✓

Cache Management

The objects retained in the cache are managed dynamically – i.e., the contents of the cache are continuously adjusted, subjecting them to constant change, based on the underlying access pattern. The principle governing dynamic management of the cache contents is based on access to the cache being considerably faster than access to the slower back store. Consequently, the cache will reduce overall average latency if objects that are frequently referenced are retained in the cache. Meanwhile, because the size of the cache is limited compared to the size of the primary object storage, objects that are infrequently referenced need to be aged out of the cache. We will investigate the performance of the cache management techniques that are used to try to achieve this operational goal a little later in this series of articles. We will also discuss some known anti-patterns – specific patterns of data access that defeat caching schemes or degrade their performance.

A memory hierarchy requires some form of dynamic management of its contents simply because the storage capacity of the cache is smaller than the capacity of the backing store, often considerably smaller. Thus, the cache can only retain a subset of the objects that reside in the full inventory of the backing store. Deciding what to do when the cache is full – a condition that is inevitable given the disparity in storage capacities between the two components – and the memory hierarchy receives a request for a new object that is not currently resident in the cache is the job of what is known as the cache replacement policy. Most caches implement a replacement policy that attempts to identify those objects in the cache that have not been used recently and replace them with newer objects. This policy is known as Least Recently Used, abbreviated as LRU, and is widely regarded as near optimal. We will analyze the LRU replacement policy and review several alternatives that approximate the LRU algorithm, but have more desirable run-time characteristics.
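An LRU policy is easy to sketch with Python's OrderedDict, which keeps entries in the order they were last referenced. This is a teaching model only; it is not how processor or DBMS caches implement replacement, and real implementations typically settle for the cheaper approximations discussed later in the series.

```python
from collections import OrderedDict

class LRUCache:
    """Least Recently Used replacement policy, sketched for illustration.
    The OrderedDict doubles as both the cache storage and its directory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, key):
        if key not in self.entries:
            return None                        # cache miss
        self.entries.move_to_end(key)          # a hit makes this entry the most recently used
        return self.entries[key]

    def insert(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
```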

A successful cache management policy is one that is adaptive, adjusting to any changes that occur in the access pattern. The existence of object access anti-patterns that defeat a particular cache management policy is troublesome. One technique for dealing with these problematic workloads is extending the cache management policy to recognize common anti-patterns and adjust the management policy accordingly. We will look at some examples of this later. Another approach is to support “hints” from the application that inform the cache manager to expect a particular pattern of access governing future requests. For the most part, where application programs can supply hints to the cache manager about their intent, the strategy has proven much less successful than its proponents expected. Among the reasons hints falter are that such knowledge is difficult to come by, and that the hints themselves can be misleading. Consider an application program that is diverted from its normal path of execution when it encounters an anomaly or an error condition that occurs infrequently. In the face of such unpredictability, dynamically managing the contents of the cache based on actual access patterns not only performs reasonably well, but is capable of recognizing changes in access behavior when they occur and adapting to them.

Another drawback to the application supplying “hints” to the cache manager is that it transgresses against the architectural principle of transparent access to the cache. Having the application supply hints to the cache manager regarding future object access patterns also creates what may be an undesirable dependency on the presence of a particular hardware or system feature. The one area where the use of application-oriented “hints” to the cache manager continues to be viable is in database access, where particular Queries can be annotated to inform the DBMS about their priority and other salient aspects of the request that impact performance. For the most part, using hints to the DBMS cache manager is limited to Database Administrators who maintain and support an authoritative set of database access routines and who themselves specialize in the architecture of a specific DBMS technology such as Oracle or Microsoft SQL Server.

More common than using hints is an application development strategy that is cache-aware, recognizing just how crucial using the cache effectively is to achieving performance objectives. After diagnosing that an under-performing application is suffering due to cache management, it is worthwhile to try to modify it to re-order object access in a manner that exploits the cache or caches that are available rather than defeating them. Compiler optimization options that control object placement on cache line boundaries are a good example of this approach. The expertise required to use these options effectively, however, tends to be confined to a few very experienced compiler developers and senior application developers with a good understanding of the topics discussed here.

Summary.

The size of the cache relative to the size of the backing store emerges as one of the critical factors in understanding cache performance. How much faster access to data stored in the cache is, compared to accessing data from the backing store, is another. Additional considerations that we will explore below include 
  • how data corresponding to requests that miss the cache come to be staged there for future reference, and 
  • how to handle Write requests, including updates in place applied to data already present in the cache.
In summary, the cache is the smaller, faster component of a hierarchically structured storage subsystem, where the larger, slower component of the memory hierarchy is known as the back store. Since the cache has less storage capacity than the back store, it holds only a subset of the data set that is available in the back store. An important cache management function is maintaining a directory of the subset of data that can be found in the cache to support Search operations. Additional cache management functions include implementing the replacement policy that is employed when the cache is full. Finally, maintaining the integrity of the data stored in the cache when writes or updates occur is a critical function.

This is the end of Part One of this article on caching technology. 
