
Monitoring SQL Server: the OS Wait stats DMV

This is the 2nd post in a series on SQL Server performance monitoring, emphasizing the use of key Dynamic Management Views (DMVs). The series starts here:

OS Waits 

The consensus among SQL Server performance experts is that the best place to start looking for performance problems is the OS Wait stats from the sys.dm_os_wait_stats DMV. Whenever it is running, the SQL Server database Engine dispatches worker threads from a queue of ready tasks that it services in round-robin fashion. (There is evidently some priority-based ordering of the queue: background tasks with lower priority defer to foreground tasks with higher priority.) The Engine records the specific wait reason for each task waiting for service in the queue and also accumulates the Wait Time (in milliseconds) for each Wait reason. These Wait and Wait Time statistics accumulate at the instance level and are reported via the sys.dm_os_wait_stats DMV.

Issuing a Query like the following on one of my SQL Server test machines:

Select Top 25 wait_type, waiting_tasks_count, wait_time_ms from sys.dm_os_wait_stats 
     Where waiting_tasks_count > 0 
     Order by waiting_tasks_count DESC

produces a result set something like the following:

wait_type                               waiting_tasks_count     wait_time_ms
MEMORY_ALLOCATION_EXT                             858270868           505937
WRITELOG                                           19720750         79441149
LOGMGR_QUEUE                                       17132347        179626053
PREEMPTIVE_XE_CALLBACKEXECUTE                       5173489             3840
ONDEMAND_TASK_QUEUE                                 4302846        179030341
DIRTY_PAGE_POLL                                     1780089        180572654
RESERVED_MEMORY_ALLOCATION_EXT                      1110294              686
PAGELATCH_EX                                         948687             8156
SP_SERVER_DIAGNOSTICS_SLEEP                          787653        180300487
SLEEP_TASK                                           660128        457787629
HADR_FILESTREAM_IOMGR_IOCOMPLETION                   360566        180575666
SOS_SCHEDULER_YIELD                                  231832             2417
LAZYWRITER_SLEEP                                     196423        180532338
PREEMPTIVE_XE_SESSIONCOMMIT                          139809             9021
ASYNC_NETWORK_IO                                     118411             9767
BROKER_TO_FLUSH                                       88093         90288232
XE_TIMER_EVENT                                        62978        180556635
PREEMPTIVE_OS_QUERYREGISTRY                           51491             2774
PREEMPTIVE_XE_TARGETINIT                              46804             1814
PREEMPTIVE_XE_TARGETFINALIZE                          46800              500
SQLTRACE_INCREMENTAL_FLUSH_SLEEP                      45134        180556132
REQUEST_FOR_DEADLOCK_SEARCH                           36113        180570615
QDS_ASYNC_QUEUE                                       19638        179547616
CHECKPOINT_QUEUE                                      15606        180529961
PAGEIOLATCH_SH                                        15505             3885
The units of the Wait time column, wait_time_ms, are milliseconds of accumulated delay time. Note that the counter values reflect cumulative counts over the entire lifetime of that execution of the SQL Server process. With the exception of some fields that are explicitly labeled as maximum values, this is common to all the DMV counters. (The sys.dm_os_wait_stats View contains a column labeled max_wait_time_ms.) The maximum fields maintain global maximum values observed over the lifetime of the SQL Server instance. Executing a

DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);

command resets all the counter fields to zero.

The first aspect of this OS Wait data that requires further explanation is the wait_type identifier field. The Transact-SQL documentation of the View includes a section that enumerates the Wait reasons and sometimes provides additional explanatory text. If you navigate there, you will learn that a HADR_FILESTREAM_IOMGR_IOCOMPLETION Wait occurs when "(t)he FILESTREAM Always On I/O manager is waiting for I/O completion," or that a MEMORY_ALLOCATION_EXT Wait occurs, according to the doc, when a thread is waiting "while allocating memory from either the internal SQL Server memory pool or the operation (sic) system," all of which may or may not prove helpful in diagnosing a performance problem. The prefix "HADR" refers to the Engine's High Availability/Disaster Recovery (Always On) components, which can block in a variety of ways. Some entries in the doc, like the one for the DIRTY_PAGE_POLL Wait reason, contain no further explanation. Fortunately, the name itself is reasonably self-explanatory. A background worker thread constantly checks for "dirty" (i.e., updated in memory, but not yet hardened to disk) database pages that need to be written asynchronously to disk; it is known internally as the Dirty Page polling thread. Whenever it is waiting to be dispatched, its wait time accumulates in this bucket.

I would hazard a guess that any of the voluntary SLEEP waits, like the SLEEP_TASK one, are not significant from a performance standpoint. The SQL Server Engine puts a worker thread to sleep when its queue of work items is empty. Similarly, an XE_TIMER_EVENT Wait also has little performance impact; these figure to be periodic Waits associated with xEvent-related timer tasks. (Turning on certain classes of high-volume xEvents, on the other hand, can have a very significant performance impact.)

Calculating the average Wait Time per Wait suggests that the "PREEMPTIVE" Waits are not very consequential either. The average wait time per Wait for preemption Waits on this machine is an inconsequential handful of microseconds. On the other hand, the average Wait Time calculation for the HADR_FILESTREAM_IOMGR_IOCOMPLETION Wait reason yields about 500 ms per Wait event. This is consistent with an obviously IO-related Wait. I observed the Physical Disk counter for IO response time (Avg. Disk secs/Transfer in Perfmon) over the same interval and saw that disk response times were also in the neighborhood of 500 milliseconds per IO. (Making an educated guess that this poor disk response time was due to slow write operations to the SSD drive, I plugged a conventional USB disk into the portable machine I use for a desktop and migrated the SQL Server databases to that new disk. Since re-locating the database files, my SQL Server timing benchmark runs, which are heavily write-oriented, now execute almost ten times faster.)
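The per-Wait averages discussed above can be computed directly from the DMV. This is a minimal sketch using only columns documented for sys.dm_os_wait_stats; the NULLIF guard is just a defensive touch against division by zero:

```sql
-- Average wait time (ms) per wait event, largest accumulated totals first
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       wait_time_ms * 1.0 / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE waiting_tasks_count > 0
ORDER BY wait_time_ms DESC;
```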

Wait Categories

The documentation references hundreds of Wait reasons, many of which are obscure to even expert SQL Server DBAs. For instance, there are a slew of Lock waits, wait_types that all begin with the characters "LCK_". Database locking has the potential to cause very serious performance problems, but because DBAs understandably might not know the difference between an Intent Update lock versus an Intent Exclusive lock, it is standard practice to aggregate all Lock-related Wait reasons into one category of Lock waits. To aggregate the Lock wait reasons, you could use the Like keyword in the Where clause to perform a generic Search of the View:

Select SUM(waiting_tasks_count) as Waits, SUM( wait_time_ms) as WaitTime 
from sys.dm_os_wait_stats Where wait_type Like 'LCK_%'

Similarly, SQL Server experts like Brent Ozar recommend aggregating all the OS Wait stats into general Wait reason categories: for example, grouping all the Lock waits together, the Latch waits together, the IO-related waits together, all the waits related to the preemptive Scheduler together, and so on. Aggregating the Wait statistics by Wait category results in some loss of detail, but yields a more economical representation of the data, definitely a practical concern if you are considering storing the Wait statistics longer term for analysis and reporting.
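A minimal sketch of this kind of prefix-based rollup is shown below. The category labels and the prefixes chosen here are illustrative only, not the official Query Store mapping; a production version would enumerate many more wait_type prefixes:

```sql
-- Roll up OS Wait stats into broad, illustrative categories by prefix
SELECT wait_category,
       SUM(waiting_tasks_count) AS waits,
       SUM(wait_time_ms)        AS wait_time_ms
FROM (SELECT CASE
                WHEN wait_type LIKE 'LCK_%'         THEN 'Lock'
                WHEN wait_type LIKE 'PAGEIOLATCH_%' THEN 'Buffer IO'
                WHEN wait_type LIKE 'PAGELATCH_%'   THEN 'Buffer Latch'
                WHEN wait_type LIKE 'LATCH_%'       THEN 'Latch'
                WHEN wait_type LIKE 'PREEMPTIVE_%'  THEN 'Preemptive'
                ELSE 'Other'
              END AS wait_category,
              waiting_tasks_count,
              wait_time_ms
      FROM sys.dm_os_wait_stats
      WHERE waiting_tasks_count > 0) AS w
GROUP BY wait_category
ORDER BY wait_time_ms DESC;
```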

That is also the approach the new Query Store facility adopts, aggregating and storing the Wait time by Wait category at an individual Query Plan level. The Wait time that accumulates each Query Store measurement interval (which defaults to 15 minutes) is aggregated by Wait category at the Query level and stored in the sys.query_store_wait_stats Table, one of several Query Store-specific Tables that are defined and maintained whenever the Query Store is enabled. Notice that these are Tables where the measurement data is stored permanently, not memory-resident Views, so the more economical representation of the Wait time measurements using Wait categories is crucial.
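Assuming the Query Store is enabled on the database, a query along these lines surfaces the stored per-plan wait categories; the column names come from the documented sys.query_store_wait_stats catalog view:

```sql
-- Top wait categories accumulated per query plan in the Query Store
-- (requires the Query Store to be enabled on the current database)
SELECT ws.plan_id,
       ws.wait_category_desc,
       SUM(ws.total_query_wait_time_ms) AS total_wait_ms
FROM sys.query_store_wait_stats AS ws
GROUP BY ws.plan_id, ws.wait_category_desc
ORDER BY total_wait_ms DESC;
```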

The classification scheme the Query Store facility uses maps the OS Wait reasons into about 25 generic categories. If you are making your own T-SQL calls to query the OS Wait DMV, you can get a code snippet that maps the wait reasons into wait categories, courtesy of the Microsoft SQL Server Tiger Team support group. (The Tiger Team T-SQL code is available on GitHub at https://github.com/Microsoft/tigertoolbox/blob/master/Waits-and-Latches/view_Waits.sql). 

However, I can also sympathize with the approach Brent Ozar, the SQL performance guru, uses in his Blitz tools, where he has decided to continue to maintain his existing set of Wait categories. There are some minor differences in the way Blitz rolls up the Wait reason statistics, compared to the Query Store. With legacy code to support, it is understandable that Ozar prefers the familiarity of his existing aggregation scheme. Me, I am just getting started in this area, so it makes sense to treat the OS Wait reason to Wait category mappings that the Query Store uses as authoritative. In the process I have noticed several OS Wait reasons that do not map into any of the categories the Query Store uses, at least according to the documentation. I assume this is an oversight in either the mapping scheme or the doc that will eventually be corrected.

Detecting performance problems

It should be easy to fathom why the OS Wait data is so important when it comes to diagnosing and solving performance problems. Reducing Wait Time speeds up Query execution. Targeting for improvement those specific Wait reasons that are slowing down Query execution time the most is the tuning method recommended by most SQL Server performance experts.

For example, if query execution is being delayed by resource shortages for CPU and Memory, it makes sense to try to add more processors and/or physical machine memory for this instance of SQL Server to use. When the SQL instance is running in the cloud on Azure or AWS or resides on an on-premises virtual machine (vm), you can add processors and physical memory by granting the vm additional Host machine resources. (When you provision the vm with additional memory, remember to check the database instance's Maximum server memory setting to make sure that the instance can take advantage of that additional memory.)
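Checking that setting is a one-liner with sp_configure; the 16384 MB value in the commented line is purely illustrative:

```sql
-- Inspect (and optionally raise) the instance's Max server memory setting
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)';   -- reports the current value
-- EXEC sp_configure 'max server memory (MB)', 16384;  RECONFIGURE;
```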

On the other hand, when Buffer IO delays predominate, you will want to consider any of the IO performance tuning options available to you: move to faster disk devices, spread the database across more disk devices to boost throughput, reduce contention on the disks currently in use, etc. When I saw the amount of Wait Time associated with the HADR_FILESTREAM_IOMGR_IOCOMPLETION Wait reason on the SQL Server instance I use for development, I immediately looked for a hardware solution: a faster disk.

When you make any configuration and tuning changes, you will want to refer back to these statistics to see if the change is benefitting your workload. This is where archiving a baseline set of OS Wait measurements comes in handy; comparing the current OS Wait stats collected after the configuration change to the baseline measurements lets you quantify the extent of any improvement.
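One simple way to keep such a baseline is to snapshot the DMV into a permanent table and diff against it after the change; the table name dbo.wait_stats_baseline here is hypothetical:

```sql
-- Take a baseline snapshot of the cumulative wait stats
SELECT GETDATE() AS snapshot_time, wait_type, waiting_tasks_count, wait_time_ms
INTO dbo.wait_stats_baseline
FROM sys.dm_os_wait_stats;

-- Later: wait time accumulated since the baseline was captured
SELECT c.wait_type,
       c.wait_time_ms - b.wait_time_ms AS delta_wait_ms
FROM sys.dm_os_wait_stats AS c
JOIN dbo.wait_stats_baseline AS b
  ON b.wait_type = c.wait_type
WHERE c.wait_time_ms > b.wait_time_ms
ORDER BY delta_wait_ms DESC;
```

Note that because the DMV counters are cumulative, the delta is only meaningful if the instance has not been restarted (and the counters not cleared) since the baseline was taken.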

The OS Wait stats can also identify database performance issues related to the way the database is designed or the way applications choose to access it. These are performance issues that cannot be relieved effectively by simply provisioning a more powerful machine. For instance, consider the case when Network IO waits predominate. Requests are sent to SQL Server using a network protocol, with replies returned over the same connection. Network IO delays can and will occur when the network interface is slow or congested. But they can also occur when SQL Server Reply messages are quite large, requiring the underlying network transport, namely TCP/IP, to break them into an excessive number of smaller packets for transmission. (The effective payload of an Ethernet packet is approximately 1500 bytes, so, for example, a 15 MB result set translates into roughly 10,000 packets.) The TCP/IP protocol breaks large messages into packet-sized datagrams for transmission across the network and places reassembly instructions into the packet header. TCP/IP at the receiving endpoint is responsible for reassembling the original Application-layer message from the packets received. If a Query returns a large number of rows and columns in a result set, Network IO delays will increase due to the additional time spent in the TCP/IP networking stack converting these large Reply messages into packets suitable for transmission. Tuning queries so that they do not return such large result sets can be the most effective way to reduce the amount of Network IO delay.

The general point is that the generic name SQL Server uses for these delays, namely OS Waits, is often a misnomer. These are worker thread delays whose causes include waiting for OS services like disk and network IO operations to complete, but they are not limited to machine resource shortage delays.

Latches and Locks 

Delays associated with the OS Wait categories of Locks and Latches usually have little to do with OS services and everything to do with your database applications. Latches are a synchronization mechanism used internally inside the SQL Server process address space to protect the integrity of in-memory pages, which include the buffer pools, memory used to cache recently used Plans and data, and other shared data structures. There is more detail about what function SQL Server Latches serve in an excellent white paper from the Microsoft SQL Server support team that is available here. For example, the white paper describes how Latch contention increases in databases with a large amount of parallelism. Parallel threads running concurrently may contend for shared memory-resident resources like buffers. The white paper mentions some of the ways you can mitigate these scaling problems.

Prior to the Query Store being available, you could resolve current Latch waits at the session level using sys.dm_os_waiting_tasks and sys.dm_exec_sessions. Beginning in SQL Server 2016, the release that introduced the Query Store (which breaks down waits at the Query execution plan level), there is also a sys.dm_exec_session_wait_stats DMV that breaks out the Wait stats data at the database session level.
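A sketch of a per-session wait query, joining the two DMVs mentioned above to filter the rows down to user sessions:

```sql
-- Wait stats broken out per user session (SQL Server 2016 and later)
SELECT s.session_id,
       w.wait_type,
       w.waiting_tasks_count,
       w.wait_time_ms
FROM sys.dm_exec_session_wait_stats AS w
JOIN sys.dm_exec_sessions AS s
  ON s.session_id = w.session_id
WHERE s.is_user_process = 1
ORDER BY w.wait_time_ms DESC;
```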

SQL Server uses locks to ensure the consistency of data in the database whenever it is being updated. Requests that are executing concurrently that need access to the resource that is protected by a lock are forced to wait until the Request holding the lock releases it upon completion. SQL Server issues resource locking requests constantly; however, lock wait time only accumulates when there is lock contention.

Database locking is a complex topic, and I won’t attempt more than a cursory discussion here. Understanding the sources of severe lock contention is one of the most difficult areas of database performance and tuning. Fundamentally, lock contention occurs when there are multiple concurrent Readers and Writers attempting to access the same database rows and columns. A Writer thread acquires a lock on the resource it intends to update, which can be a specific row or a column entry in an Index. Depending on the type of lock the Writer requests, it can block Readers from access to the resource until it has finished making its update, which can also depend on other updates that must be applied in tandem, like debiting one of your bank accounts while crediting another one when you transfer money between them. Wrapping the debit and credit operations into a single transaction assures the consistency of both operations. Meanwhile, a Reader trying to access that same information may not care that the information it retrieves is out of date while an Update transaction is pending.

Locking semantics for database requests are specified in SQL Server using cursors and transactions. For example, if a transaction (a designated, logically related sequence of Requests with the ACID characteristics of being atomic, consistent, isolated, and durable) requests an Update of an Indexed field using a dynamic cursor that guarantees the consistency of a repeatable Read operation, SQL Server takes an exclusive lock on that Index to prevent other Readers and Writers from accessing the range of data affected by the transaction’s scope until the transaction completes.

Fixing severe lock contention problems can require application changes, physical database changes, or both. The OS Wait stats quantify the delays associated with locking; similar, but less granular, data on lock contention is available from the existing SQL Server performance counters, using the SQL Server:Locks object. Digging into the source of lock contention problems in SQL Server often involves gathering and analyzing individual lock_acquired and lock_released xEvents. (There is a good example that shows how to do this published here.)
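An Extended Events session for those two events might be defined along these lines. The session name and output file name are placeholders, and be warned that an unfiltered lock trace like this can be extremely high volume on a busy instance:

```sql
-- Sketch of an xEvents session capturing lock acquire/release activity
CREATE EVENT SESSION LockTrace ON SERVER
ADD EVENT sqlserver.lock_acquired,
ADD EVENT sqlserver.lock_released
ADD TARGET package0.event_file (SET filename = N'LockTrace.xel');

ALTER EVENT SESSION LockTrace ON SERVER STATE = START;
-- ...reproduce the contention, then:
-- ALTER EVENT SESSION LockTrace ON SERVER STATE = STOP;
```

In practice you would add a predicate (for example, on database_id or a specific object) to the ADD EVENT clauses to keep the event volume manageable.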

The most serious class of delays associated with lock contention is deadlocks. Suppose Request A holds a lock on resource r1, but is blocked from continuing to execute until it can obtain a lock on resource r2. Meanwhile, Request B, which holds the lock on resource r2, cannot proceed until it can obtain the lock controlling access to resource r1. Both Requests are blocked, and neither can make forward progress until one of the Requests times out. When SQL Server unwinds the failed request, it releases any locks that request was holding. This condition is a deadlock, also sometimes known as a deadly embrace. Deadlocks can be ugly, caused by a chain of Requests against interlocking resources.

SQL Server provides Windows performance counters in the SQL Server:Locks object for tracking deadlocks at a high level, a set of DMVs for understanding which transactions are currently in flight, and a series of detailed Lock xEvents to help you unravel complicated chains of blocked Requests. The documentation for the Lock:Deadlock and Lock:Deadlock Chain xEvents is especially helpful in the context of understanding the assorted Lock Waits. The Lock:Deadlock xEvent provides the locking mode. Lock modes range from a simple Shared request, which permits other concurrent Readers but blocks a Requestor that requires exclusive access, to an Update with Intent Exclusive lock request that pessimistically blocks other Readers. The xEvent documentation helpfully explains that an Update with Intent Exclusive lock request is abbreviated as LCK_M_UIX, so it serves as a useful cross-reference to the various Lock-related Waits reported in the sys.dm_os_wait_stats DMV.

The Lock:Deadlock xEvent also reports the type of resource that is being locked, essentially its scope. The scope of the lock can be as broad as an entire Database, but is typically more granular, limited to a specific Row ID (RID), a specific Key in a Table, a specific 8 KB data Page, or an extent, defined as a set of 8 contiguous data Pages. The lock resource types documented in the deadlock xEvents correspond to the Windows performance counter instance names used in reporting on lock activity in the SQL Server:Locks object. A helpful reference on SQL Server deadlocks is here.

The Lock class of xEvents even provides an event that documents when a lock escalation occurs. Lock escalation is SQL Server automatically converting a series of related fine-grained locks into a single, coarser-grained lock. (There is a more detailed explanation of the mechanics of SQL Server lock escalation here.) When a threshold number of locks accumulates at, for example, the Row ID level for rows within a Table, SQL Server will automatically consolidate those individual fine-grained locks into a single lock against the entire Table, which reduces the sheer amount of lock management the SQL Engine must perform. The serious downside of the coarser-grained locks produced when lock escalation occurs is that they increase the potential for lock contention.
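If escalation-induced contention turns out to be the problem, escalation behavior can be adjusted per table; the table name dbo.Orders below is hypothetical, and disabling escalation trades contention for higher lock-management overhead:

```sql
-- Prevent SQL Server from escalating row/page locks to a table lock
-- on this (hypothetical) table in most circumstances
ALTER TABLE dbo.Orders SET (LOCK_ESCALATION = DISABLE);

-- The default behavior can be restored with:
-- ALTER TABLE dbo.Orders SET (LOCK_ESCALATION = TABLE);
```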
