Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems


1 ABSTRACT

The IO performance of storage devices has accelerated from hundreds of IOPS five years ago, to hundreds of thousands of IOPS today, and tens of millions of IOPS projected in five years. This sharp evolution is primarily due to the introduction of NAND-flash devices and their data-parallel design. In this work, we demonstrate that the block layer within the operating system, originally designed to handle thousands of IOPS, has become a bottleneck to overall storage system performance, especially on the high NUMA-factor processor systems that are becoming commonplace. We describe the design of a next-generation block layer that is capable of handling tens of millions of IOPS on a multi-core system equipped with a single storage device. Our experiments show that our design scales graciously with the number of cores, even on NUMA systems with multiple sockets.

2 Keywords

Linux, Block Layer, Solid State Drives, Non-volatile Memory, Latency, Throughput.

3 Introduction

As long as secondary storage has been synonymous with hard disk drives (HDD), IO latency and throughput have been shaped by the physical characteristics of rotational devices: random accesses that require disk head movement are slow, and sequential accesses that only require rotation of the disk platter are fast. Generations of IO-intensive algorithms and systems have been designed based on these two fundamental characteristics. Today, the advent of solid state disks (SSD) based on non-volatile memories (NVM) (e.g., Flash or phase-change memory [11, 6]) is transforming the performance characteristics of secondary storage. SSDs often exhibit little latency difference between sequential and random IOs [16]. IO latency for SSDs is on the order of tens of microseconds, as opposed to tens of milliseconds for HDDs. Large internal data parallelism in SSDs enables many concurrent IO operations which, in turn, allows single devices to achieve close to a million IOs per second (IOPS) for random accesses, as opposed to just hundreds on traditional magnetic hard drives. In Figure 1, we illustrate the evolution of SSD performance over the last couple of years.

A similar, albeit slower, performance transformation has already been witnessed for network systems. Ethernet speed evolved steadily from 10 Mb/s in the early 1990s to 100 Gb/s in 2010. Such a regular evolution over a 20-year period has allowed for a smooth transition between lab prototypes and mainstream deployments over time. For storage, the rate of change is much faster. We have seen a 10,000x improvement over just a few years. The throughput of modern storage devices is now often limited by their hardware (i.e., SATA/SAS or PCI-E) and software interfaces [28, 26]. Such rapid leaps in hardware performance have exposed previously unnoticed bottlenecks at the software level, both in the operating system and in application layers. Today, with Linux, a single CPU core can sustain an IO submission rate of around 800 thousand IOPS. Regardless of how many cores are used to submit IOs, the operating system block layer cannot scale beyond one million IOPS. This may be fast enough for today's SSDs - but not for tomorrow's.

We can expect that (a) SSDs are going to get faster, by increasing their internal parallelism [9, 8], and (b) CPU performance will improve largely due to the addition of more cores, whose individual performance may largely remain stable [24, 27].

If we consider an SSD that can provide 2 million IOPS, applications will no longer be able to fully utilize a single storage device, regardless of the number of threads and CPUs they are parallelized across, due to current limitations within the operating system.

Because of the performance bottleneck that exists today within the operating system, some applications and device drivers are already choosing to bypass the Linux block layer in order to improve performance [8]. This choice increases complexity in both driver and hardware implementations. More specifically, it duplicates code across error-prone driver implementations, and removes generic features such as IO scheduling and quality-of-service traffic shaping that are provided by a common OS storage layer.

Rather than discarding the block layer to keep up with improving storage performance, we propose a new design that fixes the scaling issues of the existing block layer, while preserving its best features. More specifically, our contributions are the following:

  1. We recognize that the Linux block layer has become a bottleneck (we detail our analysis in Section 2). The current design employs a single coarse-grained lock to protect the request queue, which becomes the main bottleneck to overall storage performance as device performance approaches 800 thousand IOPS. This single-lock design is especially painful on parallel CPUs, as all cores must agree on the state of the request queue lock, which quickly results in significant performance degradation.

  2. We propose a new design for IO management within the block layer. Our design relies on multiple IO submission/completion queues to minimize cache-coherence traffic across CPU cores. The main idea of our design is to introduce two levels of queues within the block layer: (i) software queues that manage the IOs submitted from a given CPU core (e.g., the block layer running on a CPU with 8 cores will be equipped with 8 software queues), and (ii) hardware queues mapped onto the underlying SSD driver's submission queues. A simplified sketch of this two-level structure follows this list.

  3. We evaluate our multi-queue design based on a functional implementation within the Linux kernel. We implement a new no-op block driver that allows developers to investigate OS block layer improvements. We then compare our new block layer to the existing one on top of this no-op driver (thus focusing purely on block layer performance). We show that a two-level locking design reduces the number of cache and pipeline flushes compared to a single-level design, scales gracefully in high NUMA-factor architectures, and can scale up to 10 million IOPS to meet the demand of future storage products.
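
To make the two-level structure in contribution 2 concrete, the following is a minimal userspace C sketch, not the kernel implementation: per-core software queues feed a smaller set of hardware dispatch queues. The names (struct queue, submit_io, dispatch, hw_queue_for) and the modulo core-to-hardware-queue mapping are illustrative assumptions rather than the kernel's actual API; the sketch only shows how the two lock domains are separated.

    #include <pthread.h>
    #include <stdio.h>

    #define NR_CPUS      8   /* assumed number of cores                  */
    #define NR_HW_QUEUES 2   /* assumed number of device dispatch queues */

    struct io_request {
        unsigned long long sector;
        unsigned int       len;
        struct io_request *next;
    };

    /* Per-core software queue (level 1) and shared hardware dispatch
     * queue (level 2). Each has its own lock; a software-queue lock is
     * effectively private to one core, and a hardware-queue lock is
     * shared only by the cores mapped to it. */
    struct queue {
        pthread_mutex_t    lock;
        struct io_request *head;
    };

    static struct queue sw[NR_CPUS];
    static struct queue hw[NR_HW_QUEUES];

    static void init_queues(void)
    {
        for (int i = 0; i < NR_CPUS; i++)
            pthread_mutex_init(&sw[i].lock, NULL);
        for (int i = 0; i < NR_HW_QUEUES; i++)
            pthread_mutex_init(&hw[i].lock, NULL);
    }

    /* Static core -> hardware queue mapping (an assumption; a real
     * driver negotiates the number of queues with the device). */
    static struct queue *hw_queue_for(int cpu)
    {
        return &hw[cpu % NR_HW_QUEUES];
    }

    /* Submission: touch only the submitting core's software queue. */
    static void submit_io(int cpu, struct io_request *rq)
    {
        pthread_mutex_lock(&sw[cpu].lock);
        rq->next = sw[cpu].head;
        sw[cpu].head = rq;
        pthread_mutex_unlock(&sw[cpu].lock);
    }

    /* Dispatch: drain the local software queue into the mapped
     * hardware queue in one batch, so the shared lock is taken once
     * per batch rather than once per IO. */
    static void dispatch(int cpu)
    {
        struct queue *hq = hw_queue_for(cpu);
        struct io_request *batch;

        pthread_mutex_lock(&sw[cpu].lock);
        batch = sw[cpu].head;
        sw[cpu].head = NULL;
        pthread_mutex_unlock(&sw[cpu].lock);

        pthread_mutex_lock(&hq->lock);
        while (batch) {
            struct io_request *next = batch->next;
            batch->next = hq->head;
            hq->head = batch;
            batch = next;
        }
        pthread_mutex_unlock(&hq->lock);
    }

    int main(void)
    {
        struct io_request rq = { .sector = 0, .len = 512 };

        init_queues();
        submit_io(0, &rq);   /* enqueue on core 0's software queue      */
        dispatch(0);         /* move it to the mapped hardware queue    */
        printf("request staged on hardware queue %d\n", 0 % NR_HW_QUEUES);
        return 0;
    }

The point of the design is visible in the locking: a submission only ever takes a lock that is local to the submitting core, and the shared hardware-queue lock is acquired once per drained batch.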

The rest of the paper is organized as follows: In Section 2 we review the current implementation of the Linux block layer and its performance limitations. In Section 3 we propose a new multi-queue design for the Linux block layer. In Section 4 we describe our experimental framework, and in Section 5, we discuss the performance impact of our multi-queue design. We discuss related work in Section 6, before drawing our conclusions in Section 7.

4 OS Block Layer

Simply put, the operating system block layer is responsible for shepherding IO requests from applications to storage devices [2]. The block layer is the glue that, on the one hand, allows applications to access diverse storage devices in a uniform way, and on the other hand, provides storage devices and drivers with a single point of entry from all applications. It is a convenience library that hides the complexity and diversity of storage devices from the application while providing common services that are valuable to applications. In addition, the block layer implements IO fairness, IO error handling, IO statistics, and IO scheduling, which improve performance and help protect end-users from poor or malicious implementations of other applications or device drivers.

4.1 Architecture

Figure 2 illustrates the architecture of the current Linux block layer. Applications submit IOs via a kernel system call that converts them into a data structure called a block IO. Each block IO contains information such as the IO address, IO size, IO modality (read or write), and IO type (synchronous/asynchronous)1. The block IO is then handed either to libaio, which submits asynchronous IOs to the block layer, or directly to the block layer for synchronous IOs. Once an IO request is submitted, the corresponding block IO is buffered in the staging area, which is implemented as a queue, denoted the request queue.

Once a request is in the staging area, the block layer may perform IO scheduling and adjust accounting information before scheduling IO submissions to the appropriate storage device driver. Note that the Linux block layer supports pluggable IO schedulers: noop (no scheduling), deadline-based scheduling [12], and CFQ [10], which can all operate on IO within this staging area. The block layer also provides a mechanism for dealing with IO completions: each time an IO completes within the device driver, the driver calls up the stack to the generic completion function in the block layer. The block layer in turn calls up to an IO completion function in the libaio library, or returns from the synchronous read or write system call, which provides the IO completion signal to the application.

With the current block layer, the staging area is represented by a request queue structure. One such queue is instantiated per block device. Access is uniform across all block devices, and an application need not know what the control flow pattern is within the block layer. A consequence of this single-queue-per-device design, however, is that the block layer cannot support IO scheduling across devices.
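
As a rough mental model of the structures described above, here is a heavily simplified C sketch of a block IO descriptor and the single per-device staging queue. The real definitions live in the kernel (see include/linux/blk_types.h, as noted in the footnote); the names simple_bio and simple_request_queue and the exact fields are illustrative assumptions only.

    #include <stdbool.h>

    /* Simplified stand-in for the kernel's block IO descriptor
     * (see include/linux/blk_types.h for the real struct bio). */
    struct simple_bio {
        unsigned long long sector;     /* IO address on the device    */
        unsigned int       nr_bytes;   /* IO size                     */
        bool               is_write;   /* modality: read or write     */
        bool               is_sync;    /* type: synchronous or async  */
        struct simple_bio *next;
    };

    /* One staging area (request queue) per block device, protected by
     * a single lock; this is the structure that Section 4.2 shows to
     * be the scalability bottleneck. */
    struct simple_request_queue {
        int                lock;       /* placeholder for the queue spinlock   */
        struct simple_bio *staged;     /* IOs waiting for scheduling/dispatch  */
    };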

4.2 Scalability

We analyzed the Linux kernel to evaluate the performance of the current block layer on high performance computing systems equipped with high NUMA-factor multi-core processors and high-IOPS NAND-flash SSDs. We found that the block layer incurred considerable overhead for each IO; specifically, we identified three main problems, illustrated in Figure 3:

  1. Request Queue Locking: The block layer fundamentally synchronizes shared accesses to an exclusive resource: the IO request queue. (i) Whenever a block IO is inserted or removed from the request queue, this lock must be acquired. (ii) Whenever the request queue is manipulated via IO submission, this lock must be acquired. (iii) As IOs are submitted, the block layer performs optimizations such as plugging (letting IOs accumulate before issuing them to hardware to improve cache efficiency), (iv) IO reordering, and (v) fairness scheduling. Before any of these operations can proceed, the request queue lock must be acquired. This is a major source of contention; a userspace sketch of this contention appears after this list.

  2. Hardware Interrupts: The high number of IOPS causes a proportionally high number of interrupts. Most of today's storage devices are designed such that one core (within CPU 0 on Figure 3) is responsible for handling all hardware interrupts and forwarding them to other cores as soft interrupts, regardless of the CPU issuing and completing the IO. As a result, a single core may spend considerable time handling these interrupts, context switching, and polluting L1 and L2 caches that applications could rely on for data locality [31]. The other cores (within CPU N on Figure 3) must then also take an IPI to perform the IO completion routine. As a result, in many cases two interrupts and context switches are required to complete just a single IO.

  3. Remote Memory Accesses: Request queue lock contention is exacerbated when it forces remote memory accesses across CPU cores (or across sockets in a NUMA architecture). Such remote memory accesses are needed whenever an IO completes on a different core from the one on which it was issued. In such cases, acquiring the request queue lock to remove the block IO from the request queue incurs a remote memory access to the lock state stored in the cache of the core where that lock was last acquired; the cache line is then marked shared on both cores. When it is updated, the copy is explicitly invalidated from the remote cache. If more than one core is actively issuing IO, and thus competing for this lock, then the cache line associated with this lock is continuously bounced between those cores.
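
The contention described in items 1 and 3 can be approximated in userspace. The sketch below is an analogy rather than kernel code: it pins one thread per core and has all of them hammer a single mutex protecting one shared counter that stands in for the request queue. The thread count, pinning policy, and iteration count are arbitrary assumptions; on a multi-socket machine, raising NR_THREADS so that submitters span sockets makes per-thread throughput drop as the lock's cache line bounces between sockets.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NR_THREADS     8        /* assumed number of submitting cores */
    #define OPS_PER_THREAD 1000000

    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned long queued;    /* stand-in for the shared request queue */

    static void *submitter(void *arg)
    {
        int cpu = (int)(long)arg;
        cpu_set_t set;

        /* Pin the thread so each "submitter" runs on its own core. */
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        for (long i = 0; i < OPS_PER_THREAD; i++) {
            pthread_mutex_lock(&queue_lock);   /* every core fights for this */
            queued++;                          /* "insert" into the queue    */
            pthread_mutex_unlock(&queue_lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NR_THREADS];

        for (long i = 0; i < NR_THREADS; i++)
            pthread_create(&tid[i], NULL, submitter, (void *)i);
        for (int i = 0; i < NR_THREADS; i++)
            pthread_join(tid[i], NULL);

        printf("total queued: %lu\n", queued);
        return 0;
    }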

Figure 4 shows 512-byte IOs being submitted to the kernel as fast as possible; IOPS throughput is depicted as a function of the number of CPUs that are simultaneously submitting and completing IOs to a single device. We observe that when the number of processes is lower than the number of cores on a single socket (i.e., 6), throughput improves, or is at least maintained, as more CPUs issue IOs. For the 2-, 4-, and 8-socket architectures that have largely supplanted single-socket machines in the HPC space, performance drops substantially when IOs are issued from a CPU located on a remote socket (and typically a remote NUMA node), regardless of the number of sockets in the system.
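
For reference, a workload of the kind just described (small direct IOs pushed at a single device as fast as possible) can be approximated from userspace with libaio, the asynchronous submission path mentioned in Section 4.1. The sketch below is only an assumption about the benchmark's shape, not the authors' tool: the device path /dev/nvme0n1, the queue depth, and the iteration count are placeholders, and a real benchmark would keep many requests in flight per CPU and pin submitters to cores.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define QUEUE_DEPTH 32
    #define BLOCK_SIZE  512

    int main(void)
    {
        /* Hypothetical target device; O_DIRECT bypasses the page cache
         * so every submission reaches the block layer. */
        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        io_context_t ctx = 0;
        if (io_setup(QUEUE_DEPTH, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, BLOCK_SIZE)) return 1;

        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;

        /* Submit one 512-byte read at a time at a pseudo-random offset
         * and wait for its completion. */
        for (long i = 0; i < 100000; i++) {
            long long offset = (random() % 1000000) * (long long)BLOCK_SIZE;
            io_prep_pread(&cb, fd, buf, BLOCK_SIZE, offset);
            if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); break; }
            if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) break;
        }

        io_destroy(ctx);
        free(buf);
        close(fd);
        return 0;
    }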

Remote cacheline invalidation of the request queue lock is significantly more costly on complex four- and eight-socket systems, where the NUMA factor is high and large cache directory structures are expensive to access. On four- and eight-socket architectures, request queue lock contention is so high that issuing IOs from multiple sockets reduces the throughput of the Linux block layer to about 125 thousand IOPS, even though high-end solid state devices capable of far higher IOPS have been on the market for several years. The scalability of the Linux block layer is not an issue that we might encounter in the future; it is a significant problem faced by HPC systems in practice today.

5 Multi-Queue Block Layer

As we have seen in Section 2.2, reducing lock contention and remote memory accesses are the key challenges when redesigning the block layer to scale on high NUMA-factor architectures. Dealing efficiently with the high number of hardware interrupts is beyond the control of the block layer (more on this below), as the block layer cannot dictate how a device driver interacts with its hardware. In this section, we propose a two-level multi-queue design for the Linux block layer and discuss its key differences and advantages over the current single-queue block layer implementation. Before we detail our design, we summarize the general block layer requirements.

6 English Words

6.1 graciously ['greiʃəsli] adv. graciously; gracefully

6.1.1 Basic Definitions

  • adv. kindly; benevolently; elegantly

6.1.2 Web References

  • graciously: kindly; gracefully; elegantly
  • win graciously: to win with grace
  • promise graciously: to promise politely

6.2 commonplace ['kɒmənpleɪs] n. an everyday occurrence

6.2.1 Basic Definitions

  • n. a platitude; an everyday occurrence; something ordinary
  • adj. ordinary; trite

6.2.2 Web References

  • commonplace: ordinary; undistinguished; mediocre
  • commonplace book: a book of excerpts and notes; a scrapbook
  • they commonplace: they achieve nothing remarkable

6.3 synonymous [sɪ'nɒnɪməs] adj. synonymous

6.3.1 Basic Definitions

  • adj. synonymous; of a synonym; (genetics) of a synonymous mutation

6.3.2 Web References

  • synonymous: synonymous; synonym
  • synonymous name: alternative name (e.g., of a strain); other name
  • synonymous substitutions: synonymous substitutions

6.4 advent ['ædvənt] n. advent, arrival

6.4.1 Basic Definitions

  • n. arrival; advent; the coming of Christ; Advent (the season)

6.4.2 Web References

  • advent: arrival; the Advent season
  • Second Advent: the Second Coming of Christ
  • Advent sermon: a sermon for the Advent season

6.5 magnetic [mæg'netɪk] adj. magnetic

6.5.1 Basic Definitions

  • adj. geomagnetic; magnetic; attractive

6.5.2 Web References

  • magnetic: having magnetism; of a magnet; magnetic force
  • magnetic tape: magnetic tape
  • magnetic field: magnetic field

6.6 couple ['kʌp(ə)l] of years n. a few years

6.6.1 Basic Definitions

  • a few years

6.6.2 Web References

  • a couple of years: two years; two or three years; a year or two
  • in a couple of years: in a few years
  • a couple of years ago: a few years ago

6.7 albeit [ɔːl'biːɪt] conj. although, even though

6.7.1 Basic Definitions

  • conj. although; even though

6.7.2 Web References

  • albeit: although; despite; admittedly
  • albeit mildly: albeit only mildly
  • albeit monotonous: albeit monotonous

6.8 witness [ˈwɪtnɪs] n. a witness; vt. to witness, to attest

6.8.1 Basic Definitions

  • n. witness; eyewitness; evidence
  • vt. to witness; to attest; to testify to
  • vi. to serve as a witness
  • n. (Witness) a personal name (Zimbabwe)

6.8.2 Web References

  • witnessed: saw with one's own eyes; witnessed
  • witnessed name: name of a witness
  • I witnessed: I witnessed

6.9 steadily ['stedɪlɪ] adv. steadily, firmly

6.9.1 Basic Definitions

  • adv. steadily; firmly; regularly

6.9.2 Web References

  • steadily: steadily; step by step; rising steadily
  • moving steadily: advancing steadily and surely
  • steadily continuously: continuously

6.10 sustain [sə'steɪn] vt. to sustain, to support

6.10.1 Basic Definitions

  • vt. to maintain; to support, to bear; to endure; to provide for; to corroborate

6.10.2 Web References

  • sustain: to maintain; to support; to keep up
  • sustain losses: to suffer losses
  • sustain button: sustain button (on an instrument)

6.11 propose [prə'pəʊz] vt. to propose, to put forward

6.11.1 Basic Definitions

  • vi. to suggest; to propose marriage; to intend
  • vt. to suggest; to intend, to plan; to propose marriage

6.11.2 Web References

  • propose: to propose marriage; to put forward; to suggest
  • propose explanation: to offer an explanation; a proposed explanation
  • marriage propose: a marriage proposal

6.12 investigate [ɪn'vestɪgeɪt] v. to investigate, to study

6.12.1 Basic Definitions

  • v. to investigate; to study

6.12.2 Web References

  • investigate: to investigate; to explore; to look into
  • investigate mugging: to investigate a street robbery
  • to investigate: to investigate; to probe; to inquire into

6.13 shepherd ['ʃepəd] vt. to shepherd

6.13.1 Basic Definitions

  • vt. to herd sheep; to guide; to direct; to watch over
  • n. shepherd; pastor; guide
  • n. (Shepherd) a surname (English)

6.13.2 Web References

  • shepherding: herding; watching over
  • extreme shepherding: extreme shepherding
  • shepherding song: a pastoral song

6.14 malicious [mə'lɪʃəs] adj. malicious, deliberate

6.14.1 Basic Definitions

  • adj. malicious; vicious; deliberate; spiteful

6.14.2 Web References

  • malicious: ill-intentioned; malicious; vicious
  • malicious prosecution: malicious prosecution; false accusation
  • malicious software: malicious software; malware

6.15 proportionally [prəu'pɔ:ʃənəli] adv. proportionally

6.15.1 Basic Definitions

  • adv. in proportion; correspondingly, suitably

6.15.2 Web References

  • proportionally: in proportion; proportionally
  • proportionally damped: proportionally damped
  • dilute proportionally: to dilute in proportion

6.16 exacerbate [ɪg'zæsəbeɪt; ek'sæs-] vt. to exacerbate, to aggravate

6.16.1 Basic Definitions

  • vt. to exacerbate; to worsen; to irritate

6.16.2 Web References

  • exacerbate: to make worse; to aggravate; to worsen
  • alleviate / exacerbate: to relieve / to worsen
  • exacerbate surroundings: to worsen environmental degradation

6.17 depict [dɪ'pɪkt] vt. to depict, to portray

6.17.1 Basic Definitions

  • vt. to describe; to depict

6.17.2 Web References

  • depicted: described; portrayed; depicted
  • landforms depicted: depiction of landforms
  • depicted thing: the thing depicted

6.18 supplant [sə'plɑːnt] vt. to supplant, to displace

6.18.1 Basic Definitions

  • vt. to replace; to supersede

6.18.2 Web References

  • supplanted: supplanted; superseded
  • having supplanted: having supplanted

6.19 substantially [səb'stænʃ(ə)lɪ] adv. substantially, largely, fully

6.19.1 Basic Definitions

  • adv. substantially; largely; fully

6.19.2 Web References

  • substantially: by a large margin; considerably
  • substantially equivalent: substantially equivalent
  • substantially similar: substantially similar

Footnotes:

1. See include/linux/blk_types.h in the Linux kernel (kernel.org) for a complete description of the block IO data structure.