Modern Ceph

  • Ceph Crimson OSD
    • Utilize
      • fast networking devices & storage devices
      • multi-core
      • modern techniques
    • Goals
      • bypass kernel
      • avoid memcpy
      • avoid lock contention
  • Ceph SeaStore

Modern Technologies

SeaStar

A C++ asynchronous programming framework.

  • user space task scheduler
    • no context switches, which minimizes CPU time spent in the kernel

Ceph’s crimson-osd uses SeaStar to simplify asynchronous development and to relieve the CPU of if-else branching.
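
To make the "user space task scheduler" idea concrete, here is a minimal sketch in standard C++. It is not SeaStar's API: TaskQueue, schedule(), run() and run_demo() are hypothetical names chosen for illustration. The sketch only shows the run-to-completion model, where tasks and their continuations run on one thread without involving the kernel scheduler.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical single-threaded, run-to-completion scheduler (not SeaStar's API).
class TaskQueue {
public:
    using Task = std::function<void()>;

    // Queue a task; it runs later on this same thread.
    void schedule(Task t) { tasks_.push_back(std::move(t)); }

    // Run tasks in FIFO order until the queue is empty. Tasks may schedule
    // follow-up tasks. Returns the number of tasks executed.
    std::size_t run() {
        std::size_t executed = 0;
        while (!tasks_.empty()) {
            Task t = std::move(tasks_.front());
            tasks_.pop_front();
            t();  // runs to completion; no preemption, no context switch
            ++executed;
        }
        return executed;
    }

private:
    std::deque<Task> tasks_;
};

// A two-step chain expressed as a continuation, interleaved with another task.
inline std::vector<int> run_demo() {
    TaskQueue q;
    std::vector<int> trace;
    q.schedule([&] {
        trace.push_back(1);                       // step 1, e.g. "issue read"
        q.schedule([&] { trace.push_back(3); });  // continuation, queued at the back
    });
    q.schedule([&] { trace.push_back(2); });      // unrelated task runs in between
    q.run();
    return trace;
}
```

A real framework adds futures, I/O polling and one such queue per core; the sketch keeps only the scheduling loop.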

Data Plane Development Kit (DPDK)

Provides a simple, complete framework for fast packet processing in data plane applications.

  • polling-mode

Used as the PCIe driver (environment) in SPDK. It handles MMIO (Memory-Mapped I/O) and PCI BAR (Base Address Register) mapping, which enables NVMe CMB (Controller Memory Buffer) and zero-copy transfers.

Storage Performance Development Kit (SPDK)

The bedrock of SPDK is a user space, polled-mode, asynchronous, lockless NVMe driver.

  • user space: no context switch, minimize CPU usage
    • applications issue NVMe commands to device directly
  • polled-mode: minimize latency & latency jitter
    • it fully occupies the assigned CPU core
    • interrupt mode relies on hardware interrupts, which effectively involves the kernel

      IO_URING

    • NVMe (DDIO) ensures polling only checks host memory (cache)
  • asynchronous
    • runtimes (reactors) are pinned to specific CPU cores

      If a request is not initially polled by the CPU / thread that owns the bound NVMe queue pair (aka the owning thread), it is preferable to forward the request to the correct thread via a message-passing mechanism rather than to introduce locking.

    • coroutines (spdk_threads) are scheduled by SPDK or a user-specified runtime (environment), instead of the operating system

      The default static scheduler simply round-robins across reactors (each on its designated core), polling for events.

  • lockless: ring buffer with CAS (compare-and-swap)
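
The "ring buffer with CAS" and "forward via message passing" points can be sketched together. The class below is a minimal bounded ring in standard C++, modeled on Dmitry Vyukov's bounded MPMC queue. It is not SPDK's ring implementation, and CasRing, try_push() and try_pop() are hypothetical names. Producers claim a slot with compare-and-swap; the owning thread drains the ring from its polling loop.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical lock-free bounded ring (illustration only, not SPDK code).
template <typename T>
class CasRing {
public:
    // Capacity must be a power of two so that (pos & mask_) wraps correctly.
    explicit CasRing(std::size_t capacity_pow2)
        : slots_(capacity_pow2), mask_(capacity_pow2 - 1) {
        assert(capacity_pow2 >= 2 && (capacity_pow2 & mask_) == 0);
        for (std::size_t i = 0; i < capacity_pow2; ++i)
            slots_[i].seq.store(i, std::memory_order_relaxed);
    }

    // Returns false when the ring is full; never blocks.
    bool try_push(T value) {
        std::size_t pos = tail_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& s = slots_[pos & mask_];
            std::size_t seq = s.seq.load(std::memory_order_acquire);
            auto diff = static_cast<std::ptrdiff_t>(seq) - static_cast<std::ptrdiff_t>(pos);
            if (diff == 0) {
                // Slot is free: claim it by advancing the tail with CAS.
                if (tail_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
                    s.value = std::move(value);
                    s.seq.store(pos + 1, std::memory_order_release);  // publish
                    return true;
                }
                // CAS failure reloaded pos; retry with the new value.
            } else if (diff < 0) {
                return false;  // slot still holds an unconsumed item: full
            } else {
                pos = tail_.load(std::memory_order_relaxed);  // lost a race
            }
        }
    }

    // Returns std::nullopt when the ring is empty; never blocks.
    std::optional<T> try_pop() {
        std::size_t pos = head_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& s = slots_[pos & mask_];
            std::size_t seq = s.seq.load(std::memory_order_acquire);
            auto diff = static_cast<std::ptrdiff_t>(seq) - static_cast<std::ptrdiff_t>(pos + 1);
            if (diff == 0) {
                if (head_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
                    T out = std::move(s.value);
                    // Mark the slot free for the producer one lap ahead.
                    s.seq.store(pos + mask_ + 1, std::memory_order_release);
                    return out;
                }
            } else if (diff < 0) {
                return std::nullopt;  // nothing published at this position: empty
            } else {
                pos = head_.load(std::memory_order_relaxed);
            }
        }
    }

private:
    struct Slot {
        std::atomic<std::size_t> seq{0};
        T value{};
    };
    std::vector<Slot> slots_;
    std::size_t mask_;
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```

In a message-passing setup, any thread that receives a request for a queue pair it does not own would call try_push() on the owner's ring, and the owner would call try_pop() on each iteration of its polling loop, so neither side takes a lock.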

Blobstore

For “blob” (or object) storage.

  • include/spdk/blob.h
  • lib/blob/

Device Abstraction

Abstraction (low to high) | Size                        | Function
Logical block             | 512B / 4KiB                 | Device physical block
Page (aka Extent)         | 4KiB                        | Device atomic op
Cluster                   | Configurable (default 1MiB) | Object size
Blob                      | Multitude of clusters       | Logical object
Blobstore                 |                             |

Op                        | Atomicity
Data writes               | Guaranteed
Blob metadata update      | Manual* / on unload
Blobstore metadata update | On unload**

* spdk_blob_sync_md()

** If the Blobstore is not shut down properly, it takes some time to fully start up again, but consistency is still guaranteed.
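
The size hierarchy above reduces to simple arithmetic. The helper below is a sketch, not part of SPDK: BlobGeometry and its members are hypothetical names, and the defaults assume a 4KiB logical block, a 4KiB page and a 1MiB cluster.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for Blobstore size arithmetic (not an SPDK type).
struct BlobGeometry {
    std::uint64_t block_size   = 4096;         // logical block, as reported by the device
    std::uint64_t page_size    = 4096;         // unit of atomic writes
    std::uint64_t cluster_size = 1024 * 1024;  // allocation unit, configurable

    // Logical blocks that make up one page.
    constexpr std::uint64_t blocks_per_page() const { return page_size / block_size; }

    // Pages that make up one cluster.
    constexpr std::uint64_t pages_per_cluster() const { return cluster_size / page_size; }

    // Clusters needed to hold `bytes` of blob data, rounded up.
    constexpr std::uint64_t clusters_for(std::uint64_t bytes) const {
        return (bytes + cluster_size - 1) / cluster_size;
    }
};
```

With these defaults a cluster holds 256 pages, and a blob of 3MiB plus one byte needs 4 clusters because blobs grow in whole clusters.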

Metadata

Stored in memory during runtime.

To avoid locking, a separate (SPDK-)thread is used to handle requests on metadata. However, it’s the caller’s responsibility not to mix up metadata requests with each other and with regular I/O requests.

Block Device Layer

A single generic library, lib/bdev, plus a number of optional modules that implement various types of block devices (equivalent to device drivers in a traditional operating system).

  • Bdev
    • include/spdk/bdev.h
    • lib/bdev/
  • Bdev module
    • include/spdk/bdev_module.h
    • module/bdev/

Bdevs can be layered! Bdevs that route I/O to other bdevs are often referred to as virtual bdevs (or vbdevs).
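
Layering can be pictured with a small sketch in standard C++. This is not the SPDK bdev API, which is an asynchronous C interface declared in include/spdk/bdev.h. Bdev, MemBdev and OffsetVbdev are hypothetical names, and the calls are synchronous for brevity. OffsetVbdev plays the role of a virtual bdev: it owns no storage and routes every I/O to a window of its base device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical synchronous block-device interface (illustration only).
class Bdev {
public:
    virtual ~Bdev() = default;
    virtual bool read(std::uint64_t offset, void* buf, std::size_t len) = 0;
    virtual bool write(std::uint64_t offset, const void* buf, std::size_t len) = 0;
};

// Leaf device backed by RAM.
class MemBdev : public Bdev {
public:
    explicit MemBdev(std::size_t size) : data_(size, 0) {}
    bool read(std::uint64_t offset, void* buf, std::size_t len) override {
        if (!in_range(offset, len)) return false;
        std::memcpy(buf, data_.data() + offset, len);
        return true;
    }
    bool write(std::uint64_t offset, const void* buf, std::size_t len) override {
        if (!in_range(offset, len)) return false;
        std::memcpy(data_.data() + offset, buf, len);
        return true;
    }

private:
    bool in_range(std::uint64_t offset, std::size_t len) const {
        return offset <= data_.size() && len <= data_.size() - offset;
    }
    std::vector<std::uint8_t> data_;
};

// Virtual bdev: exposes [start, start + size) of a base bdev as its own device.
class OffsetVbdev : public Bdev {
public:
    OffsetVbdev(std::shared_ptr<Bdev> base, std::uint64_t start, std::uint64_t size)
        : base_(std::move(base)), start_(start), size_(size) {}
    bool read(std::uint64_t offset, void* buf, std::size_t len) override {
        return in_range(offset, len) && base_->read(start_ + offset, buf, len);
    }
    bool write(std::uint64_t offset, const void* buf, std::size_t len) override {
        return in_range(offset, len) && base_->write(start_ + offset, buf, len);
    }

private:
    bool in_range(std::uint64_t offset, std::size_t len) const {
        return offset <= size_ && len <= size_ - offset;
    }
    std::shared_ptr<Bdev> base_;
    std::uint64_t start_;
    std::uint64_t size_;
};
```

Because OffsetVbdev is itself a Bdev, another virtual device can be stacked on top of it, which is the property the layering remark refers to.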

Fun fact: the pmem module internally uses libpmemblk from PMDK. Who said anything about user space, lockless and no context switching? See pmemblk_write.

Acceleration Framework

Intel I/O Acceleration Technology (I/OAT)

… I/OAT allows offloading data movement to dedicated hardware within the platform, reclaiming CPU cycles that would otherwise be spent on tasks like memcpy… Intel I/OAT can take advantage of PCI-Express nontransparent-bridging, which allows movement of memory blocks between two different PCIe connected motherboards, thus effectively allowing the movement of data between two different computers at nearly the same speed as moving data in memory of a single computer…

from Fast memcpy with SPDK and Intel I/OAT DMA Engine

Intel Data Streaming Accelerator (DSA)

Persistent Memory Development Kit (PMDK)

A collection of libraries for common use cases of storage-class memory (SCM).

  • libpmem(2): low-level
  • libpmemobj: transactional object store
    • libpmemobj++: STL-like programming model
  • libpmemkv: key in DRAM, value in PMEM
  • libpmemblk: atomically updated block / memory file
  • libpmemlog: persistent log file
  • libvmemcache: use DRAM as LRU cache of PMEM
  • libmemkind: DRAM as fast tier, PMEM as capacity tier

Source

Based on tag v16.2.5 (@ 0883bdea7337b95e4b611c768c0279868462204a).

  • HAVE_BLUESTORE_PMEM Compile Flag
    • pmem used as mmap()-ed file
  • Allocator type
    • BlueStore::shared_alloc defaults to "block" (see src/os/bluestore/BlueStore.cc/BlueStore::_create_alloc())
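
The "pmem used as mmap()-ed file" point can be illustrated with plain POSIX calls. This is a sketch, not Ceph's code: store_via_mmap() and load_via_mmap() are hypothetical helpers, and msync() stands in for the flush step. On real persistent memory mapped with DAX, a library such as libpmem would normally handle the flush instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: write `payload` to `path` through a shared mapping.
bool store_via_mmap(const char* path, const std::string& payload) {
    if (payload.empty()) return false;  // mmap() rejects a zero length
    int fd = ::open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return false;
    bool ok = false;
    // Size the file first; touching a mapping beyond EOF raises SIGBUS.
    if (::ftruncate(fd, static_cast<off_t>(payload.size())) == 0) {
        void* p = ::mmap(nullptr, payload.size(), PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            std::memcpy(p, payload.data(), payload.size());  // a store, not write(2)
            ok = (::msync(p, payload.size(), MS_SYNC) == 0);  // flush to the file
            ::munmap(p, payload.size());
        }
    }
    ::close(fd);
    return ok;
}

// Hypothetical helper: read the first `len` bytes of `path` through a mapping.
// Returns an empty string on any error or if the file is shorter than `len`.
std::string load_via_mmap(const char* path, std::size_t len) {
    std::string out;
    if (len == 0) return out;
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return out;
    struct stat st;
    if (::fstat(fd, &st) == 0 && static_cast<std::size_t>(st.st_size) >= len) {
        void* p = ::mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            out.assign(static_cast<const char*>(p), len);
            ::munmap(p, len);
        }
    }
    ::close(fd);
    return out;
}
```

The point of the mapping is that data is accessed with ordinary loads and stores, with no read or write system call per I/O.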

For all Ceph config options, see src/common/options.cc/get_global_options().

Readings