Ceph BlueStore Deep Dive

Overview

[Figure: BlueStore metadata]
[Figure: BlueStore transaction]
[Figure: BlueStore state machine]

Operations

OP_WRITE

[Figure: Write procedure]
[Figure: Write modes]

src/os/bluestore/BlueStore.cc/BlueStore::_do_write_data()

Allocator

StupidAllocator

allocate() repeatedly calls allocate_int() until the allocated size reaches the wanted size.
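The loop described above can be sketched roughly as follows. This is a simplified model, not the Ceph code: allocate_loop, Extent and AllocateOnce are hypothetical names, the callback stands in for allocate_int(), and error handling (for example, what happens to partial allocations on failure) is not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// One allocated extent: (offset, length) in bytes.
using Extent = std::pair<uint64_t, uint64_t>;

// Hypothetical stand-in for allocate_int(): tries to allocate one extent of
// at most `want` bytes and returns false when no space is left.
using AllocateOnce = std::function<bool(uint64_t want, Extent* out)>;

// Keep asking for the remainder until the request is satisfied or the
// single-extent allocator gives up. Returns the number of bytes allocated.
uint64_t allocate_loop(uint64_t want, const AllocateOnce& allocate_once,
                       std::vector<Extent>* extents) {
  uint64_t allocated = 0;
  while (allocated < want) {
    Extent e;
    if (!allocate_once(want - allocated, &e)) {
      break;  // out of space: return what we have so far
    }
    // Each call must make progress and must not overshoot the remainder.
    assert(e.second > 0 && e.second <= want - allocated);
    extents->push_back(e);
    allocated += e.second;
  }
  return allocated;
}
```

A request may therefore be satisfied by several non-contiguous extents, one per call.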

src/os/bluestore/StupidAllocator.cc/StupidAllocator::allocate_int()

To understand how StupidAllocator works, the private member

std::vector<ceph::interval_set<
    /*offset/length type*/uint64_t,
    /*map impl*/btree_map<uint64_t/*offset*/, uint64_t/*length*/>
    >> free;

must be explained: the free list, as the name suggests, keeps track of available segments. The vector is indexed by the order of magnitude of the segment size: free[0] holds available segments of [0, 1) blocks (in units of bdev_block_size), and free[3] holds segments of [4, 8) blocks. Since interval_set is backed by an ordered associative container, the segments in each free list entry are naturally sorted by offset.

The number of entries in free is fixed to 10 on initialization, i.e. the maximum contiguous managed allocation block size is bdev_block_size << 9.

btree_map_t is like std::map, but is implemented as a B-tree rather than a red-black tree, for a smaller memory footprint.
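As a rough illustration of how a segment length maps to an index in free, here is a minimal sketch. It assumes the rule stated above (free[3] holds segments of [4, 8) blocks) and clamps to the last entry. The names effective_bits and bin_for_length are hypothetical, and __builtin_clzll requires GCC or Clang.

```cpp
#include <cassert>
#include <cstdint>

// Number of effective (significant) bits of v: 0 for v == 0,
// otherwise the position of the highest set bit plus one.
inline unsigned effective_bits(uint64_t v) {
  return v == 0 ? 0 : 64 - __builtin_clzll(v);
}

// Map a segment length in bytes to an index into `free`.
// block_size stands for bdev_block_size, num_bins for free.size().
inline unsigned bin_for_length(uint64_t len_bytes, uint64_t block_size,
                               unsigned num_bins = 10) {
  assert(block_size > 0 && num_bins > 0);
  unsigned bits = effective_bits(len_bytes / block_size);
  // Anything too large for its own bin falls into the last one.
  return bits < num_bins - 1 ? bits : num_bins - 1;
}
```

With an illustrative 4096-byte block, a 16 KiB segment (4 blocks) maps to bin 3, and in this sketch any segment of 256 blocks or more lands in the last bin, free[9].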

StupidAllocator then acts as a buddy allocator:

  1. _choose_bin() - returns a chosen bin orig_bin from available segments

    Given the target allocation size len (in blocks), returns the smaller of the number of effective bits of len and the index of the last entry in the free list.

    PERF: Implemented with the __builtin_clz(ll) count-leading-zeros intrinsic.

  2. For segments no smaller than orig_bin, i.e. the entries from free[orig_bin] onward, search heuristically starting from the hint address.

    The default hint is the address immediately after the last allocation.

  3. For segments no smaller than orig_bin, search from the lowest address (up to the hint, because the range after it has already been searched).
  4. For segments smaller than orig_bin, xxxx

    The goal here is to allocate at least something.
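The search order in the steps above can be sketched as follows. This is a simplified model under stated assumptions, not the Ceph implementation: std::map stands in for the B-tree-backed interval_set, find_segment, scan and Found are hypothetical names, and alignment is ignored. In particular, the order used for the smaller bins in step 4 (largest bin first, hint first) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Simplified free list entry: offset -> length, sorted by offset.
using Bin = std::map<uint64_t, uint64_t>;

struct Found {
  std::size_t bin;
  uint64_t offset;
  uint64_t length;
};

// Scan [first, last) of one bin for a segment of at least min_len.
inline bool scan(std::size_t bin, Bin::const_iterator first,
                 Bin::const_iterator last, uint64_t min_len, Found* out) {
  for (auto p = first; p != last; ++p) {
    if (p->second >= min_len) {
      *out = Found{bin, p->first, p->second};
      return true;
    }
  }
  return false;
}

inline bool find_segment(const std::vector<Bin>& bins, std::size_t orig_bin,
                         uint64_t hint, uint64_t want, Found* out) {
  assert(orig_bin < bins.size());
  // Step 2: bins >= orig_bin, starting at the hint address.
  for (std::size_t b = orig_bin; b < bins.size(); ++b) {
    if (scan(b, bins[b].lower_bound(hint), bins[b].end(), want, out))
      return true;
  }
  // Step 3: same bins, from the lowest address up to the hint.
  for (std::size_t b = orig_bin; b < bins.size(); ++b) {
    if (scan(b, bins[b].begin(), bins[b].lower_bound(hint), want, out))
      return true;
  }
  // Step 4 (ASSUMED order): smaller bins, largest first; accept any
  // non-empty segment so that at least something is allocated.
  for (std::size_t b = orig_bin; b-- > 0;) {
    if (scan(b, bins[b].lower_bound(hint), bins[b].end(), 1, out))
      return true;
    if (scan(b, bins[b].begin(), bins[b].lower_bound(hint), 1, out))
      return true;
  }
  return false;  // nothing free at all
}
```

Searching from the hint first tends to place consecutive allocations next to each other; the fallback to smaller bins returns a segment shorter than requested, which the caller would then complete with further calls.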

Heuristic search range and its order (bin_start is orig_bin)
  1. Manage free.

HybridAllocator