PMDK
src/libpmemobj/palloc.c
Also see Doug Lea's malloc, jemalloc, and tcmalloc.
Observation
Performance is dominated by cache-line flushes to persistent memory (i.e. pmem_persist()); a few more operations on volatile data won't hurt.
Therefore palloc adopts a hybrid approach: it keeps track of memory blocks in both transient (volatile) and persistent memory.
Allocation
- For large memory blocks (256 KiB chunks) a best-fit algorithm is used; coalescing is not deferred.
- For smaller sizes a segregated-fit algorithm is employed, with 35 allocation classes. The first time a memory block from a class is needed, an entire chunk (256 KiB) is split into smaller blocks of 8× the class's size.
E.g. on the first allocation from the 128 B class, a 256 KiB chunk is taken, giving 256 available blocks of 8 × 128 B = 1024 B each. Blocks not needed this time are inserted into a per-class tree.
The block size is chosen to be 8× the class size because it matches the bitmap representation, which is the on-media persistent representation.
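The sizing arithmetic above can be sanity-checked with a small sketch. CHUNK_SIZE and the helper names below are illustrative, not actual PMDK identifiers:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constant mirroring the 256 KiB chunk described above. */
#define CHUNK_SIZE (256u * 1024u)

/* One block covers 8 allocations of its class, so each block maps onto
 * exactly one byte (8 bits) of the on-media bitmap. */
static inline uint32_t block_size(uint32_t class_size)
{
    return 8u * class_size;
}

/* How many such blocks one chunk is split into. */
static inline uint32_t blocks_per_chunk(uint32_t class_size)
{
    return CHUNK_SIZE / block_size(class_size);
}
```

For the 128 B class this gives block_size(128) == 1024 and blocks_per_chunk(128) == 256, matching the example above.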
On-media Layout
Source Code
- pmalloc() generates an OPERATION_INTERNAL (and locks its lane), then passes the operation to palloc. The operation and redo-log machinery is needed because PMEM pool metadata manipulations must stay consistent.
- Redo log: used for allocation metadata consistency (an external one for user allocations, an internal one for log object allocations)
- Undo log: used for user data consistency. PMDK associates a generation number with undo-log snapshots, so a single atomic increment invalidates (that is, commits) past transactions, instead of having to iteratively remove log entries.
Redo and undo logging are complementary. Use redo logging for performance-critical code where deferred modifications are not a problem (flush only once, but changes are not immediately visible); use undo logging where ease of use matters (flush on every update, so changes are immediately visible).
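The generation-number trick can be sketched in plain volatile C. All names here are invented for illustration; real PMDK keeps the log entries in persistent memory and persists the generation counter itself:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SNAPSHOTS 8 /* illustrative fixed capacity */

/* One undo snapshot: the old value of a 64-bit word, tagged with the
 * generation of the transaction that took it. */
struct snapshot {
    uint64_t gen;
    uint64_t *addr;
    uint64_t old_val;
};

struct undo_log {
    uint64_t gen; /* current generation */
    struct snapshot s[MAX_SNAPSHOTS];
    int n;
};

/* Snapshot a word before modifying it. */
static void undo_snapshot(struct undo_log *log, uint64_t *addr)
{
    log->s[log->n++] = (struct snapshot){ log->gen, addr, *addr };
}

/* Commit: a single increment makes every snapshot of the finished
 * transaction stale -- no per-entry cleanup of the persistent log. */
static void undo_commit(struct undo_log *log)
{
    log->gen++;
    log->n = 0; /* transient bookkeeping only */
}

/* Abort/recovery: restore only snapshots of the current generation;
 * stale (committed) entries are skipped by the gen check. */
static void undo_rollback(struct undo_log *log)
{
    for (int i = log->n - 1; i >= 0; i--)
        if (log->s[i].gen == log->gen)
            *log->s[i].addr = log->s[i].old_val;
    log->n = 0;
}
```

On commit only the generation counter is written, which is what makes this cheaper than walking and erasing log entries one by one.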
- palloc_operation() (flow copied from the source code)
Simplified allocation process flow is as follows:
- reserve a new block in the transient heap
- prepare the new block
- create redo log of required modifications
- chunk metadata
- offset of the new object
- commit and process the redo log
And similarly, the deallocation process:
- create redo log of required modifications
- revert the chunk metadata back to the 'free' state
- set the destination of the object offset to zero
- commit and process the redo log
There's an important distinction in the deallocation process: it does not return the memory block to the transient container. That is deferred until no more memory is available.
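Both flows above share the same redo-log shape: stage the modifications, then commit and process them in one step. A minimal volatile sketch (invented names; real PMDK persists the log and each applied store with flushes):

```c
#include <assert.h>
#include <stdint.h>

#define REDO_CAP 8 /* illustrative fixed capacity */

/* One redo entry: a deferred 64-bit store. */
struct redo_entry { uint64_t *addr; uint64_t val; };

struct redo_log { struct redo_entry e[REDO_CAP]; int n; };

/* Record a modification without making it visible yet. */
static void redo_add(struct redo_log *r, uint64_t *addr, uint64_t val)
{
    r->e[r->n++] = (struct redo_entry){ addr, val };
}

/* "Commit and process": apply every recorded store, then reset.
 * In PMDK this is where the log, and then each target, is persisted. */
static void redo_process(struct redo_log *r)
{
    for (int i = 0; i < r->n; i++)
        *r->e[i].addr = r->e[i].val;
    r->n = 0;
}
```

Allocation stages the chunk-metadata update and the new object offset; deallocation stages the metadata revert and a zero store to the object offset. Neither becomes visible until redo_process() runs.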
Parameters
- struct palloc_heap *heap - the struct pmemobjpool::heap instance
- uint64_t off - used on free / realloc
- uint64_t *dest_off - [out] offset of the freshly allocated memory; updated consistently
- size_t size - requested size
- palloc_constr constructor - left NULL
- void *arg - left NULL
- uint64_t extra_field
- uint16_t object_flags
- uint16_t class_id - default 0 lets PMDK decide via heap_get_best_class(), which adopts an immediate-fit scheme
- uint16_t arena_id - default 0 lets PMDK decide via heap_thread_arena(), which adopts a least-used scheme
- struct operation_context *ctx - the memory operation to perform
Return Value
- 0 - ok
- -1 - errno is set
Steps
- create ops
- alloc: palloc_reservation_create() reserves a block in the transient heap
  - get the alloc class (from heap->heap_rt, the heap runtime). The default alloc class is managed separately as rt->default_bucket. Alloc classes are created by alloc_class.c/alloc_class_collection_new() during heap.c/heap_boot().
  - heap_bucket_acquire() gets the alloc class (bucket, or free list) for the thread (arena)
  - heap_get_bestfit_block() extracts a block from the bucket
    - repeatedly get_rm_bestfit()
      - ravl: TODO:
      - seglists: TODO:
    - heap_split_block() if the extracted whole block would be wasteful
  - alloc_prep_block() reserves the block in transient state
    - write_header(), see memblock.h
      - MEMORY_BLOCK_HUGE, MEMORY_BLOCK_RUN
      - HEADER_LEGACY, HEADER_COMPACT, HEADER_NONE
- free: TODO:
- alloc: palloc_exec_actions()
  - action_funcs[atype]->exec()
    - POBJ_ACTION_TYPE_HEAP -> palloc_heap_action_exec() -> act->m.mops->prep_hdr()
      - MEMORY_BLOCK_HUGE
      - MEMORY_BLOCK_RUN: prepare the bitmap
    - POBJ_ACTION_TYPE_MEM
    - POBJ_ACTION_TYPE_HEAP -> pmemops_drain()
  - operation_process()
    - ulog_entry_apply()
    - operation_process_persistent_redo()
      - ulog_store(): persistent memcpy() into the ulog (Unified Log), with checksum
      - ulog_process(): for each ulog entry, apply a callback
      - ulog_clobber(): zero out ulog metadata
    - operation_process_persistent_undo()
      - ulog_process()
  - unlock & action_funcs[atype]->on_unlock()
  - operation_finish()
Limitations
- Arenas of different threads share a global AVL tree tracking free segments, and there is also the global ulog, limiting scalability [Poseidon]
- Still too many metadata updates performed on PMem
- Programming Persistent Memory - A Comprehensive Guide for Developers
Poseidon [Middleware’20]
- Per-CPU sub-heap for scalability and high performance
- Undo log for singleton allocs, micro log for transactional allocs, truncated on success
- Buddy list keeps track of free blocks
- Hash table (as in F2FS) manages memory-block information (as opposed to PMDK's AVL tree)
- On collision, linear probing; if probing fails, defragmentation; if that still fails, extend the hash table (potential high tail latency)
- Lazy defragmentation (merge adjacent free blocks)
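The collision path above can be sketched as a plain open-addressing table. This is illustrative only; names are invented, and Poseidon's actual table additionally triggers defragmentation and resizing when probing fails:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_CAP 16 /* illustrative fixed capacity */

/* Maps a block offset to its metadata; key 0 marks an empty slot. */
struct block_table { uint64_t key[TABLE_CAP]; uint64_t meta[TABLE_CAP]; };

/* Insert with linear probing; returns 0 on success, -1 when every probe
 * fails (Poseidon would then defragment, and finally grow the table). */
static int table_put(struct block_table *t, uint64_t off, uint64_t meta)
{
    uint64_t h = off % TABLE_CAP;
    for (unsigned i = 0; i < TABLE_CAP; i++) {
        uint64_t slot = (h + i) % TABLE_CAP;
        if (t->key[slot] == 0 || t->key[slot] == off) {
            t->key[slot] = off;
            t->meta[slot] = meta;
            return 0;
        }
    }
    return -1;
}

/* Lookup following the same probe sequence. */
static int table_get(const struct block_table *t, uint64_t off,
                     uint64_t *meta)
{
    uint64_t h = off % TABLE_CAP;
    for (unsigned i = 0; i < TABLE_CAP; i++) {
        uint64_t slot = (h + i) % TABLE_CAP;
        if (t->key[slot] == 0)
            return -1;
        if (t->key[slot] == off) {
            *meta = t->meta[slot];
            return 0;
        }
    }
    return -1;
}
```

Compared with a global AVL tree, a lookup here touches at most a short probe run and needs no rebalancing, at the cost of the tail-latency risk noted above when the table must be defragmented or extended.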
ArchTM [FAST’21]
- PMDK libpmemobj’s transaction incurs too many metadata updates on PMem
- Small writes (< 256 B) cause CoW, wasting bandwidth and delaying transactions
- The CoW design avoids the double writes and random writes of logging-based designs
- PMDK's size-class-based memory allocation is locality-unfriendly (contiguous allocation requests may be satisfied with separate blocks from the free list), reducing chances of coalescing writes
2 design principles and 5 techniques:
- Avoid small writes on PM
- Logless
- Minimize metadata modification on PM with guaranteed crash consistency
- object lookup table on DRAM
- TODO: transaction ID
- Scalable persistent object referencing
- Encourage coalescable writes
- Consecutive allocation requests get contiguous memory blocks
- Avoid memory fragmentation