Extensible sync handler registration (register_sync_handler)
Hackers,
Motivation:
I am working on Lamstore, an extent based storage backend whose data lives
at byte offsets inside a volume file rather than as md segment files.
Lamstore uses a custom storage manager to route backend I/O, which works
today against stock PostgreSQL and Percona Server for PostgreSQL 18. The
checkpoint fsync pipeline, however, is a dead end: syncsw[] in sync.c is
a static const SyncOps[] indexed by the SyncRequestHandler enum, and there
is no way for an extension to install its own dispatch entry. An extension
cannot reuse SYNC_HANDLER_MD because mdsyncfiletag() resolves paths via
relpathperm() plus OpenTransientFile() plus fsync(), which does not reach
Lamstore storage.
The attached two patch series adds a dynamic register_sync_handler() API
to sync.c, parallel to RegisterCustomRmgr and the smgr_register shape
being developed on CF 5616. Built in handlers (MD, CLOG, commit_ts,
multixact_offset, multixact_member) keep their existing enum indices 0
through 4. Extensions get IDs starting at a new SYNC_HANDLER_FIRST_DYNAMIC.
The patch is strictly additive: zero WAL format changes, zero FileTag
layout changes, zero shared memory changes, zero catalog involvement.
Background:
Thomas Munro's 2019 refactor of the checkpointer fsync queue (commit
3eb77eba5a5, PG12) explicitly anticipated non md consumers. The commit
message reads:
"For now only md.c uses the new interface, but other users are being
proposed. Since there may be use cases that are not strictly SMGR
implementations, use a new function table for sync handlers rather
than extending the traditional SMGR one."
Seven years later the framework is in place but the registration
mechanism has never been added. This patch closes that gap.
Two reviewers have already voiced needs that this API makes trivial. In
Tristan Partin's 2024-01-12 message on the CF 5616 thread
(CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=PCNWEMrA@mail.gmail.com),
Heikki Linnakangas wrote:
"[I would like] a debugging tool that checks that we're not missing
any fsyncs. I bumped into a few missing fsync bugs with unlogged
tables lately and a tool like that would've been very helpful."
And Andres Freund wrote:
"I've been thinking that we need a validation layer for fsyncs, it's
too hard to get right without testing. My thought was that we could
keep a shared hash table of all files created / dirtied at the fd
layer, with the filename as key and the value the current LSN. We'd
delete files from it when they're fsynced."
Both needs become straightforward with register_sync_handler(): a
validation extension loads in shared_preload_libraries, registers its
own handler, and observes every sync request flowing through
ProcessSyncRequests() regardless of the underlying storage manager.
This is strictly stronger than the fsync_checker extension in CF 5616
v5+, which wraps md and can only observe md backed data.
What it does now:
syncsw[] is declared as:
static const SyncOps syncsw[] = {
[SYNC_HANDLER_MD] = { mdsyncfiletag, ... },
[SYNC_HANDLER_CLOG] = { clogsyncfiletag, ... },
[SYNC_HANDLER_COMMIT_TS] = { committssyncfiletag, ... },
[SYNC_HANDLER_MULTIXACT_OFFSET] = { multixactoffsetssyncfiletag, ... },
[SYNC_HANDLER_MULTIXACT_MEMBER] = { multixactmemberssyncfiletag, ... },
};
Adding a new handler requires patching core. No extension path exists.
What it will do:
The patch converts syncsw[] from static const SyncOps[] to a heap
allocated static SyncOps * grown via doubling repalloc, bounded by
shared_preload_libraries count. Built in handlers are pre-registered at
their existing enum indices by a new InitSyncHandlers() called from
PostmasterMain before process_shared_preload_libraries(). Extensions
call register_sync_handler() from _PG_init and receive a stable int16
handler ID, which they store in their FileTag.handler field at register
time:
/* src/include/storage/sync.h */
typedef struct SyncOps {
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
bool (*sync_filetagmatches)(const FileTag *ftag, const FileTag *candidate);
} SyncOps;
extern int16 register_sync_handler(const SyncOps *ops, const char *name);
typedef enum SyncRequestHandler {
SYNC_HANDLER_MD = 0,
SYNC_HANDLER_CLOG = 1,
SYNC_HANDLER_COMMIT_TS = 2,
SYNC_HANDLER_MULTIXACT_OFFSET = 3,
SYNC_HANDLER_MULTIXACT_MEMBER = 4,
SYNC_HANDLER_FIRST_DYNAMIC = 5,
SYNC_HANDLER_MAX = INT16_MAX,
SYNC_HANDLER_NONE = -1,
} SyncRequestHandler;
Dispatch is syncsw[idx].fn() both before and after. Each caller that
dispatches through the table (ProcessSyncRequests, SyncPostCheckpoint,
RememberSyncRequest) caches the base pointer in a local SyncOps *ops
at function entry so the compiler keeps it in a register for the
lifetime of the function, matching the register allocation the
pre-patch static const array received. Verified with objdump on GCC
14.2 at -O2 that the per-dispatch instruction sequence (movswq, mov,
mov, lea, call *(%r14,%rax,8)) is byte identical between stock and
patched builds. The only remaining delta is one additional memory
load at function entry to fetch the syncsw pointer (mov 0x0(%rip),%r14
versus stock's lea 0x0(%rip),%r14): a single L1 cache hit, paid once
per call to ProcessSyncRequests, not per dispatch.
The 0002 patch adds src/test/modules/test_sync_handler with a TAP test
verifying registration, dispatch, HASH_BLOBS coalescing, and cycle_ctr
skip semantics. Test layout mirrors fsync_checker in CF 5616 v5+.
Risk:
Strictly additive. Zero WAL format changes. Zero FileTag layout
changes. Zero shared memory changes. Zero catalog involvement.
Per-dispatch assembly is byte identical to stock (see above). Built
in enum values (SYNC_HANDLER_MD=0 et al) are preserved. ABI safe.
Recovery path unaffected: extensions cannot register during recovery
because InitSyncHandlers() runs in PostmasterMain before any child
process, and the built ins are all pre-registered at fixed IDs. No
behaviour change when no extension registers.
Pre-empting Andres Freund's 2023 objections to CF 4428 v1:
Your four review comments on the earlier smgr_register prototype
(postgrespro.com thread id 2654666, 2023-07-01) apply here by analogy.
Addressing them directly:
1. "Not a good place to initialize, we'll need it in multiple places
that way. How about putting it into BaseInit()?"
Our InitSyncHandlers() is called from a single site, PostmasterMain,
immediately before process_shared_preload_libraries(). Built ins
register via a private sync_handler_register_internal() from that
one site. Child processes inherit the array via fork(), which is a
full POSIX memory barrier. No per-process re-initialization. This
is the same lifecycle pattern as RegisterCustomRmgr.
2. "This adds another level of indirection. I would rather limit the
number of registerable smgrs than do that."
Doubling repalloc is bounded by shared_preload_libraries count
(small). Amortized O(1) cost at preload time. Registration happens
exactly once per extension per postmaster startup, before any
backend exists. Per-dispatch hot path cost is zero by construction:
each caller hoists syncsw into a local SyncOps *ops at function
entry, so the compiler keeps the base pointer in a register and
the per-entry dispatch compiles to byte-identical assembly as the
pre-patch static const array (verified with objdump, see Risk
paragraph above). The only measurable delta is one additional
memory load at function entry, paid once per ProcessSyncRequests
call (not per dispatch). If you still prefer a hard cap I am happy
to add SYNC_HANDLER_MAX_DYNAMIC as a compile-time constant.
3. "Huh, what's that about?" (on pg_compiler_barrier in registration)
Not included. Registration is single threaded during preload. No
barrier needed.
4. "It looked to me like you determined this globally, why do we need
it in every entry then?" (on per entry size fields)
Not applicable. SyncOps is pure function pointers. No per entry
size tracking.
Relationship to CF 5616:
This patch is a narrow focused companion to CF 5616 (Extensible storage
manager API), not a dependency or replacement. CF 5616 makes smgrsw[]
dynamic; this patch does the same for syncsw[]. The two are orthogonal:
none of 5616 v6's six sub patches touch sync.c or sync.h. This patch
applies cleanly against master today without 5616, and introduces none
of 5616's unresolved design questions (hook-vs-registration, catalog
recovery, per tablespace vs per relation config, GUC chaining). If 5616
lands later the two compose naturally: extensions call smgr_register()
and register_sync_handler() back to back in their _PG_init.
Percona carryover:
The patch also applies cleanly to percona/postgres PSP_REL_18_STABLE
for our production deployment. sync.c and sync.h have no Percona
specific changes on that branch (verified via gh api), so the same
patch applies identically modulo mechanical offsets. Upstream master
is the primary submission venue. A companion PR against Percona's
branch will reference this thread.
Verified locally:
* make check-world green on master (PG 19devel) with both patches
applied
* make -C src/test/modules/test_sync_handler check green on both
upstream master and percona/postgres PSP_REL_18_STABLE
I will open a CF 59 (PG20-1) entry today and link this thread's
Message-Id.
Thanks for reading. Feedback, counterproposals, and design pushback
all welcome.
Greg Lamberson
Lamco Development
greg@lamco.io
Attachments:
v1-0001-Make-sync.c-syncsw-extensible-via-register_sync_h.patchapplication/octet-stream; name=v1-0001-Make-sync.c-syncsw-extensible-via-register_sync_h.patchDownload+293-47
v1-0002-Add-test-module-for-sync-handler-registration.patchapplication/octet-stream; name=v1-0002-Add-test-module-for-sync-handler-registration.patchDownload+366-8
v2 attached. While running the v1 patches under CFBot (and
subsequently reproducing locally with CPPFLAGS=-DEXEC_BACKEND on
Linux), I discovered that v1 mis-handles processes that do not
inherit the postmaster's address space via fork(). v2 fixes this
and also addresses several unrelated items that I should have
caught in v1 self-review.
Changes in v2-0001:
* EXEC_BACKEND load-order fix. On platforms without fork()
(Windows, plus any build with CPPFLAGS=-DEXEC_BACKEND), each
child process re-runs process_shared_preload_libraries() in its
own fresh address space. In v1, extension _PG_init() could
reach register_sync_handler() before the child had called
InitSync(), so the first dynamic registration landed at ID 0,
colliding with SYNC_HANDLER_MD, and the idempotent guard in
InitSyncHandlers() ("if NSyncHandlers != 0 return") then
suppressed built-in registration entirely for the rest of that
process's lifetime. v2 calls InitSyncHandlers() at the top of
register_sync_handler() and replaces the counter-based guard
with an explicit builtin_sync_handlers_registered flag. The
v1 test that reported "got id=0" on Windows and FreeBSD CFBot
now reports id=5 under both fork and EXEC_BACKEND.
* Documentation. v1 shipped without SGML docs; v2 adds
doc/src/sgml/custom-sync-handler.sgml (modeled on
custom-rmgr.sgml) and registers it in filelist.sgml and
postgres.sgml. The doc build is clean.
* Error-message style. The four errmsg() strings that embedded
function names ("register_sync_handler: ...",
"test_sync_handler: ...") are reworded per the error-message
style guide. The two developer-bug paths that used
ERRCODE_NULL_VALUE_NOT_ALLOWED are changed to elog(FATAL, ...)
since they cannot be triggered from SQL.
* Stale comment. The v1 comment in
sync_handler_register_internal() claiming "fork() is full
POSIX barrier" was accurate only on fork-based platforms.
Rewritten to cover both fork and EXEC_BACKEND paths.
* SYNC_HANDLER_NONE enum value. Previously implicit 5 (after
MULTIXACT_MEMBER=4), changed in v1 to explicit -1 so that a
sentinel "no handler" value cannot be confused with a valid
index. I audited uses in master: the only references are
!= comparisons in slru.c at lines 1057, 1442, and 1558, which
are value-agnostic. Flagging explicitly here because it is
an ABI-visible enum value change.
Changes in v2-0002:
* Error-message style fixes in the test module to match the
core-side cleanups above.
* pgindent pass (required adding TshSharedState via
--list-of-typedefs since it is a test-module-local type).
Verification for v2:
* make check-world under autoconf on Linux, fork-based: all PASS
* make check-world under autoconf on Linux,
CPPFLAGS=-DEXEC_BACKEND: all PASS
* meson test under Linux with c_args='-DEXEC_BACKEND':
344 OK / 34 SKIP / 0 FAIL
* test_sync_handler/001_basic: 5/5 under all four combinations
* src/test/recovery: 597/597 under -DEXEC_BACKEND
* test_slru: 18/18 under -DEXEC_BACKEND (SLRU is the main user
of SYNC_HANDLER_NONE; the enum value change is safe)
* pgindent, pgperltidy (20230309), pgperlcritic: clean
* headerscheck on src/include/storage/sync.h in regular and
--cplusplus modes: clean
* doc/src/sgml builds cleanly; new chapter renders as expected
I apologize for the v1 verification gap. v1's "make check-world
green" claim was accurate on fork-based Linux only and did not
exercise EXEC_BACKEND; that is the single most important reason
the Windows failure slipped through. My pre-submission checklist
now includes the -DEXEC_BACKEND path.
Thanks,
Greg