[PoC] Umbra: a remap-aware smgr prototype on PostgreSQL master

Started by Mingwei Jia | 2 months ago | 18 messages | hackers

i@nayishan.top

2 months ago

Hi hackers,

Apologies if my earlier attempt did not reach the list correctly. I
am sending this as a single PoC introduction with repository links only,
rather than as an attached patch series.

I would like to share a working Proof-of-Concept for Umbra, an
alternative smgr implementation on PostgreSQL master.

To be clear about scope: this is not a merge-ready proposal, and it
is not a new table AM or a separate storage engine. The goal is
narrower: to make the current design, code structure, recovery model, and
patch decomposition concrete enough for technical discussion, and to
preserve a usable baseline for anyone interested in continuing the work.

Umbra operates at the smgr layer. The central idea is to decouple
logical page identity from physical page placement, so that the ordinary
first-dirty-after-checkpoint path does not have to rely on
PostgreSQL's default full-page-image path in the same way. In the
current prototype:

- PostgreSQL callers still work in logical block numbers.
- Umbra maintains lblk -> pblk translation in its own metadata fork.
- WAL can publish remap state explicitly.
- redo reconstructs the correct mapping view before replaying page
contents.
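
The four points above can be illustrated with a toy model. This is only a sketch and not Umbra's code: `ToyRelation`, `remap_update`, and `redo_remap` are hypothetical names, pages are plain dictionaries, and the "WAL record" is a tuple. It shows how the caller-visible logical block number stays fixed while a remap publishes a new physical block, and how redo restores the mapping view before rebuilding page contents from the old physical page plus a delta.

```python
# Toy model of lblk -> pblk remapping. All names are hypothetical and are
# not taken from the Umbra sources.
from dataclasses import dataclass, field


@dataclass
class ToyRelation:
    pages: dict = field(default_factory=dict)    # pblk -> page contents
    mapping: dict = field(default_factory=dict)  # lblk -> pblk
    next_pblk: int = 0                           # monotonic allocator

    def allocate(self) -> int:
        pblk = self.next_pblk
        self.next_pblk += 1
        return pblk

    def read(self, lblk: int) -> dict:
        # Callers only ever pass logical block numbers.
        return self.pages[self.mapping[lblk]]

    def remap_update(self, lblk: int, delta: dict) -> tuple:
        """Write the updated page to a fresh pblk and return a WAL-like record."""
        old_pblk = self.mapping[lblk]
        new_pblk = self.allocate()
        self.pages[new_pblk] = {**self.pages[old_pblk], **delta}
        self.mapping[lblk] = new_pblk
        return (lblk, old_pblk, new_pblk, delta)


def redo_remap(rel: ToyRelation, record: tuple) -> None:
    """Replay: restore the mapping view first, then rebuild the page from
    the old physical page plus the delta carried by the record."""
    lblk, old_pblk, new_pblk, delta = record
    rel.mapping[lblk] = new_pblk
    rel.next_pblk = max(rel.next_pblk, new_pblk + 1)
    rel.pages[new_pblk] = {**rel.pages[old_pblk], **delta}
```

In this toy, the old physical page is never overwritten by the update, which is why the replay step can use it as its baseline.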

Umbra's metadata fork contains only two formats: a 512-byte
superblock for fork-level control state, and single-purpose MAP pages
for mapping entries. These are not ordinary heap/index pages. In that
respect they are closer to system control/state metadata such as
pg_control and pg_xact/SLRU pages, and they do not rely on PostgreSQL's
ordinary FPW path for data pages. Instead, they are protected by
Umbra-specific metadata WAL/redo rules for those two formats.
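
To make the two formats concrete, here is a minimal sketch under stated assumptions. Only the 512-byte superblock size comes from the description above; the field order, the magic value, the 4-byte entry size, and the absence of a MAP page header are all hypothetical and are not Umbra's on-disk format. `BLCKSZ = 8192` is PostgreSQL's default block size.

```python
import struct

SUPERBLOCK_SIZE = 512   # from the text: a 512-byte superblock
BLCKSZ = 8192           # PostgreSQL's default block size
ENTRY_SIZE = 4          # assumption: one uint32 pblk per MAP entry
ENTRIES_PER_MAP_PAGE = BLCKSZ // ENTRY_SIZE  # assumption: no page header

# Hypothetical layout: magic, logical EOF, physical capacity, allocator frontier.
_SUPER_FMT = "<IIII"
_TOY_MAGIC = 0x554D4252  # made-up value for this sketch


def pack_superblock(logical_eof, physical_capacity, frontier):
    """Serialize the toy superblock and pad it to exactly 512 bytes."""
    body = struct.pack(_SUPER_FMT, _TOY_MAGIC, logical_eof,
                       physical_capacity, frontier)
    return body.ljust(SUPERBLOCK_SIZE, b"\0")


def unpack_superblock(raw):
    """Parse a 512-byte toy superblock back into a dictionary."""
    if len(raw) != SUPERBLOCK_SIZE:
        raise ValueError("superblock must be exactly 512 bytes")
    magic, eof, capacity, frontier = struct.unpack_from(_SUPER_FMT, raw)
    if magic != _TOY_MAGIC:
        raise ValueError("bad magic")
    return {"logical_eof": eof, "physical_capacity": capacity,
            "frontier": frontier}


def map_slot(lblk):
    """Return (map_page_index, entry_index) holding the entry for lblk."""
    return divmod(lblk, ENTRIES_PER_MAP_PAGE)
```

With these assumed sizes, one MAP page would hold 2048 entries, so the entry for a logical block is found by simple division rather than by searching.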

The implementation is currently organized in the repository as:

- P0: design notes and repository navigation
- P1-P9: code patches covering smgr boundary, metadata fork, MAP
subsystem, WAL/redo, checkpoint integration, preallocation, and compaction

Current verification state:

- final tip passes `make check`
- final tip passes `make -C src/test/recovery check`
- strict per-patch state (matrix: MD and UMBRA builds, each running
  `make check` and the recovery suite) is:
  - P1-P5: all four matrix items pass
  - P6: MD make check / MD recovery / UMBRA make check pass, but
    UMBRA recovery does not pass
  - P7-P9: all four matrix items pass

That boundary is intentional in the current decomposition: P6
establishes the WAL record / birth / basic redo state-machine layer,
while P7 closes the ordinary remap / block-reference remap / checkpoint-
boundary replacement loop.

I do not want to overclaim on performance. The numbers below should
be read as directional PoC signals, not as a final benchmark claim.

On a TPC-C-style workload (BenchmarkSQL), the current results are:

Throughput (`checksum=off`)

terminals | md + fpw=on | md + fpw=off | Umbra + fpw=on
----------+-------------+--------------+----------------
       10 |      158709 |       154283 |         155781
       50 |      577005 |       626954 |         656353
      200 |      641899 |       981436 |         995635
      500 |      322660 |       943295 |         859058
     1000 |      275609 |       899631 |         729989

Throughput (`checksum=on`)

terminals | md + fpw=on | md + fpw=off | Umbra + fpw=on
----------+-------------+--------------+----------------
       10 |      155754 |       152025 |         150606
       50 |      601974 |       635597 |         650844
      200 |      621176 |      1015923 |         938311
      500 |      316950 |       972795 |         729801
     1000 |      282713 |       891770 |         674865

WAL size ratio (`md + fpw=on` / `Umbra + fpw=on`)

terminals | checksum=on | checksum=off
----------+-------------+--------------
       10 |        1.82 |         2.03
       50 |        2.11 |         2.51
      200 |        3.81 |         5.22
      500 |        4.58 |         6.90
     1000 |        4.87 |         6.55

At 1000 terminals, Umbra recovers roughly 73% (`checksum=off`) or 64%
(`checksum=on`) of the throughput gap between `md + fpw=on` and
`md + fpw=off`, while reducing WAL volume by roughly 4.9x
(`checksum=on`) or 6.6x (`checksum=off`).
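
The gap-recovery figure is a simple ratio and can be rechecked against the tables. A minimal sketch follows; `gap_recovered` and `wal_saved` are illustrative helper names, and the rows are copied from the throughput and WAL-ratio tables above.

```python
# Rows copied from the throughput tables above:
# (md + fpw=on, md + fpw=off, Umbra + fpw=on)
THROUGHPUT = {
    ("off", 500): (322660, 943295, 859058),
    ("off", 1000): (275609, 899631, 729989),
    ("on", 500): (316950, 972795, 729801),
    ("on", 1000): (282713, 891770, 674865),
}


def gap_recovered(md_on, md_off, umbra):
    """Fraction of the (md fpw=off - md fpw=on) gap that Umbra closes."""
    return (umbra - md_on) / (md_off - md_on)


def wal_saved(ratio):
    """Turn an 'md WAL / Umbra WAL' ratio into the fraction of WAL avoided."""
    return 1.0 - 1.0 / ratio


for (checksum, terminals), row in sorted(THROUGHPUT.items()):
    print(f"checksum={checksum} terminals={terminals}: "
          f"gap recovered {gap_recovered(*row):.1%}")

# WAL ratios at 1000 terminals, from the WAL size ratio table.
for checksum, ratio in (("on", 4.87), ("off", 6.55)):
    print(f"checksum={checksum}: {wal_saved(ratio):.1%} less WAL than md")
```

Reading the WAL ratio as a saved fraction is only a restatement of the same number: a ratio of r means Umbra wrote 1/r of the WAL that md wrote.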

The `md + fpw=off` numbers should be read only as a sensitivity /
upper-bound reference, not as a correctness-equivalent baseline.

Known follow-up work still includes:

- deeper host-tree engineering around AIO
- `CREATE DATABASE` `WAL_LOG` copy path
- stronger primary/standby physical-page alignment validation
- more complete production-grade space management
- an explicit upper-layer owner model for `range-born / batch mapping
publish`

The last point is worth calling out explicitly: the current prototype
has internal range-shaped lifecycle operations, but it does not yet
claim a generic upper-layer `RangeMap` contract. I do not believe
that should be introduced without a clear upper-layer use site and
owner model.

For personal reasons, my availability for sustained follow-up may be
limited for some time. Rather than leave this work in a private or
half-documented state, I would prefer to put the current PoC and
design notes in front of the community while they are still coherent
and runnable.

If the direction looks interesting, I would welcome discussion,
criticism, or a future maintainer/collaborator willing to continue the
engineering work from this baseline.

Repository and design notes:

https://github.com/nayishan/postgre_umbra/tree/umbra-poc-pgmaster

Regards,
Mingwei Jia
i@nayishan.top

Mingwei Jia

i@nayishan.top

22 days ago

In reply to: Mingwei Jia (#1)

[RFC PATCH v2 RESEND 0/10] Umbra: a remap-aware smgr prototype

Hi hackers,

This is a RESEND of the RFC v2 Umbra patch series, sent as a standard
threaded patch series.

The previous v2 attempt was sent as a single cover-letter message with
10 patch attachments. It was held for moderation and did not appear in
the public pgsql-hackers archives. This resend keeps the same code and
review scope, but sends the patches in the usual 0/10 .. 10/10 form.

This is a follow-up to my previous PoC note about Umbra, sent on
April 24, 2026:

https://www.mail-archive.com/pgsql-hackers%40lists.postgresql.org/msg227384.html

Umbra is an smgr-layer prototype on PostgreSQL master. It is not a
table AM proposal, and it does not try to introduce a separate storage
engine abstraction.

The central idea is still the same as in the previous note: decouple
logical block identity from physical page placement by maintaining
lblk -> pblk translation in Umbra metadata, so that ordinary data-page
updates after checkpoint do not have to rely on PostgreSQL's default
full-page-image path in the same way.

This is an RFC / proof-of-concept patch series, not a merge-ready
submission. I am mainly looking for feedback on these questions:

1. Umbra is built around the metadata fork. Is the current metadata-fork
layout and update protocol a plausible way to avoid a recursive
dependency on PostgreSQL's ordinary full-page-write mechanism? In
other words, can the metadata fork itself remain crash-safe without
introducing nested FPW requirements?

2. Is the smgr layer an acceptable boundary for this experiment, or
should this idea be discussed at a different layer?

3. Is the WAL / remap / redo correctness model conceptually sound,
especially the way remap state is published to WAL and reconstructed
before replaying page contents?

4. Should the patch series be split differently before deeper review?
If so, which part would make the best first independently reviewable
patch?

The foreground/background split is also unchanged from the previous
description. The current PoC deliberately uses a conservative reclaim
policy: foreground paths allocate physical pages monotonically and avoid
synchronous reclaim, defragmentation, or relocation in user write paths.
mapwriter handles MAP-page flushing and physical preallocation, while
mapcompactor handles longer-term reclaim and compaction work.

This should not be read as claiming that Umbra's overall correctness
model is simple. The point is narrower: the conservative reclaim policy
reduces the number of foreground/checkpoint/WAL-redo/reclaim
interleavings that the prototype has to support initially, while leaving
more aggressive space-convergence work to later engineering.
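
The foreground/background split can be pictured with a small sketch. Everything here is a simplification with hypothetical names (`ToyAllocator`, `retire`, `reclaim`); in particular, the fallback where the foreground path grows capacity by one block is an assumption of this toy, not a statement about Umbra. It only shows the policy described above: the foreground path bumps a monotonic frontier and never reuses space, while reclaim happens in a separate background step.

```python
from collections import deque


class ToyAllocator:
    """Monotonic foreground allocation with deferred background reclaim."""

    def __init__(self):
        self.frontier = 0        # next never-used physical block
        self.capacity = 0        # physical blocks already preallocated
        self.retired = deque()   # superseded pblks awaiting reclaim
        self.free = set()        # pblks reclaimed by the background step

    def allocate(self):
        """Foreground path: bump the frontier; never reuse or relocate."""
        if self.frontier >= self.capacity:
            # Toy fallback (assumption): extend by one block when
            # background preallocation has fallen behind.
            self.capacity = self.frontier + 1
        pblk = self.frontier
        self.frontier += 1
        return pblk

    def retire(self, pblk):
        """Foreground path: only record that a block was superseded."""
        self.retired.append(pblk)

    def preallocate(self, nblocks):
        """Background, mapwriter-like step: extend physical capacity."""
        self.capacity += nblocks

    def reclaim(self, limit):
        """Background, mapcompactor-like step: reclaim up to `limit` blocks."""
        reclaimed = 0
        while self.retired and reclaimed < limit:
            self.free.add(self.retired.popleft())
            reclaimed += 1
        return reclaimed
```

Because `allocate` never consults `free`, the set of states the foreground path can observe stays small; reusing reclaimed blocks would be a later, separate policy decision.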

The verification state is the same as described in the previous note:
the final tip has passed `make check` and
`make -C src/test/recovery check` in both regular and `--with-umbra`
builds. The strict per-patch boundary remains:

- P1-P5: all four matrix items pass
- P6: MD make check / MD recovery / UMBRA make check pass, but UMBRA
recovery does not pass at that boundary
- P7-P9: all four matrix items pass

That boundary is intentional in the current decomposition: P6 establishes
the WAL record / birth / basic redo state-machine layer, while P7 closes
the ordinary remap / block-reference remap / checkpoint-boundary
replacement loop.

The performance numbers from the previous note should still be read only
as directional PoC signals, not as final benchmark claims. The
`md + full_page_writes=off` numbers are only a sensitivity / upper-bound
reference, not a correctness-equivalent baseline.

The repository remains available here as a supplementary reference:

https://github.com/nayishan/postgre_umbra/tree/umbra-poc-pgmaster

Mingwei Jia (10):
  umbra: add patch 0 design notes and repository navigation
  umbra: add patch 1 smgr implementation boundary
  umbra: add patch 2 umfile physical file manager and metadata storage
    primitives
  umbra: add patch 3 metadata disk format and identity mapping bootstrap
  umbra: add patch 4 shared-memory MAP cache and checkpoint flush
  umbra: add patch 5 MAP access policy, translation, and materialization
  umbra: add patch 6 WAL records, mapped birth, and redo state machine
  umbra: add patch 7 checkpoint-boundary FPW replacement and
    block-reference remap
  umbra: add patch 8 checkpoint/mapwriter writeback and physical
    preallocation
  umbra: add patch 9 compactor framework and non-interference policy
README.md | 261 +-
README_ZH.md | 241 ++
configure | 38 +
configure.ac | 10 +
doc/umbra/ARCHITECTURE.md | 437 +++
doc/umbra/ARCHITECTURE_ZH.md | 282 ++
doc/umbra/PROTOTYPE.md | 86 +
doc/umbra/PROTOTYPE_ZH.md | 74 +
doc/umbra/REVIEW_GUIDE.md | 210 ++
doc/umbra/REVIEW_GUIDE_ZH.md | 133 +
doc/umbra/UMBRA_FPW_STORY.md | 708 +++++
doc/umbra/UMBRA_FPW_STORY_ZH.md | 500 ++++
doc/umbra/WAL_AND_REDO.md | 419 +++
doc/umbra/WAL_AND_REDO_ZH.md | 248 ++
meson.build | 1 +
meson_options.txt | 3 +
src/Makefile.global.in | 1 +
src/backend/access/brin/brin.c | 2 +-
src/backend/access/brin/brin_pageops.c | 4 +-
src/backend/access/brin/brin_revmap.c | 2 +-
src/backend/access/gin/gindatapage.c | 2 +-
src/backend/access/gin/ginfast.c | 8 +-
src/backend/access/gin/ginutil.c | 2 +-
src/backend/access/gist/gistxlog.c | 2 +-
src/backend/access/hash/hashovfl.c | 4 +-
src/backend/access/hash/hashpage.c | 16 +-
src/backend/access/heap/heapam.c | 6 +-
src/backend/access/heap/heapam_handler.c | 10 +-
src/backend/access/nbtree/nbtinsert.c | 8 +-
src/backend/access/nbtree/nbtpage.c | 14 +-
src/backend/access/rmgrdesc/Makefile | 5 +
src/backend/access/rmgrdesc/meson.build | 6 +
src/backend/access/rmgrdesc/umbradesc.c | 116 +
src/backend/access/rmgrdesc/xlogdesc.c | 25 +
src/backend/access/spgist/spgdoinsert.c | 14 +-
src/backend/access/transam/Makefile | 5 +
src/backend/access/transam/meson.build | 6 +
src/backend/access/transam/rmgr.c | 3 +
src/backend/access/transam/umbra_xlog.c | 366 +++
src/backend/access/transam/xlog.c | 6 +
src/backend/access/transam/xloginsert.c | 744 ++++-
src/backend/access/transam/xlogreader.c | 40 +
src/backend/access/transam/xlogutils.c | 560 +++-
src/backend/backup/basebackup.c | 22 +-
src/backend/catalog/storage.c | 198 +-
src/backend/commands/dbcommands.c | 19 +
src/backend/commands/sequence.c | 6 +-
src/backend/commands/tablecmds.c | 11 +-
src/backend/common.mk | 2 +-
src/backend/postmaster/Makefile | 6 +
src/backend/postmaster/bgworker.c | 14 +
src/backend/postmaster/mapcompactor.c | 151 +
src/backend/postmaster/mapwriter.c | 198 ++
src/backend/postmaster/meson.build | 7 +
src/backend/postmaster/postmaster.c | 7 +
src/backend/storage/Makefile | 5 +
src/backend/storage/buffer/bufmgr.c | 14 +-
src/backend/storage/map/Makefile | 25 +
src/backend/storage/map/map.c | 1547 ++++++++++
src/backend/storage/map/mapbgproc.c | 1063 +++++++
src/backend/storage/map/mapbuf.c | 428 +++
src/backend/storage/map/mapclock.c | 464 +++
src/backend/storage/map/mapflush.c | 665 +++++
src/backend/storage/map/mapinflight.c | 402 +++
src/backend/storage/map/mapinit.c | 239 ++
src/backend/storage/map/mapsuper.c | 1789 +++++++++++
src/backend/storage/map/meson.build | 12 +
src/backend/storage/meson.build | 3 +
src/backend/storage/smgr/Makefile | 9 +
src/backend/storage/smgr/bulk_write.c | 53 +-
src/backend/storage/smgr/md.c | 1 +
src/backend/storage/smgr/meson.build | 7 +
src/backend/storage/smgr/smgr.c | 359 ++-
src/backend/storage/smgr/umbra.c | 2659 +++++++++++++++++
src/backend/storage/smgr/umfile.c | 2613 ++++++++++++++++
src/backend/storage/sync/sync.c | 113 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/adt/dbsize.c | 14 +-
src/backend/utils/adt/pgstatfuncs.c | 25 +
src/backend/utils/cache/relcache.c | 12 +-
src/backend/utils/init/postinit.c | 8 +
src/backend/utils/misc/guc_parameters.dat | 202 ++
src/backend/utils/misc/guc_tables.c | 2 +
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/bin/pg_waldump/.gitignore | 1 +
src/bin/pg_waldump/Makefile | 9 +
src/bin/pg_waldump/rmgrdesc.c | 3 +
src/include/access/rmgrlist.h | 3 +
src/include/access/umbra_xlog.h | 109 +
src/include/access/xloginsert.h | 4 +
src/include/access/xlogreader.h | 11 +
src/include/access/xlogrecord.h | 49 +
src/include/access/xlogutils.h | 3 +
src/include/catalog/pg_proc.dat | 20 +
src/include/catalog/storage.h | 1 +
src/include/pg_config.h.in | 3 +
src/include/postmaster/mapwriter.h | 28 +
src/include/storage/aio_types.h | 3 +-
src/include/storage/lwlocklist.h | 1 +
src/include/storage/map.h | 323 ++
src/include/storage/map_internal.h | 59 +
src/include/storage/mapsuper.h | 100 +
src/include/storage/mapsuper_internal.h | 174 ++
src/include/storage/smgr.h | 37 +-
src/include/storage/subsystemlist.h | 3 +
src/include/storage/sync.h | 2 +
src/include/storage/um_defs.h | 51 +
src/include/storage/umbra.h | 179 ++
src/include/storage/umfile.h | 122 +
src/test/recovery/meson.build | 22 +
.../t/053_umbra_map_superblock_watermark.pl | 104 +
.../recovery/t/054_umbra_map_fork_policy.pl | 62 +
.../t/055_umbra_mapwriter_activity.pl | 56 +
.../t/056_umbra_truncate_superblock.pl | 82 +
.../t/057_umbra_remap_crash_consistency.pl | 74 +
.../t/058_umbra_2pc_remap_recovery.pl | 90 +
.../t/059_umbra_compactor_relocation.pl | 91 +
.../060_umbra_reclaim_checkpoint_counters.pl | 82 +
.../t/061_umbra_fsm_vm_map_translation.pl | 117 +
.../t/062_umbra_truncate_drop_crash_matrix.pl | 108 +
...3_umbra_mainfork_head_unlink_checkpoint.pl | 60 +
...64_umbra_mainfork_internal_reclaim_seg0.pl | 283 ++
...umbra_mainfork_middle_reclaim_keep_seg0.pl | 356 +++
.../recovery/t/066_umbra_truncate_redo.pl | 64 +
src/test/recovery/t/067_umbra_remap_redo.pl | 90 +
...68_umbra_old_baseline_checkpoint_window.pl | 85 +
.../t/069_umbra_range_remap_zeroextend.pl | 101 +
.../t/070_umbra_hash_birth_block_remap.pl | 66 +
.../t/071_umbra_skip_wal_dense_map.pl | 65 +
.../t/072_umbra_ordinary_slim_block_remap.pl | 69 +
.../recovery/t/073_umbra_preallocate_guc.pl | 74 +
.../recovery/t/074_umbra_torn_page_remap.pl | 261 ++
132 files changed, 22588 insertions(+), 181 deletions(-)
create mode 100644 README_ZH.md
create mode 100644 doc/umbra/ARCHITECTURE.md
create mode 100644 doc/umbra/ARCHITECTURE_ZH.md
create mode 100644 doc/umbra/PROTOTYPE.md
create mode 100644 doc/umbra/PROTOTYPE_ZH.md
create mode 100644 doc/umbra/REVIEW_GUIDE.md
create mode 100644 doc/umbra/REVIEW_GUIDE_ZH.md
create mode 100644 doc/umbra/UMBRA_FPW_STORY.md
create mode 100644 doc/umbra/UMBRA_FPW_STORY_ZH.md
create mode 100644 doc/umbra/WAL_AND_REDO.md
create mode 100644 doc/umbra/WAL_AND_REDO_ZH.md
create mode 100644 src/backend/access/rmgrdesc/umbradesc.c
create mode 100644 src/backend/access/transam/umbra_xlog.c
create mode 100644 src/backend/postmaster/mapcompactor.c
create mode 100644 src/backend/postmaster/mapwriter.c
create mode 100644 src/backend/storage/map/Makefile
create mode 100644 src/backend/storage/map/map.c
create mode 100644 src/backend/storage/map/mapbgproc.c
create mode 100644 src/backend/storage/map/mapbuf.c
create mode 100644 src/backend/storage/map/mapclock.c
create mode 100644 src/backend/storage/map/mapflush.c
create mode 100644 src/backend/storage/map/mapinflight.c
create mode 100644 src/backend/storage/map/mapinit.c
create mode 100644 src/backend/storage/map/mapsuper.c
create mode 100644 src/backend/storage/map/meson.build
create mode 100644 src/backend/storage/smgr/umbra.c
create mode 100644 src/backend/storage/smgr/umfile.c
create mode 100644 src/include/access/umbra_xlog.h
create mode 100644 src/include/postmaster/mapwriter.h
create mode 100644 src/include/storage/map.h
create mode 100644 src/include/storage/map_internal.h
create mode 100644 src/include/storage/mapsuper.h
create mode 100644 src/include/storage/mapsuper_internal.h
create mode 100644 src/include/storage/um_defs.h
create mode 100644 src/include/storage/umbra.h
create mode 100644 src/include/storage/umfile.h
create mode 100644 src/test/recovery/t/053_umbra_map_superblock_watermark.pl
create mode 100644 src/test/recovery/t/054_umbra_map_fork_policy.pl
create mode 100644 src/test/recovery/t/055_umbra_mapwriter_activity.pl
create mode 100644 src/test/recovery/t/056_umbra_truncate_superblock.pl
create mode 100644 src/test/recovery/t/057_umbra_remap_crash_consistency.pl
create mode 100644 src/test/recovery/t/058_umbra_2pc_remap_recovery.pl
create mode 100644 src/test/recovery/t/059_umbra_compactor_relocation.pl
create mode 100644 src/test/recovery/t/060_umbra_reclaim_checkpoint_counters.pl
create mode 100644 src/test/recovery/t/061_umbra_fsm_vm_map_translation.pl
create mode 100644 src/test/recovery/t/062_umbra_truncate_drop_crash_matrix.pl
create mode 100644 src/test/recovery/t/063_umbra_mainfork_head_unlink_checkpoint.pl
create mode 100644 src/test/recovery/t/064_umbra_mainfork_internal_reclaim_seg0.pl
create mode 100644 src/test/recovery/t/065_umbra_mainfork_middle_reclaim_keep_seg0.pl
create mode 100644 src/test/recovery/t/066_umbra_truncate_redo.pl
create mode 100644 src/test/recovery/t/067_umbra_remap_redo.pl
create mode 100644 src/test/recovery/t/068_umbra_old_baseline_checkpoint_window.pl
create mode 100644 src/test/recovery/t/069_umbra_range_remap_zeroextend.pl
create mode 100644 src/test/recovery/t/070_umbra_hash_birth_block_remap.pl
create mode 100644 src/test/recovery/t/071_umbra_skip_wal_dense_map.pl
create mode 100644 src/test/recovery/t/072_umbra_ordinary_slim_block_remap.pl
create mode 100644 src/test/recovery/t/073_umbra_preallocate_guc.pl
create mode 100644 src/test/recovery/t/074_umbra_torn_page_remap.pl

--
2.50.1 (Apple Git-155)

Mingwei Jia

i@nayishan.top

22 days ago

In reply to: Mingwei Jia (#2)

[RFC PATCH v2 RESEND 01/10] umbra: add patch 0 design notes and repository navigation

---
README.md | 261 +++++++++++-
README_ZH.md | 241 +++++++++++
doc/umbra/ARCHITECTURE.md | 437 ++++++++++++++++++++
doc/umbra/ARCHITECTURE_ZH.md | 282 +++++++++++++
doc/umbra/PROTOTYPE.md | 86 ++++
doc/umbra/PROTOTYPE_ZH.md | 74 ++++
doc/umbra/REVIEW_GUIDE.md | 210 ++++++++++
doc/umbra/REVIEW_GUIDE_ZH.md | 133 ++++++
doc/umbra/UMBRA_FPW_STORY.md | 708 ++++++++++++++++++++++++++++++++
doc/umbra/UMBRA_FPW_STORY_ZH.md | 500 ++++++++++++++++++++++
doc/umbra/WAL_AND_REDO.md | 419 +++++++++++++++++++
doc/umbra/WAL_AND_REDO_ZH.md | 248 +++++++++++
12 files changed, 3583 insertions(+), 16 deletions(-)
create mode 100644 README_ZH.md
create mode 100644 doc/umbra/ARCHITECTURE.md
create mode 100644 doc/umbra/ARCHITECTURE_ZH.md
create mode 100644 doc/umbra/PROTOTYPE.md
create mode 100644 doc/umbra/PROTOTYPE_ZH.md
create mode 100644 doc/umbra/REVIEW_GUIDE.md
create mode 100644 doc/umbra/REVIEW_GUIDE_ZH.md
create mode 100644 doc/umbra/UMBRA_FPW_STORY.md
create mode 100644 doc/umbra/UMBRA_FPW_STORY_ZH.md
create mode 100644 doc/umbra/WAL_AND_REDO.md
create mode 100644 doc/umbra/WAL_AND_REDO_ZH.md

diff --git a/README.md b/README.md
index f6104c038b..44bf57c782 100644
--- a/README.md
+++ b/README.md
@@ -1,21 +1,250 @@
-PostgreSQL Database Management System
-=====================================
+# Umbra on PostgreSQL master

-This directory contains the source code distribution of the PostgreSQL
-database management system.
+[English](./README.md) | [中文](./README_ZH.md)

-PostgreSQL is an advanced object-relational database management system
-that supports an extended subset of the SQL standard, including
-transactions, foreign keys, subqueries, triggers, user-defined types
-and functions.  This distribution also contains C language bindings.
+This repository hosts the current Umbra prototype on top of PostgreSQL master.

-Copyright and license information can be found in the file COPYRIGHT.
+Umbra is a storage-manager variant in which selected relation forks keep
+ordinary PostgreSQL logical block numbers at the upper layers, but are stored
+through an internal logical-to-physical mapping layer underneath.  MAIN, FSM,
+and VM can therefore still be addressed as logical blocks while Umbra
+translates them to physical blocks stored in data-fork files.

-General documentation about this version of PostgreSQL can be found at
-<https://www.postgresql.org/docs/devel/>.  In particular, information
-about building PostgreSQL from the source code can be found at
-<https://www.postgresql.org/docs/devel/installation.html>.
+In this model, a remap means moving one logical block from its old physical
+block to a newly published physical block.  The purpose of that remap is to
+give ordinary checkpoint-boundary updates a different recovery baseline.
+Instead of overwriting the old physical page and logging a full-page image just
+to protect that overwrite, Umbra can publish a new physical page for the same
+logical block and record the old/new physical mapping in WAL.  During redo, a
+remap record is replayed through the mapping view expected by that record; a
+delta-only remap uses the old physical page plus WAL delta instead of treating
+the update as overwrite-in-place on the new physical page.  This is the
+mechanism Umbra uses to reduce ordinary full-page-image pressure while
+preserving crash-recovery ordering.

-The latest version of this software, and related software, may be
-obtained at <https://www.postgresql.org/download/>.  For more information
-look at our web site located at <https://www.postgresql.org/>.
+This branch family is correctness-first.  It is useful for design review,
+implementation reading, and testing.  It is not presented as a finished
+production feature.
+
+## Branch Layout
+
+- `umbra-poc-pgmaster`
+  - PostgreSQL master based Umbra PoC
+  - full implementation branch for full-tree reading and testing
+- `shadow-pg12-archive`
+  - archived PostgreSQL 12.2 shadow prototype
+
+## Current Scope
+
+The current implementation includes:
+
+- a `--with-umbra` build option and Umbra `smgr` integration
+- an internal metadata fork per relation that stores:
+  - the MAP superblock, which records fork-level state such as:
+    - logical EOF
+    - physical capacity
+    - committed allocator frontier
+  - MAP pages, which record per-block mapping facts:
+    - `lblk -> pblk` entries
+    - unmapped versus mapped state for ordinary logical blocks
+- a MAP subsystem that owns:
+  - logical-to-physical lookup
+  - shared superblock state and related runtime state for:
+    - logical EOF
+    - allocator/frontier state
+    - reclaim boundaries
+- two background workers:
+  - `mapwriter`
+    - MAP-page flush
+    - preallocation
+  - `mapcompactor`
+    - reclaim
+    - compaction
+- remap-aware WAL/redo support, including:
+  - remap-aware block headers on ordinary WAL records
+  - redo-side remap interpretation in `xlogutils.c`
+- Umbra recovery TAP coverage in `src/test/recovery`
+
+## Design In One Page
+
+Umbra should be read as a storage-layer split with six distinct pieces.
+
+1. Upper PostgreSQL layers keep ordinary logical addressing.
+   Relations, forks, and block numbers are still presented as logical objects
+   to normal PostgreSQL callers.  Umbra changes physical placement underneath
+   `smgr`; it does not ask upper layers to reason in physical block numbers.
+
+2. Persistent truth lives in the metadata fork.
+   Each relation has an internal metadata fork containing:
+   - a MAP superblock for fork-level facts such as logical EOF, physical
+     capacity, and the committed allocator frontier; this superblock is stored
+     as a small 512-byte metadata sector
+   - MAP pages for per-block `lblk -> pblk` mapping facts; these pages are
+     compact fixed-entry metadata pages, closer to CLOG-style metadata than to
+     ordinary PostgreSQL data pages, so they do not use ordinary data-page FPW
+     semantics
+
+3. Runtime access is split from physical file I/O.
+   - `umbra.c` owns mapped-fork runtime semantics.
+   - the MAP subsystem owns lookup and shared runtime state.
+   - `umfile.c` owns physical file and segment operations.
+
+4. WAL is the owner boundary for MAP state changes.
+   - fork-level superblock facts become redo-visible through WAL.
+   - physical-page lifecycle transitions become redo-visible through WAL.
+   - logical-to-physical mapping changes become redo-visible through WAL.
+   Ordinary block references carry page-replay remap metadata; Umbra rmgr
+   records cover explicit MAP lifecycle actions outside ordinary block
+   references.
+
+5. Redo replays remap records through the record's expected mapping view.
+   - redo first restores the old/new mapping view carried by the WAL record.
+   - without an image, replay uses the old physical page plus WAL delta, not
+     overwrite-in-place on the new physical page.
+   - with an image, redo installs the image into the newly published mapping.
+
+6. Background maintenance stays separate from the foreground access path.
+   - `mapwriter` handles MAP-page flush and preallocation
+   - `mapcompactor` handles reclaim and compaction
+   This keeps long-term space convergence out of the hot foreground allocation
+   path.
+
+## Documentation
+
+Detailed design notes live under [doc/umbra/](./doc/umbra/).
+
+Primary English documents:
+
+- [Architecture](./doc/umbra/ARCHITECTURE.md)
+- [WAL and Redo](./doc/umbra/WAL_AND_REDO.md)
+- [Review Guide](./doc/umbra/REVIEW_GUIDE.md)
+- [Prototype and Branch Navigation](./doc/umbra/PROTOTYPE.md)
+- [FPW-to-remap design story](./doc/umbra/UMBRA_FPW_STORY.md)
+
+Chinese companion material:
+
+- [Architecture](./doc/umbra/ARCHITECTURE_ZH.md)
+- [WAL and Redo](./doc/umbra/WAL_AND_REDO_ZH.md)
+- [Review Guide](./doc/umbra/REVIEW_GUIDE_ZH.md)
+- [Prototype and Branch Navigation](./doc/umbra/PROTOTYPE_ZH.md)
+- [FPW-to-remap 设计故事](./doc/umbra/UMBRA_FPW_STORY_ZH.md)
+
+## Testing Baseline
+
+The current correctness baseline is the md/Umbra matrix below.  When switching
+between modes in the same source tree, clean the previous build first.
+
+```sh
+make distclean
+./configure
+make
+make check
+make -C src/test/recovery check
+
+make distclean
+./configure --with-umbra
+make
+make check
+make -C src/test/recovery check
+```
+
+One especially important recovery test is:
+
+```sh
+make -C src/test/recovery check PROVE_TESTS=t/074_umbra_torn_page_remap.pl
+```
+
+This test acts as a negative control in md mode and validates torn-page remap
+recovery in Umbra mode.
+
+## Preliminary Performance Indicators
+
+Current performance evidence is directional only.  Two early signals are worth
+showing together:
+
+- TPCC-style throughput under the same workload
+- WAL-size ratio under the same workload
+
+The throughput view matters because WAL-size reduction alone does not fully
+describe performance.  The fair default baseline is:
+
+- `md + fpw=on`
+
+The `md + fpw=off` numbers are useful as a sensitivity / upper-bound reference,
+not as a correctness-equivalent baseline.
+
+Common settings:
+
+- `checkpoint_timeout = 2min`
+- `max_wal_size = 20GB`
+- `shared_buffers = 50GB`
+- `logging_collector = on`
+- `runMins = 10`
+- `newOrderWeight = 45`
+- `paymentWeight = 43`
+- `deliveryWeight = 4`
+- `stockLevelWeight = 4`
+- `orderStatusWeight = 4`
+
+### TPCC-Style Throughput
+
+#### Checksums Disabled
+
+| clients | `md + fpw=on` | `md + fpw=off` | `Umbra + fpw=on` |
+| ------- | ------------: | -------------: | ---------------: |
+| 10      |        158709 |         154283 |           155781 |
+| 50      |        577005 |         626954 |           656353 |
+| 200     |        641899 |         981436 |           995635 |
+| 500     |        322660 |         943295 |           859058 |
+| 1000    |        275609 |         899631 |           729989 |
+
+#### Checksums Enabled
+
+| clients | `md + fpw=on` | `md + fpw=off` | `Umbra + fpw=on` |
+| ------- | ------------: | -------------: | ---------------: |
+| 10      |        155754 |         152025 |           150606 |
+| 50      |        601974 |         635597 |           650844 |
+| 200     |        621176 |        1015923 |           938311 |
+| 500     |        316950 |         972795 |           729801 |
+| 1000    |        282713 |         891770 |           674865 |
+
+### WAL-Size Ratio
+
+- `md WAL bytes with full_page_writes=on`
+- divided by
+- `Umbra WAL bytes with full_page_writes=on`
+
+Larger values mean Umbra generated less WAL for the same workload.
+
+#### Checksums Disabled
+
+| clients | md WAL / Umbra WAL |
+| ------- | ------------------ |
+| 10      | 2.03               |
+| 50      | 2.51               |
+| 200     | 5.22               |
+| 500     | 6.90               |
+| 1000    | 6.55               |
+
+#### Checksums Enabled
+
+| clients | md WAL / Umbra WAL |
+| ------- | ------------------ |
+| 10      | 1.82               |
+| 50      | 2.11               |
+| 200     | 3.81               |
+| 500     | 4.58               |
+| 1000    | 4.87               |
+
+Taken together, the throughput and WAL-size numbers show that Umbra is not only
+reducing WAL volume.  Under the same workload, it also recovers a large part of
+the throughput lost to ordinary checkpoint-boundary full-page-image pressure.
+
+These numbers should be read as:
+
+- preliminary
+- directional
+- not yet a complete benchmark
+
+They should not be read as a final claim about throughput, latency, or full
+replication/recovery cost.
diff --git a/README_ZH.md b/README_ZH.md
new file mode 100644
index 0000000000..dee95115d2
--- /dev/null
+++ b/README_ZH.md
@@ -0,0 +1,241 @@
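+# Umbra prototype notes on PostgreSQL master
+
+[English](./README.md) | [中文](./README_ZH.md)
+
+This repository hosts the current Umbra prototype based on PostgreSQL `master`.
+
+Umbra can be understood as an extension layer on PostgreSQL's storage manager: upper
+layers still access data by ordinary logical block numbers, while an internal
+logical-to-physical mapping underneath writes the contents of selected forks to actual
+physical blocks.  The `MAIN`, `FSM`, and `VM` forks still look like ordinary logical blocks from above, and Umbra handles physical layout changes below.
+
+In this model, a remap means switching one logical block from its old physical block to
+a newly published physical block.  Its immediate purpose is to give ordinary
+checkpoint-boundary updates a different recovery baseline.  The traditional `md` path
+overwrites the old physical page and therefore needs a full-page image to protect that
+overwrite; Umbra can instead publish a new physical page for the same logical block and
+record the old/new physical mapping in WAL.  During redo, a remap record must be replayed
+through the mapping view that record expects; a delta-only remap uses "old physical page + WAL delta" rather than treating the update as an in-place overwrite
+on the new physical page.  This is what lets Umbra reduce ordinary full-page-image pressure while preserving crash-recovery ordering.
+
+The goal of this branch is to establish correctness, design boundaries, and verifiability first.  It is suitable for
+design review, implementation reading, and testing, but should not be presented as a finished production feature.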
+
+## Branch Layout
+
+- `umbra-poc-pgmaster`
+  - Umbra prototype branch based on PostgreSQL `master`
+  - for full source reading and testing
+- `shadow-pg12-archive`
+  - archived branch of the PostgreSQL 12.2-era `shadow` prototype
+
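+## Current Implementation Scope
+
+The current implementation includes:
+
+- the `--with-umbra` build option and Umbra's integration at the `smgr` layer
+- an internal `metadata fork` per relation that stores:
+  - the MAP superblock, which records fork-level state such as:
+    - logical end of file
+    - materialized physical capacity
+    - committed allocation frontier
+  - ordinary MAP pages, which record per-block mapping facts:
+    - `lblk -> pblk` entries
+    - whether an ordinary logical block currently has a mapping
+- a MAP subsystem responsible for:
+  - logical-to-physical block lookup
+  - management of shared superblock state and related runtime state, including:
+    - logical end of file
+    - allocation frontier
+    - reclaim boundaries
+- two background processes:
+  - `mapwriter`
+    - MAP-page flush
+    - physical-space preallocation
+  - `mapcompactor`
+    - reclaim
+    - compaction
+- WAL/recovery support built around remap and redo, including:
+  - remap metadata on ordinary WAL records
+  - remap interpretation and recovery paths in `xlogutils.c`
+- Umbra recovery TAP tests under `src/test/recovery`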
+
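+## One-Page Design Summary
+
+Umbra can be understood as a storage-layer split made up of six layers.
+
+1. Upper PostgreSQL layers keep ordinary logical addressing.
+   For ordinary PostgreSQL callers, relations, forks, and block numbers are
+   still used with logical semantics.  Umbra changes the physical layout below
+   `smgr`; it does not require upper layers to handle physical block numbers
+   directly.
+
+2. Persistent truth lives in the `metadata fork`.
+   Each relation has an internal `metadata fork` containing:
+   - a MAP superblock that records fork-level facts such as logical EOF,
+     materialized physical capacity, and the committed allocator frontier;
+     this superblock is a small `512B` metadata sector
+   - a set of MAP pages that record per-block `lblk -> pblk` mapping facts;
+     these are compact metadata pages made of fixed-size entries, closer in
+     form to CLOG-style metadata than to ordinary PostgreSQL data pages, so
+     they do not follow ordinary data-page FPW semantics
+
+3. Runtime access paths and physical file I/O are explicitly layered.
+   - `umbra.c` owns runtime access semantics for mapped forks.
+   - The MAP subsystem owns lookup and shared runtime state.
+   - `umfile.c` owns the actual physical file and segment operations.
+
+4. WAL is the owner boundary for MAP state changes.
+   - Fork-level superblock facts become redo-visible state through WAL.
+   - Physical page lifecycle transitions become redo-visible state through WAL.
+   - Logical-to-physical mapping changes become redo-visible state through WAL.
+   Ordinary block references carry the remap metadata needed for page replay;
+   explicit MAP lifecycle actions outside ordinary block references are
+   expressed as Umbra rmgr records.
+
+5. Redo replays remaps against the mapping view the WAL record expects.
+   - Redo first restores the old/new mapping view carried by the record.
+   - Without an image, the replay baseline is "old physical page + WAL delta",
+     not an in-place overwrite of the new physical page.
+   - With an image, redo installs the image on the newly published mapping.
+
+6. Background maintenance is separate from foreground access paths.
+   - `mapwriter` handles MAP page flush and preallocation
+   - `mapcompactor` handles reclaim and compaction
+   This keeps long-running space convergence out of the foreground hot path.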
+
+## Documentation
+
+More detailed design notes live in [doc/umbra/](./doc/umbra/).
+
+Main English documents:
+
+- [Architecture](./doc/umbra/ARCHITECTURE.md)
+- [WAL and Redo](./doc/umbra/WAL_AND_REDO.md)
+- [Review Guide](./doc/umbra/REVIEW_GUIDE.md)
+- [Prototype and Branch Navigation](./doc/umbra/PROTOTYPE.md)
+- [FPW-to-remap design story](./doc/umbra/UMBRA_FPW_STORY.md)
+
+Chinese companion material:
+
+- [Architecture](./doc/umbra/ARCHITECTURE_ZH.md)
+- [WAL and Redo](./doc/umbra/WAL_AND_REDO_ZH.md)
+- [Review Guide](./doc/umbra/REVIEW_GUIDE_ZH.md)
+- [Prototype and Branch Navigation](./doc/umbra/PROTOTYPE_ZH.md)
+- [FPW-to-remap design story](./doc/umbra/UMBRA_FPW_STORY_ZH.md)
+
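+## Test Baseline
+
+The current correctness baseline is the md/Umbra dual-mode matrix.  When
+switching build modes in the same source tree, clean the previous build first.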
+
+```sh
+make distclean
+./configure
+make
+make check
+make -C src/test/recovery check
+
+make distclean
+./configure --with-umbra
+make
+make check
+make -C src/test/recovery check
+```
+
+One particularly important recovery test is:
+
+```sh
+make -C src/test/recovery check PROVE_TESTS=t/074_umbra_torn_page_remap.pl
+```
+
+This test is a negative control in md mode; in Umbra mode it verifies
+torn-page remap recovery.
+
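+## Preliminary Performance Metrics
+
+Current performance evidence should be treated only as a directional signal.
+Two kinds of early metrics are given here:
+
+- TPCC-style throughput under the same workload
+- WAL size ratio under the same workload
+
+The throughput view matters because the WAL reduction alone does not fully
+describe performance.  The fair default baseline is:
+
+- `md + fpw=on`
+
+`md + fpw=off` is better suited as a sensitivity / upper-bound reference and
+should not be treated as a baseline with equivalent correctness constraints.
+
+Common settings: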
+
+- `checkpoint_timeout = 2min`
+- `max_wal_size = 20GB`
+- `shared_buffers = 50GB`
+- `logging_collector = on`
+- `runMins = 10`
+- `newOrderWeight = 45`
+- `paymentWeight = 43`
+- `deliveryWeight = 4`
+- `stockLevelWeight = 4`
+- `orderStatusWeight = 4`
+
+### TPCC-Style Throughput
+
+#### Checksums Off
+
+| Concurrency | `md + fpw=on` | `md + fpw=off` | `Umbra + fpw=on` |
+| ----------- | ------------: | -------------: | ---------------: |
+| 10          |        158709 |         154283 |           155781 |
+| 50          |        577005 |         626954 |           656353 |
+| 200         |        641899 |         981436 |           995635 |
+| 500         |        322660 |         943295 |           859058 |
+| 1000        |        275609 |         899631 |           729989 |
+
+#### Checksums On
+
+| Concurrency | `md + fpw=on` | `md + fpw=off` | `Umbra + fpw=on` |
+| ----------- | ------------: | -------------: | ---------------: |
+| 10          |        155754 |         152025 |           150606 |
+| 50          |        601974 |         635597 |           650844 |
+| 200         |        621176 |        1015923 |           938311 |
+| 500         |        316950 |         972795 |           729801 |
+| 1000        |        282713 |         891770 |           674865 |
+
+### WAL Size Ratio
+
+- `md WAL bytes with full_page_writes=on`
+- divided by
+- `Umbra WAL bytes with full_page_writes=on`
+
+A larger ratio means Umbra generates less WAL under the same workload.
+
+#### Checksums Off
+
+| Concurrency | md WAL / Umbra WAL |
+| ----------- | ------------------ |
+| 10          | 2.03               |
+| 50          | 2.51               |
+| 200         | 5.22               |
+| 500         | 6.90               |
+| 1000        | 6.55               |
+
+#### Checksums On
+
+| Concurrency | md WAL / Umbra WAL |
+| ----------- | ------------------ |
+| 10          | 1.82               |
+| 50          | 2.11               |
+| 200         | 3.81               |
+| 500         | 4.58               |
+| 1000        | 4.87               |
+
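+Taken together, the throughput and WAL-size numbers show that Umbra is not only
+reducing WAL volume.  Under the same workload, it also recovers a large part of
+the throughput lost to ordinary checkpoint-boundary full-page-image pressure.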
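+
+To make that comparison concrete, the following sketch computes Umbra's
+throughput relative to the `md + fpw=on` baseline for the checksums-off rows.
+The struct and function names are illustrative only and are not part of the
+Umbra patch; the numbers are copied from the throughput table above.
+
+```c
+#include <assert.h>
+#include <stddef.h>
+
+/* One row of the checksums-off throughput table (values copied from it). */
+typedef struct ThroughputRow
+{
+    int    concurrency;
+    double md_fpw_on;
+    double umbra_fpw_on;
+} ThroughputRow;
+
+static const ThroughputRow checksums_off[] = {
+    {10, 158709, 155781},
+    {50, 577005, 656353},
+    {200, 641899, 995635},
+    {500, 322660, 859058},
+    {1000, 275609, 729989},
+};
+
+/* Umbra throughput relative to md + fpw=on; 1.0 means parity. */
+static double
+umbra_vs_md(const ThroughputRow *row)
+{
+    assert(row != NULL && row->md_fpw_on > 0);
+    return row->umbra_fpw_on / row->md_fpw_on;
+}
+```
+
+For the concurrency-200 row this evaluates to roughly 1.55, while the
+concurrency-10 row stays slightly below 1.0, so the gain appears only once
+checkpoint-boundary pressure dominates.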
+
+These numbers should be read as:
+
+- preliminary
+- directional
+- not yet a complete benchmark
+
+They should not be read as a final claim about throughput, latency, or full
+replication/recovery cost.
diff --git a/doc/umbra/ARCHITECTURE.md b/doc/umbra/ARCHITECTURE.md
new file mode 100644
index 0000000000..7218b1a959
--- /dev/null
+++ b/doc/umbra/ARCHITECTURE.md
@@ -0,0 +1,437 @@
+# Umbra Architecture on PostgreSQL Master
+
+This document describes the current module boundaries and ownership rules of
+the PostgreSQL master Umbra PoC.
+
+The main architectural intent is:
+
+- upper PostgreSQL layers continue to speak in logical block numbers
+- Umbra translates mapped forks to physical block numbers underneath
+- the MAP subsystem owns persistent mapping facts
+- `umbra.c` owns runtime interpretation of those facts
+- `umfile.c` owns physical file operations
+
+## 1. Top-Level Layers
+
+Umbra is not a standalone engine next to PostgreSQL.  It is a storage-manager
+variant integrated into:
+
+- `smgr`
+- WAL record assembly
+- redo entry points
+- checkpoint/writeback
+- postmaster background workers
+
+The main layers are:
+
+- `src/backend/storage/smgr/umbra.c`
+  - Umbra `smgr` implementation
+- `src/backend/storage/smgr/umfile.c`
+  - low-level physical file and segment manager
+- `src/backend/storage/map/*`
+  - shared MAP metadata, buffer, superblock, and background-maintenance logic
+- `src/backend/access/transam/xloginsert.c`
+  - producer-side remap-aware WAL assembly
+- `src/backend/access/transam/xlogutils.c`
+  - redo-side remap interpretation
+- `src/backend/access/transam/umbra_xlog.c`
+  - Umbra rmgr records for MAP lifecycle operations
+
+## 2. Relation-Local Umbra State
+
+`SMgrRelation` no longer carries a public Umbra-specific struct layout.
+
+Umbra keeps its relation-local state behind `reln->umbra_private`, where
+`umbra.c` stores:
+
+- a borrowed `UmbraFileContext *`
+- an explicit relation-local MAP state
+
+The current MAP state is not derived from multiple booleans anymore.  It is an
+explicit state machine:
+
+- `UMBRA_MAP_POLICY_BYPASS_MAP`
+- `UMBRA_MAP_POLICY_SKIP_WAL_PENDING_MAP`
+- `UMBRA_MAP_POLICY_REQUIRE_MAP`
+
+That state is seeded by create/open/redo owner points and then consumed by the
+runtime access path.
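+
+A minimal sketch of that "seed once, consume at runtime" shape is shown below.
+Only the three policy names come from this document; `SketchRelState`,
+`sketch_seed_policy()`, and `sketch_must_translate()` are hypothetical, and the
+behavior chosen for `SKIP_WAL_PENDING_MAP` is an assumption made purely for
+illustration.
+
+```c
+#include <assert.h>
+#include <stdbool.h>
+
+/* Policy names are taken from the text; the numeric values are arbitrary. */
+typedef enum UmbraMapPolicy
+{
+    UMBRA_MAP_POLICY_BYPASS_MAP,
+    UMBRA_MAP_POLICY_SKIP_WAL_PENDING_MAP,
+    UMBRA_MAP_POLICY_REQUIRE_MAP
+} UmbraMapPolicy;
+
+/* Hypothetical stand-in for the state kept behind reln->umbra_private. */
+typedef struct SketchRelState
+{
+    UmbraMapPolicy map_policy;
+} SketchRelState;
+
+/* An owner point (create/open/redo) seeds the state. */
+static void
+sketch_seed_policy(SketchRelState *state, UmbraMapPolicy policy)
+{
+    state->map_policy = policy;
+}
+
+/* The runtime path consumes it; true means "translate through the MAP". */
+static bool
+sketch_must_translate(const SketchRelState *state)
+{
+    switch (state->map_policy)
+    {
+        case UMBRA_MAP_POLICY_BYPASS_MAP:
+            return false;
+        case UMBRA_MAP_POLICY_SKIP_WAL_PENDING_MAP:
+            /* assumed: the mapping is still pending, so no lookup yet */
+            return false;
+        case UMBRA_MAP_POLICY_REQUIRE_MAP:
+            return true;
+    }
+    assert(!"unknown UmbraMapPolicy");
+    return false;
+}
+```
+
+The exhaustive `switch` is the point of the sketch: a single enum value gives
+the access path one place to decide, instead of combining several booleans.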
+
+## 3. Metadata Fork
+
+Umbra uses an internal metadata fork to store:
+
+- block 0: MAP superblock
+- blocks 1..: MAP pages
+
+The metadata fork is:
+
+- internal to Umbra
+- dense, not sparse
+- special-cased in path and sync handling
+
+The metadata fork stores mapping state for all three mapped forks, `MAIN`,
+`FSM`, and `VM`, but it does not store their page contents.  `MAIN/FSM/VM`
+remain relation forks addressed by logical block number at upper layers.  MAP
+pages in the metadata fork only answer: for this fork and this logical block,
+which physical block is current?
+
+The layout is not three independent map forks.  It is one metadata fork with
+fixed repeated groups:
+
+- block 0: MAP superblock
+- blocks 1..: repeated MAP page groups
+- each group starts with 1 FSM map page
+- then 1 VM map page
+- then 8192 MAIN map pages
+
+Each MAP page is a fixed-entry array, and each entry records one `lblk -> pblk`
+mapping for the corresponding fork.  In that sense, the metadata fork is one
+internal MAP file with formula-defined `mapfsm`, `mapvm`, and `mapmain` logical
+regions.
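+
+The group layout above can be turned into a block-number formula.  The sketch
+below is illustrative and not taken from the patch: the names are hypothetical,
+and it assumes that the k-th FSM (or VM) MAP page lives in the k-th group,
+which this document does not state explicitly.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+typedef uint32_t BlockNumber;
+
+typedef enum SketchMappedFork
+{
+    SKETCH_FORK_MAIN,
+    SKETCH_FORK_FSM,
+    SKETCH_FORK_VM
+} SketchMappedFork;
+
+#define SKETCH_MAIN_PAGES_PER_GROUP 8192u
+/* 1 FSM map page + 1 VM map page + 8192 MAIN map pages */
+#define SKETCH_GROUP_PAGES (2u + SKETCH_MAIN_PAGES_PER_GROUP)
+
+/*
+ * Metadata-fork block holding the map_page_index-th MAP page of a fork.
+ * Block 0 is the superblock, so groups start at block 1.
+ */
+static BlockNumber
+sketch_map_page_block(SketchMappedFork fork, uint32_t map_page_index)
+{
+    uint32_t group;
+    uint32_t slot;
+
+    switch (fork)
+    {
+        case SKETCH_FORK_FSM:
+            return 1u + map_page_index * SKETCH_GROUP_PAGES;
+        case SKETCH_FORK_VM:
+            return 1u + map_page_index * SKETCH_GROUP_PAGES + 1u;
+        case SKETCH_FORK_MAIN:
+            group = map_page_index / SKETCH_MAIN_PAGES_PER_GROUP;
+            slot = map_page_index % SKETCH_MAIN_PAGES_PER_GROUP;
+            return 1u + group * SKETCH_GROUP_PAGES + 2u + slot;
+    }
+    assert(!"unknown fork");
+    return 0;
+}
+```
+
+Under these assumptions the first group occupies blocks 1 through 8194, and
+the second group starts at block 8195.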
+
+Its page format is also not the same as ordinary PostgreSQL data pages.  Block
+0 is packed as a 512-byte MAP superblock sector.  Blocks 1.. are compact
+fixed-entry MAP metadata pages, closer in spirit to CLOG-style metadata than
+to heap/index data pages.  They therefore do not use ordinary data-page
+full-page-image semantics.
+
+This matters because internal metadata forks must not be passed through generic
+core helpers that only understand PostgreSQL's built-in forks.  Metadata path
+construction stays inside Umbra-aware helpers such as `UmMetadataRelPathPerm()`
+and related wrappers.
+
+## 4. `umbra.c`: Runtime Storage Semantics
+
+`src/backend/storage/smgr/umbra.c` sits at the `smgr` boundary.
+
+It owns:
+
+- access classification for a relation/fork
+- mapped-vs-bypass decisions
+- logical-block to physical-block translation
+- publish/consume rules for mapped births and remaps
+- metadata-fork lifecycle wrappers
+- `FileTag` conversion for Umbra-managed files
+
+It does not own:
+
+- raw segment file management
+- MAP page layout
+- shared superblock table logic
+
+The important separation in this file is:
+
+1. classify runtime access state
+2. consume MAP facts
+3. issue physical I/O through `umfile`
+
+That separation is why thin metadata wrappers such as `UmMetadataExists()` and
+`UmMetadataRead()` are still useful: they keep the internal metadata fork
+details localized instead of scattering `UMBRA_METADATA_FORKNUM` and dense-fork
+assumptions across the tree.
+
+## 5. `umfile.c`: Physical File Layer
+
+`src/backend/storage/smgr/umfile.c` owns the physical side of Umbra storage.
+
+It is responsible for:
+
+- backend-local file context registry
+- segment open/close management
+- dense versus sparse physical existence semantics
+- physical read/write/extend/zeroextend
+- unlink, sync, delayed-unlink, and write-session helpers
+
+Current writeback architecture uses `UmFileWriteSession` so callers such as MAP
+flush pass only storage identity and block information.  The MAP layer no
+longer needs to manipulate `UmbraFileContext` directly when flushing.
+
+Checkpoint/bgwriter writeback uses `umfile_write_session_begin_uncached()`,
+which intentionally avoids long-lived registry reuse in background processes.
+That prevents stale relation-local file state from being kept across relation
+lifecycle changes.
+
+## 6. MAP Subsystem
+
+The MAP subsystem is now split by functional domain.
+
+### 6.1 `map.c`
+
+This file now mainly owns:
+
+- mapping lookup
+- mapping allocation
+- mapping publication
+- truncate/lifecycle operations
+
+### 6.2 `mapbuf.c`
+
+This file owns MAP buffer-local state:
+
+- buffer state bits
+- pin/unpin
+- MAP buffer I/O ownership
+- using `MapMarkBufferDirty()` to ensure the corresponding metadata-fork block
+  exists before an ordinary MAP page is marked dirty
+
+The important rule here is:
+
+- ordinary MAP page modifications must be dirtied through
+  `MapMarkBufferDirty()`; if the metadata-fork block does not exist yet, that
+  path creates the MAP block first
+- checkpoint/writeback later writes existing blocks only
+
+This mirrors ordinary buffer-pool ownership more closely than the older design
+that allowed flush-time materialization.
+
+### 6.3 `mapflush.c`
+
+This file owns:
+
+- checkpoint flush of MAP buffers
+- checkpoint flush of superblocks
+- mapwriter background flush of ordinary MAP pages
+
+Current ownership rules are:
+
+- mapwriter flushes regular MAP pages only
+- checkpoint owns superblock flush
+- flush writes existing metadata blocks
+- flush no longer zeroextends missing metadata blocks on demand
+
+### 6.4 `mapbgproc.c`
+
+This file owns background maintenance:
+
+- preallocation
+- reclaim enqueue/dequeue work
+- compactor stepping
+- writer/compactor wakeup helpers
+
+`mapwriter` and `mapcompactor` are now driven directly from the MAP layer
+rather than through an `smgr` wrapper layer.
+
+### 6.5 `mapclock.c`
+
+This file owns:
+
+- clock sweep victim selection
+- MAP cache table
+- sync-start reporting
+
+### 6.6 `mapsuper.c`
+
+This file owns:
+
+- MAP superblock read/pack/CRC helpers
+- shared `MapSuperEntry` hash-table management
+- logical frontier, physical frontier, and allocator frontier updates
+- runtime extending state for fork materialization
+
+The current shared-entry model distinguishes:
+
+- logical EOF (`logical_nblocks`) in the on-disk superblock
+- materialized physical frontier (`phys_capacity` / physical nblocks) in the
+  on-disk superblock
+- committed allocator frontier (`next_free_phys_block`) in the on-disk
+  superblock
+- reservation frontier in `MapSuperEntry` runtime state only
+
+That split is important both for correctness and for WAL/redo.
+
+The allocator invariant is:
+
+- committed `next_free_phys_block <= reservation frontier`
+
+That should be asserted while holding `MapSuperEntry.lock`; reservation may run
+ahead in shared memory, but checkpoint-visible superblock state must not.
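+
+As a rough illustration of that invariant, the sketch below models the two
+frontiers with plain integers.  `SketchSuperEntry` and the helper names are
+hypothetical, and locking is omitted: in the real code the caller would be
+expected to hold `MapSuperEntry.lock`.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+typedef uint32_t BlockNumber;
+
+typedef struct SketchSuperEntry
+{
+    BlockNumber committed_next_free;   /* mirrors the on-disk superblock */
+    BlockNumber reservation_frontier;  /* runtime state only */
+} SketchSuperEntry;
+
+static void
+sketch_check_invariant(const SketchSuperEntry *entry)
+{
+    assert(entry->committed_next_free <= entry->reservation_frontier);
+}
+
+/* Reserve one physical block; only the runtime frontier moves. */
+static BlockNumber
+sketch_reserve(SketchSuperEntry *entry)
+{
+    BlockNumber pblk = entry->reservation_frontier++;
+
+    sketch_check_invariant(entry);
+    return pblk;
+}
+
+/* Publish a committed frontier after WAL insertion; never moves backwards. */
+static void
+sketch_commit(SketchSuperEntry *entry, BlockNumber frontier)
+{
+    assert(frontier <= entry->reservation_frontier);
+    if (frontier > entry->committed_next_free)
+        entry->committed_next_free = frontier;
+    sketch_check_invariant(entry);
+}
+```
+
+Two reservations followed by one commit leave the reservation frontier at 2
+and the committed frontier at whatever the commit published, never beyond 2.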
+
+### 6.7 `mapinit.c`
+
+This file owns:
+
+- shared-memory initialization
+- backend initialization
+- shared statistics
+- GUC-backed globals
+
+### 6.8 `mapinflight.c`
+
+This file owns in-flight remap ownership tracking.
+
+It uses per-MAP-buffer pending bits to serialize ownership of a logical MAP
+entry while a backend is preparing or publishing a new physical mapping.  The
+chosen physical block remains backend-local until WAL insertion commits the
+owner state.
+
+This mechanism is about ownership and barriers, not durable publication.  The
+durable superblock frontier must not advance here.  Runtime reservation state
+may run ahead in shared memory, but committed `next_free_pblkno` is published
+later by WAL-owned commit/redo.
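+
+One way to picture pending-bit ownership is a per-buffer bitmask updated with
+atomic read-modify-write operations.  This is only a sketch under stated
+assumptions: the 64-entry limit, the C11 atomics, and all names are
+hypothetical, and the actual patch may serialize ownership differently (for
+example under a buffer lock).
+
+```c
+#include <assert.h>
+#include <stdatomic.h>
+#include <stdbool.h>
+#include <stdint.h>
+
+/* Toy MAP buffer: one pending bit per entry, at most 64 entries. */
+typedef struct SketchMapBuffer
+{
+    _Atomic uint64_t pending_bits;
+} SketchMapBuffer;
+
+/* Returns true if the caller became the in-flight owner of this entry. */
+static bool
+sketch_try_own_entry(SketchMapBuffer *buf, unsigned slot)
+{
+    uint64_t bit;
+
+    assert(slot < 64);
+    bit = UINT64_C(1) << slot;
+    return (atomic_fetch_or(&buf->pending_bits, bit) & bit) == 0;
+}
+
+/* Drop ownership once the mapping has been published or abandoned. */
+static void
+sketch_release_entry(SketchMapBuffer *buf, unsigned slot)
+{
+    assert(slot < 64);
+    atomic_fetch_and(&buf->pending_bits, ~(UINT64_C(1) << slot));
+}
+```
+
+Nothing in this sketch touches a durable frontier, which matches the rule
+above: ownership is a barrier, not a publication step.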
+
+## 7. Checkpoint and Writeback Ownership
+
+One of the largest recent architectural cleanups is the checkpoint/writeback
+contract.
+
+The current contract is:
+
+- synthesized ordinary MAP pages carry `MAPBUF_NOT_MATERIALIZED`
+- if the corresponding metadata-fork block does not exist yet, the first writer
+  that dirties such a page creates that MAP block under the page content lock
+- checkpoint/mapwriter later write the existing block only
+
+That means:
+
+- missing MAP-block creation is no longer hidden inside flush
+- `MapFlushBuffer()` uses write-existing semantics
+- background writeback no longer invents missing metadata blocks
+
+This is the same broad ownership rule as the normal buffer pool:
+
+- extension / MAP-block creation happens before writeback
+- writeback persists existing dirty state
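+
+The contract can be sketched with a toy model in which the metadata fork is
+just a block count.  The flag name follows `MAPBUF_NOT_MATERIALIZED` from the
+text; everything else here is hypothetical, and the page content lock that the
+real dirtying path holds is left out.
+
+```c
+#include <assert.h>
+#include <stdbool.h>
+#include <stdint.h>
+
+typedef struct SketchMetaFork
+{
+    uint32_t nblocks;           /* dense fork: blocks 0..nblocks-1 exist */
+} SketchMetaFork;
+
+typedef struct SketchMapBuf
+{
+    uint32_t blkno;
+    bool     not_materialized;  /* models MAPBUF_NOT_MATERIALIZED */
+    bool     dirty;
+} SketchMapBuf;
+
+/* Dirtying path: create the missing block first, then mark dirty. */
+static void
+sketch_mark_dirty(SketchMetaFork *fork, SketchMapBuf *buf)
+{
+    if (buf->not_materialized)
+    {
+        if (fork->nblocks <= buf->blkno)
+            fork->nblocks = buf->blkno + 1;
+        buf->not_materialized = false;
+    }
+    buf->dirty = true;
+}
+
+/* Flush path: write-existing only; returns true if a write happened. */
+static bool
+sketch_flush(const SketchMetaFork *fork, SketchMapBuf *buf)
+{
+    if (!buf->dirty)
+        return false;
+    assert(!buf->not_materialized && buf->blkno < fork->nblocks);
+    buf->dirty = false;
+    return true;
+}
+```
+
+In this model a flush can never extend the fork: the assertion fails if a
+dirty page reaches writeback without its block having been created.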
+
+## 8. Background Processes
+
+Umbra currently adds two background workers under `postmaster` when built with
+`--with-umbra`:
+
+- `mapwriter`
+  - sync-start accounting
+  - ordinary MAP page flush
+  - preallocation
+- `mapcompactor`
+  - relocation and reclaim work
+
+These processes call directly into the MAP layer rather than through a generic
+`smgr` forwarding API.  That keeps ownership clearer:
+
+- MAP background work belongs to the MAP subsystem
+- `smgr` remains the storage-manager boundary for relation storage calls
+
+## 9. WAL and Redo Boundaries
+
+The current boundary is:
+
+- `xloginsert.c`
+  - decides whether a block record carries remap metadata
+  - fills the remap header payload
+  - commits mapping/frontier publication after WAL insertion succeeds
+- `xlogutils.c`
+  - ensures metadata and MAP state for redo
+  - interprets remap-with-image and remap-without-image
+  - temporarily reconstructs the old mapping view when needed
+- `umbra_xlog.c`
+  - handles Umbra rmgr records such as `MAP_SET`, range remap, and reclaim
+
+The redo-entry layer owns remap interpretation because generic block-read
+helpers do not know enough about:
+
+- `has_remap`
+- `has_image`
+- `old_pblkno`
+- `new_pblkno`
+- frontier payload
+
+The detailed rules are described in [WAL_AND_REDO.md](./WAL_AND_REDO.md).
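+
+As a rough picture of why redo needs those fields, the sketch below replays
+one block reference against a toy physical store in which each "page" is a
+single integer.  The field names mirror the list above; the struct, the
+function, and the choice to write the delta-only result to the new physical
+block are assumptions for illustration, not the patch's actual redo code.
+
+```c
+#include <assert.h>
+#include <stdbool.h>
+#include <stdint.h>
+
+typedef uint32_t BlockNumber;
+
+typedef struct SketchBlockRef
+{
+    bool        has_remap;
+    bool        has_image;
+    BlockNumber old_pblkno;
+    BlockNumber new_pblkno;
+    int         image;      /* stand-in for a full-page image */
+    int         delta;      /* stand-in for the WAL delta */
+} SketchBlockRef;
+
+/* Replays one reference; returns the physical block mapped afterwards. */
+static BlockNumber
+sketch_redo_block(int *phys, BlockNumber cur_pblkno, const SketchBlockRef *ref)
+{
+    if (!ref->has_remap)
+    {
+        /* ordinary replay on the currently mapped physical block */
+        phys[cur_pblkno] = ref->has_image ? ref->image
+                                          : phys[cur_pblkno] + ref->delta;
+        return cur_pblkno;
+    }
+
+    if (ref->has_image)
+        phys[ref->new_pblkno] = ref->image;     /* install image on new map */
+    else
+        phys[ref->new_pblkno] = phys[ref->old_pblkno] + ref->delta;
+    return ref->new_pblkno;
+}
+```
+
+In the delta-only case the old physical block is read but never modified,
+which is what lets it serve as the recovery baseline.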
+
+## 10. Current Invariants
+
+The current code relies on these invariants:
+
+- metadata fork handling stays inside Umbra-aware helpers
+- runtime access state is explicit, not reconstructed from multiple booleans
+- ordinary MAP pages are materialized before flush
+- checkpoint/mapwriter write existing metadata blocks only
+- remap publication after WAL insertion is an owner action
+- redo owns redo-only metadata bootstrap and remap interpretation
+- skip-WAL dense-map WAL describes exact mapping/frontier facts, but does not
+  replace the existing data-file sync protocol
+- full-page images are still kept for explicit image owners and WAL
+  consistency checking
+
+These invariants should be treated as design constraints when reviewing later
+WAL-size optimizations.  For example, `next_free_pblkno` is a global allocator
+frontier, not a value that can always be replaced by `new_pblkno + 1`.
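+
+A small example shows why.  It assumes, for illustration only, that two
+backends can reserve physical blocks in one order and insert their WAL records
+in the other, and that each record carries an explicit frontier payload; the
+struct and function names are hypothetical.
+
+```c
+#include <assert.h>
+#include <stddef.h>
+#include <stdint.h>
+
+typedef uint32_t BlockNumber;
+
+typedef struct SketchRemap
+{
+    BlockNumber new_pblkno;
+    BlockNumber next_free_pblkno;   /* explicit frontier payload */
+} SketchRemap;
+
+/* Shortcut under review: derive the frontier from each record's new block. */
+static BlockNumber
+sketch_replay_derived(const SketchRemap *recs, size_t n)
+{
+    BlockNumber frontier = 0;
+
+    for (size_t i = 0; i < n; i++)
+        frontier = recs[i].new_pblkno + 1;
+    return frontier;
+}
+
+/* Replay using the frontier each record carries; never moves backwards. */
+static BlockNumber
+sketch_replay_explicit(const SketchRemap *recs, size_t n)
+{
+    BlockNumber frontier = 0;
+
+    for (size_t i = 0; i < n; i++)
+    {
+        if (recs[i].next_free_pblkno > frontier)
+            frontier = recs[i].next_free_pblkno;
+    }
+    return frontier;
+}
+```
+
+With records for blocks 11 and then 10 in WAL order, the derived frontier ends
+at 11 and would hand block 11 out again, while the explicit frontier stays at
+12.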
+
+## 11. Architectural Choices
+
+The current PoC makes two deliberate architectural choices that are worth
+stating explicitly.
+
+### 11.1 Space Cleanup Policy
+
+Once logical block numbering is decoupled from physical placement, Umbra has at
+least two possible physical-space-management models:
+
+1. immediately reuse freed physical blocks, closer to PostgreSQL's traditional
+   reusable-space style; or
+2. let the physical frontier move forward, while treating reclaim/reuse as a
+   later background concern instead of a foreground allocation requirement.
+
+The current PoC chooses the second model on purpose.
+
+The reason is not that reuse is impossible, but that immediate reuse would push
+substantial allocator complexity back into the foreground path:
+
+- free-space accounting would become part of normal allocation decisions
+- remap publication would need tighter coupling with reuse eligibility
+- WAL/redo would need to preserve more allocator-state invariants
+- in-flight ownership and recovery races would become harder to reason about
+
+After Umbra has already decoupled logical identity from physical placement, the
+main value of that decoupling is simplicity of ownership and correctness.  The
+foreground path therefore prefers monotonic physical advancement, while
+compaction/reclaim remain the place where old physical space is cleaned up and
+made reusable later.
+
+In other words:
+
+- immediate physical reuse is not the primary design goal of the PoC
+- correctness and simpler ownership are prioritized over aggressive reuse
+- reclaim exists, but it is intentionally a background policy rather than a
+  synchronous allocator contract
+
+### 11.2 Double-Buffering Boundary
+
+Umbra also deliberately keeps its buffering complexity inside Umbra-specific
+layers instead of trying to collapse everything into PostgreSQL's generic
+buffering model immediately.
+
+This means the PoC tolerates a double-buffering shape:
+
+- PostgreSQL keeps its ordinary upper buffer/cache behavior
+- Umbra keeps its own MAP buffers, superblock shared state, in-flight tracking,
+  and physical-file writeback state
+
+That choice is intentional for three reasons:
+
+1. it keeps Umbra-specific complexity inside Umbra rather than leaking remap,
+   allocator, and metadata-lifecycle rules into generic PostgreSQL buffer
+   ownership;
+2. it allows the project to measure and understand the system-level impact of
+   the extra buffering layer instead of assuming up front that it must be
+   eliminated; and
+3. it keeps the design open to future deployment models, including
+   cloud-oriented environments where storage-side services and local caching
+   boundaries may not match a traditional single-node assumption.
+
+This is a design trade-off, not a claim that double buffering is always ideal.
+The current PoC chooses modular isolation first, and leaves deeper buffer-model
+consolidation as a later optimization/design question.
+
+## 12. Open Architectural Debt
+
+The codebase is much cleaner than the earlier PG18-era branch, but a few
+medium-sized debts remain:
+
+- `mapsuper.c` is now the largest MAP module and could later be split into
+  on-disk-superblock helpers versus shared-super-entry management
+- `mapbgproc.c` still combines preallocation, reclaim, compaction, and wakeup
+  logic
+- `umfile.c` still mixes context/session management with raw segment/file
+  operations
+
+Those are now refactoring opportunities, not immediate ownership bugs.
diff --git a/doc/umbra/ARCHITECTURE_ZH.md b/doc/umbra/ARCHITECTURE_ZH.md
new file mode 100644
index 0000000000..343eb6eb73
--- /dev/null
+++ b/doc/umbra/ARCHITECTURE_ZH.md
@@ -0,0 +1,282 @@
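+# Umbra Architecture
+
+This document is the Chinese companion version of `ARCHITECTURE.md`.  It
+describes how the current Umbra prototype on PostgreSQL master is layered and
+what each module is responsible for.
+
+## 1. Overall Goals
+
+Umbra is not a standalone storage engine next to PostgreSQL.  It is a
+storage-management prototype attached at PostgreSQL's `storage manager`
+boundary.
+
+Its core goals are:
+
+- upper PostgreSQL layers continue to use only logical block numbers;
+- Umbra translates mapped forks to physical blocks below `smgr`;
+- the MAP subsystem persists the `lblk -> pblk` mapping;
+- WAL explicitly carries the information that remap needs;
+- redo can deterministically rebuild the mapping and page contents during
+  recovery.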
+
+## 2. Main Modules
+
+The main code paths are:
+
+- `src/backend/storage/smgr/umbra.c`
+  - Umbra's `smgr` implementation;
+  - runtime access policy;
+  - logical-to-physical block translation.
+- `src/backend/storage/smgr/umfile.c`
+  - physical file layer;
+  - segment file management;
+  - dense/sparse existence checks;
+  - sync, unlink, and delayed unlink.
+- `src/backend/storage/map/`
+  - MAP pages;
+  - MAP buffers;
+  - superblock;
+  - checkpoint / mapwriter flush;
+  - preallocation, reclaim, and compaction;
+  - in-flight owner tracking.
+- `src/backend/access/transam/xloginsert.c`
+  - remap decisions on the WAL producer side;
+  - remap header fill;
+  - mapping publication after WAL insert succeeds.
+- `src/backend/access/transam/xlogutils.c`
+  - redo-side remap interpretation and execution.
+- `src/backend/access/transam/umbra_xlog.c`
+  - Umbra's own rmgr lifecycle records.
+
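+## 3. Relation-Local State
+
+Umbra's relation-local state hangs behind `SMgrRelation->umbra_private`, so
+Umbra's internal structures are not exposed to ordinary `smgr` callers.
+
+The current access policy uses an explicit state rather than being assembled
+from multiple booleans:
+
+- `UMBRA_MAP_POLICY_BYPASS_MAP`
+- `UMBRA_MAP_POLICY_SKIP_WAL_PENDING_MAP`
+- `UMBRA_MAP_POLICY_REQUIRE_MAP`
+
+These states are established at the owner points for create/open/redo and then
+consumed by the runtime access path.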
+
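+## 4. Metadata Fork
+
+Each Umbra relation has an internal metadata fork:
+
+- block 0 is the MAP superblock;
+- blocks 1.. are ordinary MAP pages.
+
+The metadata fork is an Umbra-internal structure, not a fork visible to
+ordinary PostgreSQL users.  Metadata path, sync, unlink, and dense/sparse
+semantics must therefore stay inside Umbra-aware helpers and must not leak into
+generic fork helpers.
+
+The metadata fork holds mapping state for all three mapped forks, `MAIN`,
+`FSM`, and `VM`, but it does not hold their page contents.  In other words,
+`MAIN/FSM/VM` remain relation forks that upper layers access by logical block
+number; MAP pages in the metadata fork only answer "which physical block does
+this logical block of this fork currently map to".
+
+The layout is not three independent map forks.  It is a set of fixed groups
+inside one metadata fork:
+
+- block 0: MAP superblock;
+- blocks 1..: repeated MAP page groups;
+- each group starts with 1 FSM map page;
+- then 1 VM map page;
+- then 8192 MAIN map pages.
+
+Each MAP page consists of fixed-size entries, and each entry records the
+`lblk -> pblk` mapping of one logical block in the corresponding fork.  The
+metadata fork can therefore be understood as one internal MAP file, divided by
+a stable formula into `mapfsm`, `mapvm`, and `mapmain` logical regions.
+
+Its page format also differs from ordinary PostgreSQL data pages.  Block 0 is a
+MAP superblock packed as a `512B` sector; blocks 1.. are compact MAP metadata
+pages made of fixed-size entries, semantically closer to CLOG-style metadata
+than to heap/index data pages.  They therefore do not use ordinary data-page
+full-page-image semantics.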
+
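+## 5. MAP Subsystem Responsibilities
+
+The MAP subsystem is currently split by responsibility as follows:
+
+- `map.c`
+  - lookup;
+  - allocation;
+  - mapping publication;
+  - truncate / lifecycle handling.
+- `mapbuf.c`
+  - MAP buffer state;
+  - pin / unpin;
+  - buffer I/O ownership;
+  - using `MapMarkBufferDirty()` to guarantee that the corresponding physical
+    block already exists in the metadata fork before an ordinary MAP page is
+    marked dirty.
+- `mapflush.c`
+  - checkpoint flush;
+  - mapwriter flush;
+  - superblock flush.
+- `mapbgproc.c`
+  - preallocation;
+  - reclaim;
+  - compactor;
+  - writer / compactor wakeup.
+- `mapclock.c`
+  - clock sweep;
+  - MAP cache table;
+  - sync-start statistics.
+- `mapsuper.c`
+  - superblock pack, unpack, and CRC;
+  - shared `MapSuperEntry` table;
+  - logical EOF, physical capacity, and allocator frontier.
+- `mapinit.c`
+  - shared-memory initialization;
+  - backend initialization;
+  - GUC-driven global state.
+- `mapinflight.c`
+  - in-flight remap owner;
+  - write barrier;
+  - pending marks.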
+
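+## 6. Superblock State Split
+
+The superblock and the shared entry hold several different kinds of state at
+the same time, and these must not be mixed together:
+
+- `logical_nblocks`
+  - logical EOF;
+  - persisted in the superblock.
+- `phys_capacity`
+  - capacity that has already been physically materialized;
+  - persisted in the superblock.
+- `next_free_pblkno`
+  - committed allocator frontier;
+  - persisted in the superblock;
+  - published by WAL-owned commit / redo.
+- reservation frontier
+  - runtime reservation frontier;
+  - exists only in the shared state of `MapSuperEntry`;
+  - not written to disk directly.
+
+The key invariant is:
+
+```text
+committed next_free_pblkno <= runtime reservation frontier
+```
+
+In other words, the reservation frontier may run ahead in memory, but the
+checkpoint-visible committed frontier must not run ahead of the state that WAL
+has already published.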
+
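+## 7. Checkpoint and Writeback
+
+The current writeback rules are:
+
+- paths that modify an ordinary MAP page must dirty it through
+  `MapMarkBufferDirty()`; if the metadata fork does not yet have the
+  corresponding physical block, that path creates the MAP block first;
+- checkpoint / mapwriter write only dirty MAP pages that already exist;
+- the flush phase does not create missing MAP pages;
+- superblock flush is owned by checkpoint.
+
+These rules deliberately stay close to PostgreSQL's ordinary buffer pool:
+
+- creating a missing MAP block happens before writeback;
+- writeback only persists dirty state that already exists.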
+
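+## 8. mapwriter and mapcompactor
+
+Umbra currently has two background workers:
+
+- `mapwriter`
+  - accounts for MAP allocation pressure;
+  - flushes ordinary MAP pages;
+  - performs preallocation.
+- `mapcompactor`
+  - handles physical relocation;
+  - handles reclaim.
+
+`mapwriter` scans `MapSuperEntry` to decide whether preallocation is needed,
+but it is not responsible for persisting dirty superblocks.  Dirty superblocks
+are still flushed by checkpoint.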
+
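+## 9. WAL / Redo Boundary
+
+The WAL / redo boundary is:
+
+- `xloginsert.c`
+  - generates the remap header;
+  - publishes the mapping and frontier after WAL insert succeeds.
+- `xlogutils.c`
+  - interprets remap on the redo side;
+  - distinguishes remap-with-image from remap-without-image.
+- `umbra_xlog.c`
+  - records explicit MAP lifecycle events.
+
+The redo side must understand remap because ordinary block-read helpers do not
+know:
+
+- whether the current record carries a remap;
+- whether it carries an image;
+- what the old physical baseline is;
+- what the new physical target is;
+- what the frontier payload is.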
+
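+## 10. Architectural Trade-offs
+
+The current prototype deliberately keeps two architectural choices that should
+be stated explicitly.
+
+### 10.1 Space Cleanup Policy
+
+Once logical block numbers are decoupled from physical block placement, Umbra
+has at least two ways to manage physical space:
+
+1. reuse freed physical blocks as soon as possible, like PostgreSQL's
+   traditional reusable-space style;
+2. let the physical frontier keep moving forward, and treat reclaim / reuse as
+   a later background cleanup policy rather than a synchronous requirement of
+   the foreground allocation path.
+
+The current prototype chooses the second on purpose.
+
+The reason is not that reuse is impossible.  If the foreground path took on
+immediate reuse, it would pull a whole set of complexity back into the main
+allocation path:
+
+- free-space accounting would enter normal allocation decisions;
+- remap publication would be coupled more tightly with reuse eligibility;
+- WAL / redo would need to maintain more allocator-state invariants;
+- in-flight ownership and recovery races would be harder to reason about.
+
+Since Umbra has already decoupled logical identity from physical placement,
+one core benefit of that decoupling is simpler ownership boundaries and
+correctness rules.  The current prototype therefore chooses:
+
+- the foreground path prefers monotonic forward advancement of physical
+  blocks;
+- reclaim / compaction clean up old physical space later;
+- reuse exists, but it is a background policy, not a synchronous foreground
+  contract.
+
+In other words:
+
+- the primary goal of the current prototype is not immediate reuse of physical
+  blocks;
+- the current priority is correctness and simpler ownership boundaries;
+- reclaim does exist, but it is deliberately placed in the background rather
+  than on the foreground allocation path.
+
+### 10.2 Double-Buffering Boundary
+
+Umbra also deliberately keeps a layer of internal buffering complexity rather
+than trying from the start to push everything into PostgreSQL's generic buffer
+model.
+
+This means the current prototype accepts a double-buffering shape:
+
+- PostgreSQL keeps its upper-layer generic buffer / cache behavior;
+- Umbra keeps its own MAP buffers, superblock shared state, in-flight
+  tracking, and physical-file writeback state.
+
+There are three reasons for this:
+
+1. keep Umbra-specific remap, allocator, and metadata-lifecycle complexity
+   encapsulated inside Umbra as far as possible, rather than leaking it into
+   PostgreSQL's generic buffer ownership model;
+2. observe the actual system-wide impact of double buffering first, rather
+   than assuming up front that it must be eliminated;
+3. leave room for future deployment models, including storage-side services
+   and local caching boundaries that may appear in cloud-native scenarios;
+   those scenarios may not match the traditional single-node, single-buffer
+   assumption.
+
+This is an engineering trade-off, not a claim that double buffering is always
+optimal.  The current prototype's choice is:
+
+- ensure module isolation and clear semantics first;
+- leave deeper buffer-model consolidation as a later optimization and design
+  question.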
+
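+## 11. Current Architectural Debt
+
+Some engineering debt remains:
+
+- `mapsuper.c` is still on the large side;
+- `mapbgproc.c` contains preallocation, reclaim, compaction, and wakeup logic
+  together;
+- `umfile.c` contains both context / session handling and low-level segment
+  file operations;
+- compactor / reclaim is not yet final production-grade space management.
+
+For now these are later refactoring points, not the most central correctness
+blockers of the current prototype.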
diff --git a/doc/umbra/PROTOTYPE.md b/doc/umbra/PROTOTYPE.md
new file mode 100644
index 0000000000..fd9ea67654
--- /dev/null
+++ b/doc/umbra/PROTOTYPE.md
@@ -0,0 +1,86 @@
+# Umbra Prototype and Repository Navigation
+
+This document explains how the current PostgreSQL master PoC relates to the
+earlier PostgreSQL 12.2 shadow prototype.
+
+## 1. Repository Layout
+
+The public repository is intended to keep both the old prototype and the
+current master-port implementation in one place, separated by Git branches
+rather than by copying source trees into subdirectories.
+
+Repository:
+
+- `https://github.com/nayishan/postgre_umbra`
+
+Expected branch roles:
+
+- `umbra-poc-pgmaster`
+  - PostgreSQL master-based Umbra PoC
+  - full implementation branch intended for community reading
+  - includes MAP metadata, WAL/redo integration, mapwriter, compactor, tests,
+    and documentation
+- `shadow-pg12-archive`
+  - archived PostgreSQL 12.2 shadow prototype
+  - useful for understanding the original minimal idea without the full
+    master-port integration burden
+
+The important rule should remain:
+
+- one repository
+- separate branches
+- clear README navigation
+- no mixing PostgreSQL 12.2 prototype files into the PostgreSQL master branch
+
+## 2. Why Keep The Prototype
+
+The PostgreSQL 12.2 shadow prototype is useful because it shows the original
+idea with fewer host-tree integration details.
+
+It is not a substitute for the master PoC, but it helps answer questions such
+as:
+
+- what is the minimal logical-to-physical mapping idea?
+- why does Umbra live below upper PostgreSQL logical block addressing?
+- how did the MAP state-machine idea evolve?
+- which parts are core design and which parts are master-port engineering?
+
+The master PoC is much larger because it must deal with:
+
+- current `smgr` boundaries
+- WAL block registration
+- redo paths
+- checkpoint/writeback
+- relation lifecycle
+- skip-WAL relations
+- background maintenance
+- TAP recovery tests
+
+## 3. How To Read The Two Branches
+
+Read the branches in this order if the goal is to understand the design:
+
+1. Read the repository `README.md` on `umbra-poc-pgmaster`.
+2. Read `doc/umbra/ARCHITECTURE.md` for the current module boundaries.
+3. Read `doc/umbra/WAL_AND_REDO.md` for the WAL and recovery model.
+4. Read `doc/umbra/REVIEW_GUIDE.md` for suggested review entry points.
+5. Read `doc/umbra/UMBRA_FPW_STORY_ZH.md` if the Chinese design story is useful
+   context.
+6. If anything is still unclear, go back to the shadow prototype for the
+   minimal mapping idea.
+
+The prototype should be treated as background material.  The master PoC is the
+branch that should be used for current testing and review.
+
+## 4. Development Transparency
+
+The original design direction, boundary choices, and state-machine reasoning
+come from the author.  The PostgreSQL 12.2 shadow prototype was used as an
+important reference while building the master PoC.
+
+The master-port implementation also used AI coding assistance extensively for
+repetitive implementation work and for code shaped after both the prototype
+and existing PostgreSQL subsystems.  That assistance was not sufficient to
+reason independently about database-kernel concurrency, WAL ordering, or
+recovery correctness.  The difficult part was repeatedly checking the logic,
+finding incorrect assumptions, and correcting the implementation.
diff --git a/doc/umbra/PROTOTYPE_ZH.md b/doc/umbra/PROTOTYPE_ZH.md
new file mode 100644
index 0000000000..2dafc78efd
--- /dev/null
+++ b/doc/umbra/PROTOTYPE_ZH.md
@@ -0,0 +1,74 @@
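+# Umbra Prototype and Repository Navigation
+
+This document is the Chinese companion version of `PROTOTYPE.md`.  It explains
+how the current Umbra prototype on PostgreSQL master relates to the earlier
+PostgreSQL 12.2 shadow prototype.
+
+## 1. Repository Structure
+
+The recommendation is to keep the early prototype and the current prototype in
+the same GitHub repository, isolated by branch.
+
+Repository:
+
+- `https://github.com/nayishan/postgre_umbra`
+
+Suggested branches:
+
+- `umbra-poc-pgmaster`
+  - full Umbra prototype based on PostgreSQL master;
+  - intended for community reading and testing;
+  - includes MAP metadata, WAL / redo, mapwriter, compactor, tests, and
+    documentation.
+- `shadow-pg12-archive`
+  - archive branch for the PostgreSQL 12.2 shadow prototype;
+  - useful for understanding the original core mapping idea.
+
+The principles are:
+
+- one repository;
+- different branches for physical isolation;
+- no PG12 prototype files mixed into the master prototype branch;
+- clear navigation from the root `README`.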
+
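+## 2. Why Keep the Prototype
+
+The value of the PG12 shadow prototype is that it shows the minimal logic:
+
+- why logical-to-physical block mapping is done below `smgr`;
+- what the original MAP state machine looked like;
+- which parts are core design;
+- which parts are engineering complexity that only appeared after the move to
+  PostgreSQL master.
+
+The master prototype is more complex because it must handle:
+
+- current `smgr` boundaries;
+- WAL block registration;
+- redo;
+- checkpoint / writeback;
+- relation lifecycle;
+- skip-WAL relations;
+- background maintenance;
+- recovery TAP.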
+
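+## 3. Reading Order
+
+The suggested reading order is:
+
+1. first read the repository root `README.md` on the `umbra-poc-pgmaster`
+   branch;
+2. once the document set has been imported into that branch, read
+   `UMBRA_FPW_STORY_ZH.md` for the fuller design-evolution narrative;
+3. once the document set has been imported into that branch, read
+   `ARCHITECTURE.md` for the module boundaries;
+4. once the document set has been imported into that branch, read
+   `WAL_AND_REDO.md` for the owner model behind correctness;
+5. once the document set has been imported into that branch, read
+   `REVIEW_GUIDE.md` to find the code entry points;
+6. if anything is still unclear, go back to the shadow prototype for the
+   minimal mapping idea.
+
+The prototype is background material; the master prototype is the current
+target for testing and review.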
+
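+## 4. Implementation Transparency
+
+The core architecture, boundary choices, and state-machine reasoning come from
+the author; the PG12 shadow prototype was also an important reference for the
+master prototype.
+
+The `master-port` implementation also made extensive use of AI coding
+assistants for repetitive implementation and porting work, and organized code
+with reference to the prototype and existing PostgreSQL implementations.  AI,
+however, cannot independently understand database-kernel concurrency, WAL
+ordering, or recovery correctness; the truly difficult part was continually
+reviewing the logic, identifying incorrect assumptions, and correcting the
+implementation.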
diff --git a/doc/umbra/REVIEW_GUIDE.md b/doc/umbra/REVIEW_GUIDE.md
new file mode 100644
index 0000000000..0c5f8bf803
--- /dev/null
+++ b/doc/umbra/REVIEW_GUIDE.md
@@ -0,0 +1,210 @@
+# Umbra Review Guide
+
+This note is for reviewers and maintainers reading the Umbra PostgreSQL master
+PoC patch series. It does not replace the architecture and WAL/redo documents.
+It describes how to read the patch and which invariants should be checked
+first.
+
+## 1. Patch Shape
+
+The patch is not intended to hide subsystem boundaries.  The main review units
+are:
+
+- build flag and storage-manager dispatch
+- internal metadata fork and physical file layer
+- MAP buffer, superblock, in-flight owner, and write-barrier subsystem
+- WAL block-header remap encoding
+- redo-time remap interpretation
+- skip-WAL dense-map bootstrap
+- background mapwriter/mapcompactor maintenance
+- recovery and regression tests
+
+For a line-by-line review, it is usually better to read the patch by those
+units rather than by file order.
+
+## 2. What Umbra Changes
+
+Umbra keeps PostgreSQL's upper-layer logical block addressing.  The storage
+manager translates mapped forks from logical block numbers to physical block
+numbers underneath.
+
+The mapped forks are:
+
+- `MAIN_FORKNUM`
+- `FSM_FORKNUM`
+- `VISIBILITYMAP_FORKNUM`
+
+The persistent mapping state lives in an internal metadata fork owned by
+Umbra.  That metadata fork is not a normal PostgreSQL page fork and must not
+enter generic shared-buffer, full-page-image, checksum, or page-LSN paths.
+
+## 3. Where to Start Reading
+
+Start with:
+
+- `src/backend/storage/smgr/smgr.c`
+  - storage-manager dispatch and the `--with-umbra` boundary
+- `src/backend/storage/smgr/umbra.c`
+  - runtime access policy and logical-to-physical translation
+- `src/backend/storage/smgr/umfile.c`
+  - physical file operations below Umbra
+- `src/backend/storage/map/`
+  - MAP metadata, reservations, writeback, and background work
+- `src/backend/access/transam/xloginsert.c`
+  - producer-side remap decisions and header encoding
+- `src/backend/access/transam/xlogreader.c`
+  - remap header parsing
+- `src/backend/access/transam/xlogutils.c`
+  - redo-time remap interpretation
+- `src/backend/access/transam/umbra_xlog.c`
+  - Umbra rmgr records
+
+## 4. Core Correctness Invariants
+
+Review these invariants before focusing on micro-optimizations:
+
+- WAL publication must precede committed MAP publication.
+- A pending reservation chooses a physical block but does not publish the
+  logical-to-physical mapping or committed allocator frontier.
+- First-born pages publish logical EOF explicitly through WAL-owned remap or
+  range remap state.
+- Ordinary remap-without-image redo consumes the old physical baseline before
+  publishing the new physical mapping.
+- `next_free_pblkno` is the committed allocator frontier, not necessarily
+  `new_pblkno + 1`.
+- The runtime reservation frontier lives in `MapSuperEntry` shared state, not
+  in the on-disk superblock.
+- committed `next_free_pblkno <= reservation frontier` must hold under the
+  shared-entry lock and should be asserted in the implementation.
+- MAP superblock logical EOF, materialized physical frontier, and allocator
+  frontier are separate facts.
+- Checkpoint and mapwriter write existing MAP metadata blocks; they do not
+  materialize missing MAP blocks during flush.
+- Redo owns redo-only metadata bootstrap for mapped forks.
+
+## 5. WAL Review Checklist
+
+Umbra has two WAL-visible mechanisms.
+
+Block-reference remap metadata is attached to ordinary WAL records with
+`BKPBLOCK_HAS_REMAP`:
+
+- full remap header:
+  - `old_pblkno`
+  - `new_pblkno`
+  - `logical_nblocks`
+  - `next_free_pblkno`
+- compact birth header:
+  - `new_pblkno`
+  - `logical_nblocks`
+  - `next_free_pblkno`
+- ordinary slim header:
+  - `old_pblkno`
+  - `new_pblkno`
+  - `next_free_pblkno`
+
+Umbra rmgr records are separate lifecycle records:
+
+- `XLOG_UMBRA_MAP_SET`
+- `XLOG_UMBRA_RANGE_REMAP`
+- `XLOG_UMBRA_RANGE_REMAP_COMPACT`
+- `XLOG_UMBRA_SKIP_WAL_DENSE_MAP`
+- `XLOG_UMBRA_RECLAIM_UNLINK`
+
+The important review point is that these are complementary, not substitutes.
+Block-header remap is for replaying ordinary WAL block content against the
+right physical baseline.  Umbra rmgr records are for explicit MAP lifecycle
+events outside the ordinary block-reference owner.
+
+## 6. Full-Page Image Boundaries
+
+Umbra does not globally disable full-page writes.
+
+It replaces the ordinary checkpoint-boundary image path with remap metadata
+when the record is eligible for automatic remap.  Images are still kept when
+the caller explicitly owns an image or when consistency checking requires one.
+
+Known conservative cases:
+
+- `REGBUF_FORCE_IMAGE` keeps image semantics.
+- `XLR_CHECK_CONSISTENCY` keeps verification images.
+- `XLOG_FPI_FOR_HINT` keeps the PostgreSQL hint-image rule and does not use
+  Umbra remap today.
+
+That last point is deliberate.  Hint-bit FPI optimization would require a
+separate checksum/torn-page protection design; it is not a header encoding
+optimization.
+
+## 7. Skip-WAL Dense Map
+
+Skip-WAL relations are handled as a dense physical build while the relation is
+still in the skip-WAL pending window.
+
+The WAL anchor is `XLOG_UMBRA_SKIP_WAL_DENSE_MAP`.  For each encoded fork it
+means:
+
+- `[0, nblocks)` is dense
+- `pblk == lblk` in that range
+- `logical_nblocks = nblocks`
+- `physical_nblocks = nblocks`
+- `next_free_pblkno = nblocks`
+
+The record does not encode empty forks.  An entry with `nblocks == 0` has no
+mapping work and should not be produced.
+
+The record is not a data-file fsync replacement and is not a generic
+`MAP_SUPER_INIT`.  The existing skip-WAL sync protocol still owns durability;
+the dense-map record gives redo an exact mapping/frontier anchor.
+
+## 8. What Is Intentionally Not Solved Here
+
+The current patch does not try to solve every possible WAL byte optimization.
+
+It intentionally does not add:
+
+- a tiny birth header that drops both frontier fields
+- per-block remap variant tags inside mixed records
+- remap optimization for checksum-driven hint FPIs
+- range relocation WAL for compactor moves
+- a default-on storage-manager behavior
+
+Those are separate follow-up designs.  The current patch favors deterministic
+ownership and reviewable replay semantics over maximum header compression.
+
+## 9. Test Baseline
+
+The current correctness baseline is the md/Umbra matrix below.  When switching
+between modes in the same source tree, clean the previous build first.
+
+```sh
+make distclean
+./configure
+make
+make check
+make -C src/test/recovery check
+
+make distclean
+./configure --with-umbra
+make
+make check
+make -C src/test/recovery check
+```
+
+Umbra-only recovery tests are expected to skip in md mode and run in
+`--with-umbra` mode.
+
+The torn-page remap test is especially important:
+
+- `src/test/recovery/t/074_umbra_torn_page_remap.pl`
+  - md negative control with `full_page_writes=off`
+  - Umbra positive recovery path with `full_page_writes=on`
+  - recovery verification uses an ordered relation digest, not just row count
+
+## 10. Longer Reference Material
+
+Reviewers should also read:
+
+- [ARCHITECTURE.md](./ARCHITECTURE.md)
+- [WAL_AND_REDO.md](./WAL_AND_REDO.md)
+- [PROTOTYPE.md](./PROTOTYPE.md)
+- [UMBRA_FPW_STORY_ZH.md](./UMBRA_FPW_STORY_ZH.md)
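
The frontier invariants in section 4 of the review guide above can be illustrated with a small C sketch. It is not taken from the patch: `SketchSuperEntry`, `sketch_reserve_pblk`, and `sketch_publish` are hypothetical names, and treating the committed frontier as a simple high-water mark is an assumption made only for this illustration. The sketch shows a pending reservation advancing the runtime frontier without touching the committed `next_free_pblkno`, and the `committed <= reservation` check being asserted at each step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;	/* same width as PostgreSQL's BlockNumber */

/* Hypothetical stand-in for the runtime shared entry; not the patch's struct. */
typedef struct SketchSuperEntry
{
    BlockNumber committed_next_free;	/* mirrors the on-disk next_free_pblkno */
    BlockNumber reservation_frontier;	/* runtime only, never flushed to disk */
} SketchSuperEntry;

/* Invariant from the review guide: committed next_free <= reservation frontier. */
static bool
sketch_frontiers_ok(const SketchSuperEntry *e)
{
    return e->committed_next_free <= e->reservation_frontier;
}

/*
 * Pending reservation: choose a physical block and advance only the runtime
 * frontier.  Nothing is published; the committed frontier is untouched.
 */
static BlockNumber
sketch_reserve_pblk(SketchSuperEntry *e)
{
    BlockNumber pblk = e->reservation_frontier++;

    assert(sketch_frontiers_ok(e));
    return pblk;
}

/*
 * Publication (assumed to happen after the WAL insert): raise the committed
 * frontier so that it covers new_pblk.  It never moves backwards, so it is
 * not necessarily new_pblk + 1.
 */
static void
sketch_publish(SketchSuperEntry *e, BlockNumber new_pblk)
{
    if (new_pblk + 1 > e->committed_next_free)
        e->committed_next_free = new_pblk + 1;
    assert(sketch_frontiers_ok(e));
}
```

With two reservations taking blocks 8 and 9 and the holder of block 9 publishing first, the committed frontier becomes 10 and stays at 10 when block 8 is published later. That is one way `next_free_pblkno` can differ from `new_pblkno + 1` in this simplified model.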
diff --git a/doc/umbra/REVIEW_GUIDE_ZH.md b/doc/umbra/REVIEW_GUIDE_ZH.md
new file mode 100644
index 0000000000..b1de862a2e
--- /dev/null
+++ b/doc/umbra/REVIEW_GUIDE_ZH.md
@@ -0,0 +1,133 @@
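+# Umbra Review Guide (Chinese Edition)
+
+This document is the Chinese companion to `REVIEW_GUIDE.md`.  It explains what to read first
+and what to focus on when reading the current Umbra prototype patch series.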
+
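+## 1. What to Focus On in the Patch
+
+The points most worth attention during review are:
+
+- whether the architecture boundaries are reasonable;
+- whether the `smgr` integration is clear;
+- whether the ownership boundary of MAP metadata is clear;
+- whether the WAL / remap ownership model is correct;
+- whether redo is deterministic;
+- whether the checkpoint / writeback boundary is correct;
+- whether the tests cover the core risks.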
+
+## 2. Suggested Reading Order
+
+Start with these files:
+
+- `src/backend/storage/smgr/smgr.c`
+  - `storage manager` dispatch logic;
+  - the `--with-umbra` boundary.
+- `src/backend/storage/smgr/umbra.c`
+  - runtime access policy;
+  - logical-to-physical block translation.
+- `src/backend/storage/smgr/umfile.c`
+  - physical file layer;
+  - segment files, sync, unlink, and dense/sparse semantics.
+- `src/backend/storage/map/`
+  - MAP metadata, buffers, superblock, flushing, and background work.
+- `src/backend/access/transam/xloginsert.c`
+  - remap decisions on the WAL producer side.
+- `src/backend/access/transam/xlogreader.c`
+  - remap header parsing.
+- `src/backend/access/transam/xlogutils.c`
+  - redo-time interpretation of remap.
+- `src/backend/access/transam/umbra_xlog.c`
+  - Umbra's own rmgr records.
+
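+## 3. Core Correctness Invariants
+
+Review these invariants first:
+
+- WAL publication must precede committed MAP publication;
+- a pending reservation only chooses a physical block; it does not publish a committed mapping;
+- first-born pages must publish logical EOF explicitly;
+- remap redo without an image must first consume the old physical baseline;
+- `next_free_pblkno` is the committed allocator frontier;
+- the runtime reservation frontier exists only in shared memory;
+- committed `next_free_pblkno <= reservation frontier`;
+- logical EOF, physical capacity, and allocator frontier are separate facts;
+- checkpoint / mapwriter only write MAP metadata blocks that already exist;
+- redo owns the metadata bootstrap that is needed only during recovery.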
+
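+## 4. Full-Page Image Boundaries
+
+Umbra does not globally disable full-page writes.
+
+It only replaces the default image path with remap metadata in the ordinary
+checkpoint-boundary case, when the record is eligible.
+
+The conservative boundaries are:
+
+- `REGBUF_FORCE_IMAGE` keeps the image;
+- `XLR_CHECK_CONSISTENCY` keeps the verification image;
+- `XLOG_FPI_FOR_HINT` does not use Umbra remap today.
+
+Hint-bit FPI optimization needs a separate checksum / torn-page protection design and should not be
+mixed into the current header-encoding optimization.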
+
+## 5. Skip-WAL Dense Map
+
+Skip-WAL relations are handled as a dense physical layout while inside the pending window.
+
+`XLOG_UMBRA_SKIP_WAL_DENSE_MAP` means:
+
+- `[0, nblocks)` is dense;
+- `pblk == lblk`;
+- `logical_nblocks = nblocks`;
+- `physical_nblocks = nblocks`;
+- `next_free_pblkno = nblocks`.
+
+The record is not a replacement for `fsync`; it is only the mapping / frontier anchor for redo.
+
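+## 6. What Is Not Solved Here
+
+The current patch does not try to solve every WAL byte-count optimization.
+
+Explicitly out of scope for now:
+
+- a smaller birth header;
+- an independent variant tag for each block in mixed records;
+- checksum-based hint FPI remap optimization;
+- more aggressive compactor range relocation WAL;
+- a default-on storage manager;
+- complete production-grade space management.
+
+The current priority is deterministic ownership boundaries and reviewable replay semantics.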
+
+## 7. Test Baseline
+
+The full correctness matrix is below.  When switching build modes in the same source tree, clean the previous build first.
+
+```sh
+make distclean
+./configure
+make
+make check
+make -C src/test/recovery check
+
+make distclean
+./configure --with-umbra
+make
+make check
+make -C src/test/recovery check
+```
+
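+Key tests include:
+
+- `src/test/recovery/t/074_umbra_torn_page_remap.pl`
+  - acts as the negative control in md mode;
+  - in Umbra mode, verifies that recovery still succeeds even if the new physical page after remap is damaged;
+  - after recovery, checks an ordered relation digest, not just row count.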
+
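+## 8. What Review Conclusions Should Address
+
+- feedback on the architecture;
+- feedback on the WAL / remap ownership model;
+- feedback on redo correctness;
+- feedback on the `smgr` boundary;
+- feedback on the checkpoint / writeback boundary.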
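
The torn-page remap test in section 7 above relies on the invariant that remap redo without an image consumes the old physical baseline before publishing the new mapping. The following C sketch illustrates that ordering under simplifying assumptions. It is not the patch's redo code: `SketchStore`, `SketchDelta`, and `sketch_redo_remap` are hypothetical, the page is 8 bytes, and the WAL delta is a single byte. Only the three header field names follow the ordinary slim header listed in the guide.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t BlockNumber;	/* same width as PostgreSQL's BlockNumber */

#define SKETCH_PAGE_SIZE 8
#define SKETCH_NPHYS     4
#define SKETCH_NLOGICAL  2

/* Toy store: physical pages plus an lblk -> pblk map.  Not Umbra's layout. */
typedef struct SketchStore
{
    uint8_t     phys[SKETCH_NPHYS][SKETCH_PAGE_SIZE];
    BlockNumber map[SKETCH_NLOGICAL];
    BlockNumber next_free_pblkno;
} SketchStore;

/* Field names follow the "ordinary slim header" in the review guide. */
typedef struct SketchSlimRemap
{
    BlockNumber old_pblkno;
    BlockNumber new_pblkno;
    BlockNumber next_free_pblkno;
} SketchSlimRemap;

/* Hypothetical one-byte WAL delta standing in for a real record payload. */
typedef struct SketchDelta
{
    uint32_t    offset;
    uint8_t     value;
} SketchDelta;

static void
sketch_redo_remap(SketchStore *s, BlockNumber lblk,
                  const SketchSlimRemap *hdr, const SketchDelta *delta)
{
    uint8_t     page[SKETCH_PAGE_SIZE];

    /* 1. Consume the old physical baseline; new_pblkno is never read. */
    memcpy(page, s->phys[hdr->old_pblkno], SKETCH_PAGE_SIZE);

    /* 2. Apply the WAL delta on top of that baseline. */
    page[delta->offset] = delta->value;

    /* 3. Write the result to the new location, then publish the mapping. */
    memcpy(s->phys[hdr->new_pblkno], page, SKETCH_PAGE_SIZE);
    s->map[lblk] = hdr->new_pblkno;
    s->next_free_pblkno = hdr->next_free_pblkno;
}
```

Because step 1 never reads `new_pblkno`, whatever a crash left in the new physical page (for example a torn write) does not influence the replayed contents in this model, and the old physical page is left unmodified.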
diff --git a/doc/umbra/UMBRA_FPW_STORY.md b/doc/umbra/UMBRA_FPW_STORY.md
new file mode 100644
index 0000000000..d15750cf15
--- /dev/null
+++ b/doc/umbra/UMBRA_FPW_STORY.md
@@ -0,0 +1,708 @@
+# Umbra FPW-to-Remap Design Story
+
+[Chinese](./UMBRA_FPW_STORY_ZH.md)
+
+## 1. Background: Which FPW Cost Umbra Targets
+
+PostgreSQL currently relies on full-page writes (FPW) for crash-recovery
+correctness.  The basic rule is that after a checkpoint, the first update to a
+page logs a full-page image into WAL and uses that image as the new recovery
+baseline.
+
+Umbra does not try to remove every full-page image.  The current implementation
+targets the ordinary checkpoint-boundary image path.  For ordinary data-page WAL
+records that satisfy the automatic remap conditions, Umbra replaces the default
+full-page image with remap-aware recovery metadata.  The following conservative
+paths still keep image ownership:
+
+- `REGBUF_FORCE_IMAGE`
+- `XLR_CHECK_CONSISTENCY`
+- `XLOG_FPI_FOR_HINT`
+
+So the more accurate goal is:
+
+- do not claim that all FPIs disappear
+- provide an alternative recovery-baseline representation for the ordinary
+  checkpoint-boundary case
+
+This is worth exploring because the stock `md` ordinary checkpoint-boundary
+image path repeatedly binds several costs together:
+
+- the first-dirty path after a checkpoint introduces an extra owner path, but
+  the more important point is the I/O cost behind it
+- WAL grows substantially because of full-page images, increasing WAL write and
+  sync pressure
+- data-file write amplification and WAL-side write amplification stack together
+- for update-heavy workloads, this I/O pressure repeats in every checkpoint
+  interval
+
+Umbra is not a local tweak to that path.  It tries to take over the path with a
+different way to express the recovery baseline.
+
+## 2. Current Scope, Non-Goals, and Terminology
+
+This section fixes the scope, non-goals, and terminology.
+
+Current scope:
+
+- use remap-based recovery metadata to take over the default ordinary
+  checkpoint-boundary FPW image path
+- discuss a PostgreSQL `storage manager` / physical storage-layer prototype,
+  not a new table AM or a general storage engine
+- use `P1-P9` to describe the semantic split of the current PoC branch, not to
+  describe an exact one-to-one mapping between arbitrary working branches and
+  patch numbers
+
+Current non-goals and conservative boundaries:
+
+- do not claim to remove every full-page image; `REGBUF_FORCE_IMAGE`,
+  `XLR_CHECK_CONSISTENCY`, and `XLOG_FPI_FOR_HINT` still keep image ownership
+- do not claim that compactor, AIO, primary/standby physical-page alignment,
+  `CREATE DATABASE` copy strategy, or explicit range-born protocol are fully
+  engineered and closed
+- do not treat `md + fpw=off` as a correctness-equivalent baseline
+
+Terms used throughout this document:
+
+- `birth`: the first durable `lblk -> pblk` relation for a logical page
+- `remap`: moving an existing logical page to a new physical page
+- `mapset`: direct publication of one mapping relation
+- ordinary checkpoint-boundary FPW: the ordinary first-dirty-after-checkpoint
+  path that defaults to a page image
+- reclaim boundary: the physical boundary up to which reclaim / unlink may
+  safely advance
+- `compactor`: the background organizer that scans sparse regions and moves
+  still-live pages away
+- `reclaim`: the lifecycle action that enters the safe unlink path once a
+  physical region is confirmed to have no live mappings
+
+## 3. Design Boundary: Storage Metadata and WAL Own the Recovery Core
+
+Umbra is deliberately narrow in scope.  It does not try to rewrite PostgreSQL's
+execution layer or spread changes into large upper-layer abstractions.
+
+More precisely, the current implementation is not a new table AM and not a
+standalone general-purpose storage engine.  It is closer to a prototype at the
+PostgreSQL `storage manager` / physical storage layer.  Upper layers still use
+logical block numbers; below `smgr`, Umbra provides `lblk -> pblk` translation
+for mapped forks.
+
+From the crash-recovery perspective, the correctness core converges on two
+layers:
+
+- storage metadata
+- WAL
+
+Storage metadata has two object types:
+
+- per-page map entry
+- fork-level superblock
+
+The "crash-recovery core" here does not mean that only map entries, the
+superblock, and WAL participate in recovery.  It means that after a crash, redo
+needs three kinds of minimal durable truth to restore correct page contents:
+
+- map entry: which physical block a logical block should currently map to
+- superblock: fork-level boundary state, such as logical EOF, physical capacity,
+  and committed allocator frontier
+- WAL: which map-entry, superblock, and physical-page lifecycle changes were
+  atomically published, and in what order redo must replay them
+
+As long as those three facts stay consistent in redo, an ordinary remap update
+can be recovered as "old physical page plus WAL delta", without using a
+checkpoint-boundary full-page image as the recovery baseline.
+
+Runtime concurrency correctness has one more explicit mechanism:
+
+- inflight claim / barrier
+
+This mechanism is not WAL encoding and not durable truth in the recovery log.  It
+serializes publication order among foreground remap, background compactor
+relocation, and physical writes, so that the durable truth later written to WAL
+is itself valid.  A more precise split is:
+
+- durable truth for crash recovery mainly comes from `map entry + superblock +
+  WAL`
+- runtime concurrency correctness also explicitly depends on inflight / barrier
+
+## 4. Map Entry: Page-Level Mapping Truth
+
+Umbra's core abstraction is to split logical page identity from physical
+placement.
+
+In this model:
+
+- the logical page is the data-page identity that upper layers care about
+- the physical page is only the on-disk location currently carrying that logical
+  page
+- the map entry records the current `lblk -> pblk` relation
+
+Therefore, the map entry tells us which physical page currently stores one
+logical page.  It is the page-level local truth.
+
+## 5. Superblock: Fork-Level Global Truth
+
+A per-page map entry is not enough.  Many correctness properties are not
+expressible by one mapping alone; they depend on the boundary state of the
+whole fork.  The superblock owns that state.
+
+The superblock is Umbra's fork-level correctness anchor.  It maintains at least
+these facts:
+
+- logical boundary of the fork, such as `logical_nblocks`
+- committed physical allocation boundary, such as `next_free_pblkno`
+- materialized physical capacity
+- the safe boundary for reclaim / unlink
+- fork-level frontier facts needed by redo
+
+One easy-to-miss distinction is the runtime reservation frontier.  Runtime code
+also needs a reservation frontier in `MapSuperEntry` shared state to allocate
+new `pblk` values concurrently in foreground backends.  That frontier is not
+flushed to disk and does not participate in checkpoint.  The on-disk
+`next_free_pblkno` in the superblock only represents the committed frontier.
+The implementation should maintain and assert:
+`committed next_free <= reservation frontier`.
+
+In short:
+
+- map entry owns page-level mapping truth
+- superblock owns global fork-boundary truth
+
+Many extend, truncate, reclaim, unlink, and redo correctness properties
+ultimately depend on the superblock.
+
+## 6. WAL: Atomic Publication and Recoverability
+
+Storage metadata describes state, but state alone is not enough.  To take over
+the ordinary checkpoint-boundary image path, Umbra must publish and replay the
+following actions atomically:
+
+- birth
+- remap
+- mapset
+- related fork-level frontier changes, such as committed frontier, logical
+  size, and capacity
+
+That is why the WAL layer exists in this design.
+
+For the ordinary checkpoint-boundary path handled by Umbra, recovery
+correctness no longer depends on whether the record carries a full-page image.
+It depends on:
+
+- whether map entry and superblock state are correct
+- whether those state changes were recorded and replayed by WAL as atomic
+  events
+
+This must not be expanded into "Umbra no longer depends on images in any
+recovery path".  Conservative image owners still exist, and the paths listed
+above keep PostgreSQL's original image semantics.
+
+### 6.1 Lifecycle of the Old Physical Page Before Remap
+
+The old physical page in an ordinary remap is not a temporary page that can be
+immediately discarded or reused.  Before the remap is published, it is the
+committed physical baseline pointed to by the current map entry.  It is also
+the old baseline that no-image delta redo may need to read.
+
+The lifecycle is:
+
+- before remap publication, `old_pblk` is still the current durable mapping for
+  `lblk`; even if a backend has chosen `new_pblk`, `old_pblk` cannot be treated
+  as free space
+- after successful WAL insert, the remap record publishes the `old_pblk ->
+  new_pblk` transition as an atomic event; normal runtime state switches the map
+  entry to `new_pblk`
+- during crash recovery, for a no-image remap, redo first reads the old physical
+  page through `old_pblk`, applies the WAL delta on top of that old baseline,
+  and only then publishes `new_pblk`
+- after remap publication, `old_pblk` is no longer the current mapping of that
+  logical page, but it still does not enter foreground reuse; it only becomes a
+  candidate for background space cleanup, constrained by live mappings, reclaim
+  boundary, checkpoint, and redo semantics
+
+So Umbra does not turn "overwrite the old page" into "immediately reuse the old
+page".  It changes the recovery baseline: the old physical page remains
+available when a WAL record needs it, while the new physical page becomes the
+current mapping through remap.
+
+## 7. Foreground Policy: Allocate New Pages, Do Not Dispose of Old Pages
+
+If the goal is to make the ordinary checkpoint-boundary first-dirty path
+lighter, then immediate old-page reclaim, reusable-page search, and synchronous
+space cleanup should not be pushed back into the foreground path.
+
+The foreground tradeoff in Umbra is:
+
+- the foreground always takes a new physical page
+- the old physical page is not immediately reused in the foreground path; the
+  foreground only publishes the new mapping and does not dispose of the old page
+- the foreground does not perform immediate space cleanup
+
+That is, foreground allocation is closer to a monotonically advancing frontier.
+Old pages are not rewritten in the hot path, and the foreground does not tidy
+them up opportunistically.
+
+This does not mean the system never processes old pages.  It means the
+foreground does not own old-page disposal or long-term space convergence.  Later
+cleanup, reclaim, and unlink are background policy decisions.  Under high
+capacity pressure, the foreground may still trigger one-shot preallocation, but
+that does not move long-term cleanup back into the hot path.
+
+## 8. MAP Buffer and Mapwriter: Keep New Complexity in MAP Metadata
+
+Once logical pages are split from physical pages, MAP metadata becomes durable
+metadata in its own right.  It needs:
+
+- its own cache
+- its own I/O state
+- its own flush and extension maintenance path
+
+Therefore, in addition to PostgreSQL's existing data-page buffer pool, Umbra
+adds a buffer cache dedicated to MAP metadata.  This double-buffering shape
+cannot be completely removed, but the new buffering is mostly contained in the
+MAP metadata layer rather than spreading into a second generic data-page cache.
+
+More directly: `mapwriter` can be viewed as a MAP-metadata background writer
+modeled after PostgreSQL `bgwriter`, with one extra duty: physical
+preallocation for mapped forks near the low-water mark.  The current contract is
+closer to:
+
+- ordinary MAP pages are materialized on first dirty, not later by mapwriter or
+  checkpoint
+- checkpoint and mapwriter only flush ordinary MAP metadata blocks that already
+  exist
+- superblock flush is still owned by checkpoint
+- mapwriter owns ordinary MAP flush
+- mapwriter also owns background preallocation / physical capacity expansion
+- under high capacity pressure near the low-water mark, the foreground may
+  still perform one-shot preallocation
+
+Therefore, mapwriter can be understood as "`bgwriter` for MAP metadata plus a
+physical-capacity preallocator".  It is not the owner of logical EOF, not the
+owner of superblock checkpoint flush, and not a generic data-page writer.
+
+## 9. Compactor: Long-Term Space Convergence
+
+The benefit of monotonic foreground allocation is a simple hot path.  The cost
+is that physical layout becomes sparse over time.  Opportunistic reclaim alone
+is not enough for long-term space convergence, so a background process is needed
+to move live pages out of sparse extents / segments and eventually create
+conditions for reclaim and segment unlink.
+
+That process is the compactor.  It is not the crash-recovery core, but it is the
+background mechanism that makes "foreground only publishes new mappings and does
+not dispose of old pages" sustainable over time.
+
+Compactor and reclaim are not synonyms.  Compactor scans, chooses candidate
+extents, relocates live pages, and advances the reclaim boundary when
+conditions allow.  Reclaim is the later lifecycle action: only when a segment is
+below the reclaim boundary and has no live mapping references does the physical
+unlink get handed to the sync-request / checkpointer path.
+
+The current compactor's first goal is not "clean as aggressively as possible".
+It is "avoid interfering with the foreground".  It is a best-effort, bounded,
+back-off-on-contention background organizer, not an aggressive reclaim sweeper.
+
+At a high level, it works like this:
+
+- scan MAP and count live block density by extent to find low-live-ratio
+  candidate regions
+- only process regions below the reclaim boundary and explicitly avoid the
+  current physical tail
+- relocate live pages in candidate extents by switching their mappings to new
+  physical pages
+- after an extent / segment becomes empty, defer real reclaim / unlink to the
+  later queue
+
+From the non-interference perspective, the important constraints are:
+
+- when foreground allocation pressure rises, compactor skips the round instead
+  of competing with foreground work
+- each round processes only a bounded number of relations and relocation moves
+- hot-path locking uses conditional acquire heavily; if a superblock or MAP
+  buffer is busy, the current implementation tends to skip rather than wait
+- relocation commits only if the old mapping is still the current published
+  truth; if the foreground already won an update, compactor abandons that move
+- real segment unlink is not executed synchronously by compactor; it is deferred
+  through the reclaim / sync-request path
+
+The result is that compactor behaves more like "gently yielding to foreground
+work" than "maximizing background cleanup throughput".  Its first job is to
+avoid noticeably slowing foreground allocation and writes.
+
+## 10. Inflight / Barrier: Foreground-Background Migration Concurrency
+
+Once both foreground code and compactor can migrate the same logical page, the
+system needs shared state to describe that a migration for this logical block is
+already in progress.  That is the role of inflight / barrier.
+
+This explicit mechanism exists because compactor relocation is currently a raw
+physical copy, not a shared-buffer-aware copy.  Without extra serialization,
+foreground remap, background relocation, and physical writes could conflict
+around the same `lblk`.
+
+Inflight / barrier is not a space-management policy.  It is the concurrency
+control for migration publication:
+
+- prevent concurrent publication of multiple new mappings for the same `lblk`
+- make the loser wait for stable committed MAP truth instead of borrowing
+  someone else's owner-local target
+- reduce foreground/background conflicts to owner / claim / barrier semantics
+
+The more precise split is:
+
+- `map entry + superblock + WAL` define the durable state truth that crash
+  recovery must restore
+- inflight / barrier defines how runtime code safely publishes those state
+  changes
+
+The former answers "what must redo ultimately restore"; the latter answers "who
+may publish this change during concurrent execution".
+
+## 11. File Deletion and Segment Lifecycle
+
+File deletion in Umbra is not a normal unlink problem.  It is about when a
+segment lifecycle reaches a safe deletion boundary.
+
+There are at least two cases:
+
+- truncate / drop driven deletion
+- reclaim deletion triggered after compactor cleanup
+
+The stable external contract is not the complete pending-state rule set.  It is
+the following boundary:
+
+- the superblock maintains the reclaim boundary
+- compactor uses published live mappings and live-map scans to decide whether a
+  candidate region still has live pages, and moves those live pages away
+- reclaim registers later physical unlink only when a segment is below the
+  reclaim boundary and has no live mapping references
+- real physical unlink is deferred through PostgreSQL's sync-request /
+  checkpointer path
+- redo must accept this lifecycle boundary instead of deciding only from "is the
+  file empty now"
+
+Inflight / pending state still affects internal correctness, but it is better
+treated as an implementation detail rather than the main criterion to explain in
+community-facing text.  The key external points are:
+
+- unlink is not "delete when empty"
+- unlink is constrained by reclaim boundary, live mappings, checkpoint, and redo
+  semantics
+
+## 12. Current Verification Status
+
+The current PoC verification target is: this owner / recovery model is
+executable on the covered paths.  It is not a proof that all boundaries are
+exhaustively covered.
+
+The basic verification already includes:
+
+- `make check` in `md` mode
+- `src/test/recovery check` in `md` mode
+- `make check` in `Umbra` mode
+- `src/test/recovery check` in `Umbra` mode
+
+Umbra-specific recovery TAP coverage further covers topics directly related to
+this design story:
+
+- MAP superblock / map fork policy / mapwriter activity
+- truncate / remap / 2PC remap / skip-WAL dense map redo
+- reclaim / internal segment unlink / compactor relocation
+- range remap zeroextend / ordinary slim block remap / compact birth block remap
+
+These tests support this statement:
+
+- the PoC is no longer only a design sketch; on the currently covered paths, it
+  has a minimal compile / regression / recovery loop
+
+But the following points cannot be claimed as fully closed based only on the
+current tests:
+
+- stronger proof and coverage for primary/standby physical-page alignment when
+  checkpoint cadence differs
+- `CREATE DATABASE` copy strategy: `FILE_COPY` is supported; `WAL_LOG` is not
+  supported yet, and with the Umbra storage manager enabled it falls back to
+  `FILE_COPY`
+- explicit `range-born / batch mapping publish` owner model
+- dedicated verification for internal metadata fork / MAP fork crossing
+  `RELSEG_SIZE` segment boundaries, for example beyond `1GB`
+- more complete native AIO paths and stress tests with a stronger methodology
+
+## 13. Performance Observations
+
+This section provides directional performance signals for the current PoC.  It
+does not attempt to make a strict benchmark claim.  The methodology is still
+thin: it lacks complete hardware details, repeated runs, error / variance
+ranges, and ablation for individual mechanisms.  The data below is better read
+as a directional observation, not as a formal community performance conclusion.
+
+The safer performance story should not treat `md + fpw=off` as a
+correctness-equivalent baseline.  The fair default baseline is:
+
+- `md + fpw=on`
+
+The following mode is better used as a mechanism upper-bound / sensitivity point:
+
+- `md + fpw=off`
+
+On `master`, under the same workload, we compared three modes:
+
+- `md + fpw=on`
+- `md + fpw=off`
+- `Umbra + fpw=on`
+
+Common settings:
+
+- `checkpoint_timeout = 2min`
+- `max_wal_size = 20GB`
+- `shared_buffers = 50GB`
+- `logging_collector = on`
+- `runMins = 10`
+- `newOrderWeight = 45`
+- `paymentWeight = 43`
+- `deliveryWeight = 4`
+- `stockLevelWeight = 4`
+- `orderStatusWeight = 4`
+
+Raw throughput results are listed first.
+
+`checksum=off`
+
+| clients | `md + fpw=on` | `md + fpw=off` | `Umbra + fpw=on` |
+| --- | ---: | ---: | ---: |
+| 10 | 158709 | 154283 | 155781 |
+| 50 | 577005 | 626954 | 656353 |
+| 200 | 641899 | 981436 | 995635 |
+| 500 | 322660 | 943295 | 859058 |
+| 1000 | 275609 | 899631 | 729989 |
+
+`checksum=on`
+
+| clients | `md + fpw=on` | `md + fpw=off` | `Umbra + fpw=on` |
+| --- | ---: | ---: | ---: |
+| 10 | 155754 | 152025 | 150606 |
+| 50 | 601974 | 635597 | 650844 |
+| 200 | 621176 | 1015923 | 938311 |
+| 500 | 316950 | 972795 | 729801 |
+| 1000 | 282713 | 891770 | 674865 |
+
+For WAL volume, under the same transaction count, the ratio
+`WAL(md + fpw=on) / WAL(Umbra + fpw=on)` is:
+
+`checksum=on`
+
+| clients | `WAL(md + fpw=on) / WAL(Umbra + fpw=on)` |
+| --- | ---: |
+| 10 | 1.82 |
+| 50 | 2.11 |
+| 200 | 3.81 |
+| 500 | 4.58 |
+| 1000 | 4.87 |
+
+`checksum=off`
+
+| clients | `WAL(md + fpw=on) / WAL(Umbra + fpw=on)` |
+| --- | ---: |
+| 10 | 2.03 |
+| 50 | 2.51 |
+| 200 | 5.22 |
+| 500 | 6.90 |
+| 1000 | 6.55 |
+
+These numbers show more directly that under the same transaction count, Umbra
+does not only recover throughput lost to ordinary checkpoint-boundary FPW; it
+also substantially reduces the corresponding WAL-volume pressure.  The ratio
+grows with concurrency, except for a dip from 6.90 to 6.55 with `checksum=off`.
+
+From the raw numbers, `Umbra + fpw=on` shows clear and stable improvement over
+`md + fpw=on`:
+
+- with `checksum=off`, the improvement at 50 / 200 / 500 / 1000 clients is
+  about `+13.8% / +55.1% / +166.2% / +164.9%`
+- with `checksum=on`, the improvement at 50 / 200 / 500 / 1000 clients is about
+  `+8.1% / +51.1% / +130.3% / +138.7%`
+
+At 10 clients, all three results are close.  That looks more like low-concurrency
+noise or a region that FPW does not dominate.  The gap opens from 50 clients
+upward, where ordinary checkpoint-boundary I/O cost starts to accumulate
+repeatedly.
+
+Relative to the `md + fpw=off` reference, `Umbra + fpw=on` keeps pace at moderate
+concurrency but falls behind at high concurrency:
+
+- with `checksum=off`, Umbra matches or slightly exceeds `md + fpw=off` at 50 and
+  200 clients, but trails it by about 9% at 500 clients and about 19% at 1000
+  clients
+- with `checksum=on`, the shortfall appears earlier and is larger (about 8% at
+  200 clients and about 24-25% at 500 and 1000 clients), suggesting that Umbra
+  recovers much of the ordinary-FPW-related I/O cost but does not remove all of
+  the remaining overhead
+
+The current data supports only a qualitative conclusion:
+
+- Umbra recovers a large part of the throughput lost by `md + fpw=on` on the
+  ordinary FPW path; the safer interpretation is recovery of related I/O cost
+- this benefit is visible with both `checksum=on` and `checksum=off`
+- `md + fpw=off` should only be treated as a sensitivity reference for "where
+  the system upper bound might be if this FPW cost is removed", not as a
+  semantic peer baseline
+
+The data is not enough for fine-grained attribution, such as "which foreground
+hot path contributes exactly how much" or "which mechanism is the fixed main
+source of benefit".  What it does support is that a large part of the benefit
+comes from remap metadata taking over the ordinary checkpoint-boundary FPW path
+and thereby recovering the related I/O cost.
+
+Further attribution would require dedicated ablation and a fuller methodology
+for WAL write / sync pressure, data write amplification, preallocation, and
+other sub-mechanisms.
+
+## 14. Open Engineering Work
+
+The following items should be described as follow-up work, not as completed
+capabilities:
+
+1. `compactor` engineering: the framework exists, but background convergence
+   efficiency and directory-discovery cost control are not fully engineered.
+   The sparse-segment discovery cost problem should not be described as solved.
+   For the PoC, this is engineering follow-up and does not block the minimal
+   remap/recovery loop.
+2. `CREATE DATABASE` copy strategy: PostgreSQL's existing `FILE_COPY` directory
+   / file copy path is supported; `WAL_LOG`, which copies database contents
+   block-by-block and logs each block to WAL, is not supported yet.  With the
+   Umbra storage manager enabled, it falls back to `FILE_COPY`.  This is an
+   explicit limitation and should not be overclaimed.
+3. `superblock shared-entry replacement`: the current shape is still closer to
+   allocate/free than replacement/eviction.  In practice, capacity pressure is
+   mitigated by increasing `map_superblocks`.  This remains engineering
+   follow-up.
+4. `AIO` integration: the necessary adaptation exists, but it is not a complete
+   Umbra-native rewrite.  The async I/O side should not be described as fully
+   closed.
+5. `range-born / batch mapping publish`: there is no explicit upper-layer
+   interface yet, so the current implementation mainly uses conservative `smgr`
+   fallback.  Multi-block extension still depends on compatibility with older
+   AM/WAL ordering.
+6. `primary/standby physical-page alignment`: the issue is identified.  The
+   current implementation adds a stronger publication / flush constraint such
+   as `FlushOneBuffer()` on the local no-image remap redo path and has local
+   recovery coverage, but primary/standby physical-page alignment should not be
+   described as systematically closed.
+
+In one sentence: the current PoC has established the core correctness / recovery
+loop and already compiles, passes regression tests, and recovers on the covered
+paths, but it still carries explicitly marked host-tree follow-up work.
+
+## 15. Semantic Layers of the Current PoC
+
+This section only describes the semantic boundaries that each layer should own
+if the current PoC is organized as `P1-P9`.  It is not an exact mapping from
+arbitrary working branches to commit numbers, and it does not describe release
+cadence.
+
+The split should not be understood as a mechanical directory split.  It is a
+state-machine and owner-boundary split: earlier layers establish the minimal
+recoverable mechanism, while later layers add checkpoint, mapwriter, compactor,
+and other engineering capabilities.
+
+The purpose is for each layer to state which correctness owner or engineering
+boundary it introduces:
+
+- earlier layers should mostly establish base mechanisms, without mixing in
+  later engineering follow-up
+- later layers introduce WAL/redo, checkpoint, mapwriter, compactor, recovery
+  tests, and related capabilities
+- incomplete parts should be explicitly marked as follow-up rather than implied
+  as complete
+
+For the current PoC branch, a natural semantic split is `P1-P9`:
+
+- `P1`: establish the `smgr` implementation boundary, add the `--with-umbra`
+  choice point, and keep the ordinary `md` path unchanged
+- `P2`: introduce the `umfile` physical file layer and metadata storage
+  primitives; cover physical files, segments, create / unlink, read / write /
+  extend / truncate
+- `P3`: introduce the metadata disk format and identity-mapping bootstrap so
+  that metadata fork, superblock layout, and initial mapping state stand on
+  their own
+- `P4`: introduce the shared-memory MAP cache and checkpoint flush foundation,
+  so MAP metadata cache, materialization, dirty, and flush semantics stand on
+  their own
+- `P5`: introduce MAP access policy, logical-to-physical translation, and the
+  materialization contract, so `MAIN/FSM/VM` keep logical block numbers above
+  `smgr` while resolving `lblk -> pblk` below it
+- `P6`: introduce WAL records, mapped birth, and the redo state machine; build
+  the minimal WAL/redo owners for `MAP_SET`, truncate, metadata lifecycle, and
+  skip-WAL pending
+- `P7`: introduce ordinary remap, block-reference remap, and
+  checkpoint-boundary FPW replacement, closing the alternative representation
+  for the ordinary checkpoint-boundary image path
+- `P8`: add checkpoint / mapwriter writeback and physical preallocation, giving
+  clear owners to MAP metadata writeback, background preallocation, and one-shot
+  foreground preallocation under low-water pressure
+- `P9`: introduce the compactor framework and foreground non-interference
+  policy, converging inflight / barrier, reclaim, delayed unlink, and compactor
+  relocation into a background organization framework
+
+The point of this order is that earlier layers build the correctness owner
+model, while later layers handle engineering pressure.  Host-tree integration
+points such as `CREATE DATABASE` copy strategy, AIO, and primary/standby
+physical-page alignment should not be hidden as if the core mechanism already
+closed them.  They are better described as explicit follow-up.
+
+Thus, `P1-P9` expresses only the semantic boundary of the current PoC: which
+parts belong to the minimal correctness loop, which parts are engineering
+enhancements, and which parts are still compatibility fallback or follow-up.
+Tests and documentation should also belong to their related semantic layer,
+instead of being flattened into a generic "test / documentation layer".
+
+## 16. Summary
+
+Umbra does not claim that PostgreSQL no longer needs full-page images.  It also
+should not be broadly described as a new storage engine.  More accurately, it is
+a remap-based recovery-baseline representation for the ordinary
+checkpoint-boundary FPW path at the PostgreSQL `storage manager` / physical
+storage layer.  Its core is:
+
+- split logical page identity from physical placement in the storage layer
+- use map entries for page-level mapping truth
+- use the superblock for fork-level global truth
+- use WAL to publish state changes atomically and recoverably
+
+On top of that:
+
+- the foreground always allocates a new physical page and does not dispose of old
+  pages in the hot path
+- mapwriter mainly smooths MAP metadata in the background, rather than owning
+  all expansion
+- compactor owns long-term space convergence
+- inflight / barrier keeps foreground/background migration and physical writes
+  concurrency-safe
+- file deletion and segment lifecycle are constrained by reclaim boundary, live
+  mappings, checkpoint, and redo semantics
+
+The value of this design is not a single local optimization and not simply
+"turning FPW off".  It challenges the cost model bound to ordinary
+checkpoint-boundary FPW while trying to keep the crash-recovery semantics
+required by `md + fpw=on`.
+
+## Appendix: Implementation Transparency
+
+The implementation process should be transparent.  Umbra's core architecture,
+boundary definitions, and key state-machine reasoning come from the author's own
+design and prototyping work around PostgreSQL storage / WAL / recovery
+semantics.  The author also maintains the early `shadow` validation prototype:
+<https://github.com/nayishan/postgre_umbra/tree/shadow-pg12-archive>.
+
+To expand the prototype into the current PoC, the author used AI coding
+assistants such as Codex extensively for concrete implementation, boilerplate
+expansion, and local refactoring.  That work heavily depends on the prior logic
+analysis, the `shadow` prototype, and the shapes and call order of existing
+PostgreSQL implementation.
+
+The responsibility boundary is also important: core design, boundary definition,
+and key logic decisions are the author's responsibility; AI mainly accelerates
+tedious implementation details.  Current AI systems still cannot independently
+reason about database-kernel concurrency timing, owner models, or
+crash-recovery semantics.  Some areas may therefore still show style
+inconsistency or require further engineering convergence.  The current status is
+PoC, not a finished product with final host-tree polish.
diff --git a/doc/umbra/UMBRA_FPW_STORY_ZH.md b/doc/umbra/UMBRA_FPW_STORY_ZH.md
new file mode 100644
index 0000000000..f42a42a334
--- /dev/null
+++ b/doc/umbra/UMBRA_FPW_STORY_ZH.md
@@ -0,0 +1,500 @@
+# Umbra: Replacing the Ordinary Checkpoint-Boundary FPW Path with Remap (Chinese Edition)
+
+[English](./UMBRA_FPW_STORY.md)
+
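+## 1. Background: Which Part of the FPW Cost Umbra Challenges
+
+PostgreSQL currently relies on full-page writes (FPW) to guarantee crash-recovery
+correctness.  The basic approach is: the first time a page is modified after a
+checkpoint, the whole page image is written to WAL and used as the new recovery
+baseline.
+
+What the current Umbra implementation challenges is not "removing every
+full-page image globally", but the default image path at the ordinary checkpoint
+boundary.  For WAL records of ordinary data pages that meet the automatic remap
+conditions, Umbra replaces the default full-page image with remap-aware recovery
+metadata.  The following conservative paths still keep image semantics:
+
+- `REGBUF_FORCE_IMAGE`
+- `XLR_CHECK_CONSISTENCY`
+- `XLOG_FPI_FOR_HINT`
+
+Umbra's current goal is therefore more accurately stated as:
+
+- not to claim that "all FPW has disappeared"
+- but to provide another way to express the recovery baseline for the ordinary
+  checkpoint-boundary case
+
+This is worth trying because stock md's ordinary checkpoint-boundary image path
+really is bound to several recurring costs:
+
+- the first-dirty path after a checkpoint boundary introduces an extra owner
+  path, but what matters more is the I/O cost bound to it
+- WAL grows noticeably because of full-page images, which raises WAL write and
+  sync pressure
+- write amplification on the data-file side and on the WAL side stack up
+- under update-heavy workloads, this I/O pressure recurs in every checkpoint
+  interval
+
+Umbra's goal is not to fine-tune this path locally, but to take it over with a
+different way of expressing the recovery baseline.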
+
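+## 2. Current Scope, Non-Goals, and Terminology
+
+This section states the current scope, non-goals, and terminology up front.
+
+The current scope is:
+
+- the goal is to take over the default image path of ordinary
+  checkpoint-boundary FPW with remap-based recovery metadata
+- the subject is a prototype at the PostgreSQL `storage manager` / physical
+  storage layer, not a new table AM or a general-purpose "storage engine"
+- `P1-P9` in this document describes the semantic split of the current PoC
+  branch, not a one-to-one mapping between arbitrary working branches and patch
+  numbers
+
+The current non-goals, or conservatively retained boundaries, are:
+
+- no claim that all full-page images are removed; paths such as
+  `REGBUF_FORCE_IMAGE`, `XLR_CHECK_CONSISTENCY`, and `XLOG_FPI_FOR_HINT` keep
+  their image owner
+- compactor, AIO, primary/standby physical-page consistency, the
+  `CREATE DATABASE` copy path, and an explicit range-born protocol are not
+  described as capabilities that have already converged in engineering terms
+- `md + fpw=off` is not treated as a correctness-equivalent baseline
+
+The frequently used terms in this document are:
+
+- `birth`: a logical page obtains a durable `lblk -> pblk` relation for the
+  first time
+- `remap`: an existing logical page switches to a new physical page
+- `mapset`: a mapping is published directly
+- ordinary checkpoint-boundary FPW: the class of paths where the ordinary first
+  dirty after a checkpoint takes an image by default
+- reclaim boundary: the physical boundary up to which reclaim / unlink can
+  safely advance
+- `compactor`: the background organizer that scans sparse regions and moves
+  pages that are still live
+- `reclaim`: the lifecycle action that enters the safe delete / unlink path once
+  a range of physical space is confirmed to have no live mapping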
+
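+## 3. Design Boundary: the Crash-Recovery Core Converges on Storage Metadata and WAL
+
+Umbra's implementation scope is deliberately restrained.  It does not try to
+rewrite PostgreSQL's executor layer, and it does not want changes to spread into
+large higher-level abstractions.
+
+More precisely, the current implementation is not a new table AM, nor a
+general-purpose "storage engine" independent of PostgreSQL.  It is closer to a
+prototype at the PostgreSQL `storage manager` / physical storage layer.  Upper
+layers still use logical block numbers, while Umbra provides `lblk -> pblk`
+translation for mapped forks below `smgr`.
+
+From the crash-recovery point of view, its correctness core mainly converges on
+two layers:
+
+- storage metadata
+- WAL
+
+Storage metadata in turn consists of two kinds of objects:
+
+- per-page map entry
+- fork-level superblock
+
+"Crash-recovery core" here does not mean that only these three pieces take part
+in recovery.  It means that when redo must restore correct page contents after a
+crash, the minimal durable facts come from three kinds of information:
+
+- map entry: which physical block a logical block currently corresponds to
+- superblock: the global boundary state of the fork, such as logical EOF,
+  physical capacity, and the committed allocator frontier
+- WAL: which map entry / superblock / physical-page lifecycle changes have been
+  published atomically, and in what order redo must replay them
+
+As long as these three kinds of facts stay consistent during redo, an ordinary
+remap update can be recovered as "old physical page + WAL delta", without a
+checkpoint-boundary full-page image as the recovery baseline.
+
+For runtime concurrency correctness, however, the current implementation also
+explicitly depends on one additional mechanism:
+
+- inflight claim / barrier
+
+It is not part of the WAL encoding itself, and it is not durable truth in the
+recovery log.  It serializes the publication order among foreground remap,
+background compactor relocation, and physical writes, so that the durable facts
+later written to WAL actually hold.  The more accurate statement is therefore:
+
+- the durable facts of the crash-recovery core mainly come from
+  `map entry + superblock + WAL`
+- runtime concurrency correctness additionally depends on inflight / barrier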
+
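+## 4. Map Entry: Per-Page Mapping Truth
+
+Umbra's core abstraction is to separate logical page identity from physical
+placement.
+
+In this model:
+
+- a logical page represents the data-page identity that upper layers actually
+  care about
+- a physical page is only the on-disk location that currently holds the
+  contents of that logical page
+- the map entry records the current `lblk -> pblk` correspondence
+
+Which physical page a single page currently maps to is therefore given by its
+map entry.  It settles local truth at single-page granularity.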
+
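+## 5. Superblock: Fork-Level Global Truth
+
+Per-page map entries alone are not enough, because much of correctness cannot be
+expressed by a single mapping; it is determined by the boundary state of the
+whole fork.  That part is carried by the superblock.
+
+The superblock is Umbra's fork-level correctness anchor and maintains at least
+the following kinds of state:
+
+- the logical boundary of the fork, for example `logical_nblocks`
+- the committed physical allocation boundary, for example `next_free_pblkno`
+- the physical capacity state, that is, how far capacity has been materialized
+- the safe boundary up to which reclaim / unlink may advance
+- the fork-level frontier facts that redo needs to restore
+
+One easily confused point deserves emphasis: at runtime there is also a
+reservation frontier that exists only in `MapSuperEntry` shared state and is
+used to hand out new `pblk` values to concurrent foreground allocations.  This
+frontier is not written to disk and does not take part in checkpoints; the
+`next_free_pblkno` persisted in the superblock can only represent the committed
+frontier.  The implementation should satisfy, and explicitly assert,
+`committed next_free <= reservation frontier`.
+
+In other words:
+
+- the map entry owns per-page mapping truth
+- the superblock owns global boundary truth
+
+Much of the correctness around extend, truncate, reclaim, unlink, and redo
+ultimately depends on the superblock.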
+
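+## 6. WAL: Atomic Publication and Recoverability of Actions
+
+Storage metadata describes the state itself, but state alone is not enough.  To
+take over the ordinary checkpoint-boundary image path, Umbra must guarantee that
+the following actions can be published atomically and rebuilt consistently
+during redo:
+
+- birth
+- remap
+- mapset
+- the related fork-level state advances, such as committed frontier, logical
+  size, and capacity
+
+This is why the WAL layer exists.
+
+For the ordinary checkpoint-boundary path that Umbra takes over, recovery
+correctness no longer depends on "whether this record carries a full-page
+image", but on:
+
+- whether the state described by the map entry and superblock is correct
+- whether those state changes are recorded and replayed by WAL as atomic events
+
+This should not be stretched into "none of Umbra's recovery paths depend on
+images any more".  Conservative image owners still exist, and the paths listed
+above keep PostgreSQL's original image semantics.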
+
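+### 6.1 Lifecycle of the Old Physical Page Before Remap
+
+The old physical page in an ordinary remap is not a temporary page that can be
+dropped or reused immediately.  Before the remap happens, it is the committed
+physical baseline that the map entry currently points to, and it is also the
+old baseline that a no-image delta redo may need to read.
+
+The lifecycle can be understood in the following steps:
+
+- before the remap is published: `old_pblk` is still the current durable mapping
+  of `lblk`; even if the backend has already chosen `new_pblk`, it must not
+  treat `old_pblk` as a free page
+- after WAL insert succeeds: the remap record publishes the transition
+  `old_pblk -> new_pblk` as an atomic event; in normal operation the map entry
+  switches to `new_pblk`
+- during crash recovery: if this is a no-image remap, redo first reads the old
+  physical page through `old_pblk`, applies the WAL delta to that old baseline,
+  and then publishes `new_pblk`
+- after the remap is published: `old_pblk` is no longer the current mapping of
+  the logical page, but it does not enter the foreground reuse path either; it
+  can only become a candidate for background space organization, constrained
+  jointly by live mappings, the reclaim boundary, checkpoint, and redo semantics
+
+So Umbra does not change "overwrite the old page" into "reuse the old page
+immediately".  What it really changes is the recovery baseline: the old physical
+page is kept as a readable baseline for as long as a WAL record needs it, and
+the new physical page becomes the new current mapping through remap.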
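
The crash-recovery step of this lifecycle can be sketched with a toy in-memory store. Everything below is hypothetical (the `ToyStore` and `ToyDelta` types, the 8-byte page, the single-byte delta) and is not taken from the patch. It only illustrates the ordering the text describes: read the old baseline, apply the delta, then publish the new mapping. Writing the result to `new_pblk` before publication is an assumption of this sketch.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t BlockNumber;

#define TOY_PAGE_SIZE 8
#define TOY_NPAGES    8

/* Toy physical storage plus a flat lblk -> pblk map (hypothetical). */
typedef struct ToyStore
{
    unsigned char pages[TOY_NPAGES][TOY_PAGE_SIZE];
    BlockNumber   map[TOY_NPAGES];
} ToyStore;

/* Hypothetical WAL delta: overwrite one byte at `offset`. */
typedef struct ToyDelta
{
    unsigned      offset;
    unsigned char value;
} ToyDelta;

/* Replay a no-image remap: old baseline + delta becomes the new page. */
static void
redo_no_image_remap(ToyStore *s, BlockNumber lblk, BlockNumber old_pblk,
                    BlockNumber new_pblk, ToyDelta delta)
{
    unsigned char page[TOY_PAGE_SIZE];

    memcpy(page, s->pages[old_pblk], TOY_PAGE_SIZE);  /* read old baseline */
    page[delta.offset] = delta.value;                 /* apply WAL delta */
    memcpy(s->pages[new_pblk], page, TOY_PAGE_SIZE);  /* fill the new page */
    s->map[lblk] = new_pblk;                          /* publish last */
    /* old_pblk is left untouched; only background reclaim may retire it. */
}
```

Because the old page is never modified, replaying the same record twice from the same starting state yields the same result, which is the property the old baseline is kept for.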
+
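+## 7. Foreground Policy: the Critical Path Only Allocates New Pages and Does Not Dispose of Old Ones
+
+If the goal is to make the ordinary checkpoint-boundary first-dirty path lighter,
+work such as "reclaim the old page immediately, look for a reusable page,
+organize space synchronously" should not be pushed back into the foreground.
+
+Umbra's trade-off on the foreground path is:
+
+- the foreground always takes a new physical page
+- the old physical page is never reused immediately on the foreground path; the
+  foreground only publishes the new mapping and is not responsible for disposing
+  of the old page
+- the foreground is not responsible for immediate space organization
+
+That is, the foreground is closer to monotonically advancing frontier
+allocation.  Old pages are not picked up again for writing on the hot path, and
+the foreground does not tidy them up along the way.
+
+This does not mean the system never handles old pages.  It means the foreground
+does not carry old-page disposal or space convergence.  Whether and when to
+organize, reclaim, or unlink is decided by background policy.  Under high
+capacity pressure the foreground may still trigger a one-shot preallocation, but
+it does not pull long-term space organization back into the hot path.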
+
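+## 8. MAP Buffer and mapwriter: New Complexity Is Mostly Confined to the MAP Metadata Layer
+
+Once logical and physical pages are separated, MAP metadata becomes first-class
+durable metadata.  It needs:
+
+- its own cache
+- its own I/O state
+- its own flush and expansion maintenance paths
+
+Therefore, beside PostgreSQL's existing data-page buffer pool, Umbra adds a
+buffer cache dedicated to MAP metadata.  This double-buffering problem cannot be
+eliminated completely, but the added buffering is mostly confined to the MAP
+metadata layer rather than spreading into a second general-purpose data-page
+cache.
+
+Put more plainly: `mapwriter` can be seen as a MAP background writeback
+mechanism modeled on PostgreSQL's `bgwriter`, with one extra job that
+`bgwriter` does not have: background physical preallocation for mapped forks
+near the low-water mark.  The contract in the current code is closer to:
+
+- an ordinary MAP page is materialized at first dirty, rather than waiting for
+  mapwriter / checkpoint to create the physical block
+- checkpoint and mapwriter only flush ordinary MAP metadata blocks that "already
+  exist"
+- superblock flush is still owned by checkpoint
+- mapwriter owns ordinary MAP flush
+- mapwriter also owns background preallocation / physical capacity expansion
+- when low-water pressure is too high, the foreground may still do a one-shot
+  preallocation itself
+
+mapwriter can thus be understood as "`bgwriter` for the MAP metadata layer plus
+a physical capacity preallocator".  It is not the owner of logical EOF, it is
+not responsible for the checkpoint flush of the superblock, and it is not an
+ordinary data-page writer.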
+
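+## 9. compactor: Long-Term Space Convergence
+
+The benefit of monotonically advancing foreground allocation is a simple hot
+path; the cost is that the physical layout becomes sparse over time.
+Opportunistic reclamation alone is not enough for long-term space usage to
+converge, so a background process is needed to move live pages out of sparse
+extents / segments and eventually create the conditions for reclaim and segment
+unlink.
+
+That process is the compactor.  It is not part of the crash-recovery core
+itself, but it is the key background mechanism that makes the policy "the
+foreground only publishes new mappings and does not dispose of old pages"
+sustainable in long-term space usage.
+
+compactor and reclaim are not synonyms here.  The compactor's main actions are
+scanning, selecting candidate extents, relocating live pages, and advancing the
+reclaim boundary when conditions allow.  reclaim is the later lifecycle action:
+only when a segment is already below the reclaim boundary and is confirmed to
+have no live mapping references does it hand the actual physical unlink to the
+sync-request / checkpointer path.
+
+In the current implementation, however, the compactor's primary goal is not to
+"clean up as fast as possible" but to "avoid disturbing the foreground as far as
+possible".  More precisely, it is currently a best-effort, bounded,
+back-off-when-busy background organizer, not an aggressively reclaiming space
+sweeper.
+
+It currently works roughly in this order:
+
+- scan MAP first, count live-block density per extent, and find candidate
+  regions with a very low live ratio
+- only process regions that already lie below the reclaim boundary, and
+  explicitly avoid the current physical tail
+- relocate live pages in candidate extents by switching their mappings to new
+  physical pages
+- once an extent / segment has been emptied, defer the actual reclaim / unlink
+  to a later queue
+
+From the "do not disturb the foreground" angle, the most important constraints
+in the current implementation are:
+
+- when foreground allocation pressure rises, the compactor skips the round
+  instead of continuing to compete with the foreground for resources
+- each round processes only a limited number of relations and allows only a
+  limited number of relocation moves, rather than cleaning without bound
+- lock acquisition on the critical path makes heavy use of conditional acquire;
+  when a superblock or MAP buffer is in use by someone else, the current
+  implementation prefers to skip rather than wait
+- a relocation commits only if the old mapping is still the currently published
+  truth; if the foreground has already won the update, the compactor abandons
+  the move
+- the actual segment unlink is not executed synchronously by the compactor
+  either; it is deferred through the later reclaim / sync-request path
+
+This trade-off means the current compactor behaves more like "gently yielding to
+the foreground" than "maximizing background cleanup throughput".  Its first
+priority is that foreground allocation and writes are not noticeably slowed by
+background organization.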
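
Three of the constraints above (skip the round under foreground pressure, bound the number of moves, commit a relocation only if the observed mapping is still current) can be shown in a small single-threaded sketch. The `MoveCandidate` type and `compactor_round()` are hypothetical names; the real code additionally relies on conditional lock acquisition and inflight / barrier state, which this sketch omits.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* One planned relocation: lblk was seen at observed_pblk, move it to target. */
typedef struct MoveCandidate
{
    BlockNumber lblk;
    BlockNumber observed_pblk;
    BlockNumber target_pblk;
} MoveCandidate;

/* Run one bounded round; returns the number of relocations committed. */
static int
compactor_round(BlockNumber *map, const MoveCandidate *cands, int ncands,
                int max_moves, bool foreground_pressure)
{
    int moved = 0;

    if (foreground_pressure)
        return 0;               /* yield the whole round to the foreground */

    for (int i = 0; i < ncands && moved < max_moves; i++)
    {
        /* The foreground remapped this lblk since the scan: abandon the move. */
        if (map[cands[i].lblk] != cands[i].observed_pblk)
            continue;

        map[cands[i].lblk] = cands[i].target_pblk;
        moved++;
    }
    return moved;
}
```

A caller would build the candidate list from a density scan and pass a small `max_moves`, so that a single round can never monopolize the MAP.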
+
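+## 10. inflight / barrier: Consistency of Concurrent Foreground/Background Migration
+
+Once both the foreground and the compactor may migrate the same logical page,
+there must be shared state that says "a migration of this logical page is
+already in progress".  That is the job of inflight / barrier.
+
+This mechanism is explicit in the current implementation because compactor
+relocation is still a raw physical copy, not a shared-buffer-aware copy.
+Without extra serialization, foreground remap, background relocation, and
+physical writes could conflict on the same `lblk`.
+
+inflight / barrier does not address space-management policy; it addresses the
+concurrent consistency of migration actions:
+
+- prevent multiple new mappings from being published concurrently for the same
+  `lblk`
+- make the loser wait for stable committed MAP truth instead of borrowing
+  someone else's owner-local target
+- reduce foreground/background conflicts to owner / claim / barrier semantics
+
+The more accurate division is therefore:
+
+- `map entry + superblock + WAL` defines the durable state truth that crash
+  recovery must restore
+- inflight / barrier defines how those state changes are published safely at
+  runtime
+
+The former answers "what redo must ultimately restore"; the latter answers "who
+may publish the change under concurrent execution".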
+
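+## 11. File Deletion and Segment Lifecycle
+
+File deletion in Umbra is not an ordinary unlink problem; it is a question of
+when a segment's lifecycle enters the safe-deletion boundary.
+
+At least two kinds of scenarios are involved:
+
+- deletion driven by truncate / drop
+- reclaim deletion triggered after compactor organization
+
+In the current branch, the contract best described externally is not "every
+detail of the pending rules" but the following more stable boundary:
+
+- the superblock maintains the reclaim boundary
+- the compactor uses published live mappings and a live-map scan to decide
+  whether a candidate region still has live pages, and moves those live pages
+  away
+- reclaim registers the subsequent physical unlink only when a segment is
+  already below the reclaim boundary and has no live mapping references
+- the actual physical unlink is deferred through PostgreSQL's sync-request /
+  checkpointer path
+- the redo side must be able to accept this lifecycle boundary, rather than only
+  looking at "whether the file is empty right now"
+
+inflight / pending state of course still affects internal correctness, but it
+is better treated as an implementation detail than as a main criterion to be
+expanded on in a message to the community.  Externally, the most important
+points remain:
+
+- unlink is not "delete as soon as it looks empty"
+- unlink is jointly constrained by the reclaim boundary, live mappings,
+  checkpoint, and redo semantics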
+
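+## 12. Validation Status
+
+The validation focus of the current PoC is that "this owner / recovery model is
+executable on the covered paths", not that "every boundary has been exhaustively
+proven".
+
+At least the following basic validation has been covered so far:
+
+- `make check` in `md` mode
+- `src/test/recovery check` in `md` mode
+- `make check` in `Umbra` mode
+- `src/test/recovery check` in `Umbra` mode
+
+The Umbra-specific recovery TAP tests further cover several topics directly
+related to the main line of this document, including:
+
+- MAP superblock / map fork policy / mapwriter activity
+- truncate / remap / 2PC remap / skip-WAL dense map redo
+- reclaim / internal segment unlink / compactor relocation
+- range remap zeroextend / ordinary slim block remap / compact birth block remap
+
+This validation best supports the following statement:
+
+- the PoC is no longer only a design; on the currently covered paths it has a
+  minimal closed loop that compiles, passes regression tests, and recovers
+
+The following points, however, cannot be claimed as fully converged on the basis
+of the existing tests alone:
+
+- systematic proof and stronger test coverage of physical-page alignment when
+  the primary and standby checkpoint at different cadences
+- `CREATE DATABASE` copy strategy: `FILE_COPY` is currently supported; `WAL_LOG`
+  is not supported yet, and it falls back to `FILE_COPY` when the Umbra storage
+  manager is enabled
+- an explicit `range-born / batch mapping publish` ownership model
+- dedicated validation of the internal metadata fork / MAP fork crossing
+  `RELSEG_SIZE` segment boundaries (for example beyond `1GB`)
+- a more complete native AIO path and methodologically stronger stress testing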
+
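+## 13. Performance Observations (Qualitative, Not a Rigorous Benchmark Conclusion)
+
+This section only provides qualitative performance signals for the current PoC
+and does not attempt to give rigorous benchmark conclusions.  The methodology is
+still thin: a full hardware description, repeated runs, error / variance
+ranges, and ablation of individual sub-mechanisms are all missing.  The data
+below is therefore better read as "directional observations" than as formal
+methodological results that could directly back a community performance claim.
+
+A sound performance narrative should not treat `md + fpw=off` as a
+correctness-equivalent baseline.  The fair, semantically equivalent default
+comparison target is:
+
+- `md + fpw=on`
+
+The following point is better used as a "mechanism upper bound / sensitivity
+point":
+
+- `md + fpw=off`
+
+On `master`, under the same workload, three modes were compared:
+
+- `md + fpw=on`
+- `md + fpw=off`
+- `Umbra + fpw=on`
+
+Common test settings were:
+
+- `checkpoint_timeout = 2min`
+- `max_wal_size = 20GB`
+- `shared_buffers = 50GB`
+- `logging_collector = on`
+- `runMins = 10`
+- `newOrderWeight = 45`
+- `paymentWeight = 43`
+- `deliveryWeight = 4`
+- `stockLevelWeight = 4`
+- `orderStatusWeight = 4`
+
+To avoid looking only at percentages, the raw throughput results are listed
+first.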
+
+`checksum=off`
+
+| clients | `md + fpw=on` | `md + fpw=off` | `Umbra + fpw=on` |
+| --- | ---: | ---: | ---: |
+| 10 | 158709 | 154283 | 155781 |
+| 50 | 577005 | 626954 | 656353 |
+| 200 | 641899 | 981436 | 995635 |
+| 500 | 322660 | 943295 | 859058 |
+| 1000 | 275609 | 899631 | 729989 |
+
+`checksum=on`
+
+| clients | `md + fpw=on` | `md + fpw=off` | `Umbra + fpw=on` |
+| --- | ---: | ---: | ---: |
+| 10 | 155754 | 152025 | 150606 |
+| 50 | 601974 | 635597 | 650844 |
+| 200 | 621176 | 1015923 | 938311 |
+| 500 | 316950 | 972795 | 729801 |
+| 1000 | 282713 | 891770 | 674865 |
+
+Looking only at the WAL volume difference between `md + fpw=on` and
+`Umbra + fpw=on`, under the same transaction count, the ratio
+`WAL(md + fpw=on) / WAL(Umbra + fpw=on)` is:
+
+`checksum=on`
+
+| clients | `WAL(md + fpw=on) / WAL(Umbra + fpw=on)` |
+| --- | ---: |
+| 10 | 1.82 |
+| 50 | 2.11 |
+| 200 | 3.81 |
+| 500 | 4.58 |
+| 1000 | 4.87 |
+
+`checksum=off`
+
+| clients | `WAL(md + fpw=on) / WAL(Umbra + fpw=on)` |
+| --- | ---: |
+| 10 | 2.03 |
+| 50 | 2.51 |
+| 200 | 5.22 |
+| 500 | 6.90 |
+| 1000 | 6.55 |
+
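+These ratios show more directly that, for the same transaction count, Umbra does
+not only recover throughput lost to ordinary checkpoint-boundary FPW; it also
+substantially reduces the corresponding WAL volume, and the gap generally widens
+as concurrency rises.
+
+From the raw numbers, the improvement of `Umbra + fpw=on` over `md + fpw=on` is
+clear and stable:
+
+- with `checksum=off`, the improvement at 50 / 200 / 500 / 1000 clients is
+  about `+13.8% / +55.1% / +166.2% / +164.9%`
+- with `checksum=on`, the improvement at 50 / 200 / 500 / 1000 clients is about
+  `+8.1% / +51.1% / +130.3% / +138.7%`
+
+At 10 clients, the three results are very close.  That looks more like
+low-concurrency noise or a region that FPW does not dominate.  The gap opens
+from 50 clients upward, where the I/O cost of the ordinary checkpoint-boundary
+path starts to accumulate repeatedly.
+
+Relative to the `md + fpw=off` upper bound, `Umbra + fpw=on` keeps pace at
+moderate concurrency but falls behind at high concurrency:
+
+- with `checksum=off`, Umbra matches or slightly exceeds `md + fpw=off` at 50 and
+  200 clients, but remains clearly behind it at 500 and 1000 clients
+- with `checksum=on`, this is more visible, suggesting that Umbra recovers most
+  of the ordinary-FPW-related I/O cost but does not remove all of the remaining
+  overhead
+
+The current data therefore best supports the following qualitative conclusion:
+
+- Umbra recovers most of the throughput that `md + fpw=on` loses on the ordinary
+  FPW path; the safer interpretation is recovery of the related I/O cost
+- this benefit is visible with both `checksum=on` and `checksum=off`
+- `md + fpw=off` can only serve as a sensitivity reference for "roughly where
+  the system upper bound would be if this FPW cost were removed", not as a
+  semantically equivalent baseline
+
+The data also shows that it is too early for fine-grained attribution of the
+absolute benefit, such as "how much a given foreground hot path contributes" or
+"which single mechanism is the main source of benefit".  What the current data
+supports is that a large part of the benefit comes from remap metadata taking
+over the ordinary checkpoint-boundary FPW path and thereby recovering the
+related I/O cost.
+
+Splitting the benefit further into "how much comes from lower WAL write and
+sync pressure, how much from lower data write amplification, how much from
+preallocation" would require dedicated ablation and a fuller methodology.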
+
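+## 14. Open Engineering Work
+
+The following items are better written explicitly as follow-up than implied to
+be completed capabilities:
+
+1. `compactor` engineering: the framework exists, but background convergence
+   efficiency and directory-discovery cost control are not fully engineered.
+   The sparse-segment discovery cost problem still cannot be described as
+   solved.  For the PoC, this is engineering follow-up and does not block the
+   minimal remap/recovery loop.
+2. `CREATE DATABASE` copy strategy: PostgreSQL's existing `FILE_COPY` directory
+   / file copy path is supported; the `WAL_LOG` path, which copies block by
+   block and logs each block to WAL, is not supported yet.  With the Umbra
+   storage manager enabled, it falls back to `FILE_COPY`.  This is an explicit
+   limitation and should not be overstated.
+3. `superblock shared-entry replacement`: the current shape is still closer to
+   allocate/free than replacement/eviction.  In practice, capacity pressure has
+   to be absorbed by increasing `map_superblocks`.  This is engineering
+   follow-up.
+4. `AIO` integration: the necessary adaptation is done, but it is not a
+   complete Umbra-native rewrite.  The async read/write side cannot be claimed
+   to have fully converged.  This remains engineering follow-up.
+5. `range-born / batch mapping publish`: an explicit upper-layer interface is
+   missing, so the `smgr` layer currently provides a conservative fallback.
+   Extending multiple blocks at once still depends on an implementation that
+   stays compatible with older AM/WAL ordering.  This is design/engineering
+   follow-up.
+6. `primary/standby physical-page alignment`: this class of problem has been
+   clearly identified.  On the local `no-image remap redo` path, the current
+   implementation explicitly adds a stronger publication / flush constraint
+   such as `FlushOneBuffer()` and has local recovery coverage.  Primary/standby
+   physical-page alignment still cannot be described as systematically
+   converged, so stronger replication/recovery consistency claims should remain
+   restricted.
+
+Compressed further into one sentence, the current PoC is closer to "the core
+correctness / recovery loop is established and already compiles, passes
+regression tests, and recovers in its basic shape, while still carrying a set
+of explicitly marked host-tree follow-up items".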
+
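+## 15. Semantic Layers of the Current PoC
+
+This section only describes the semantic boundaries that each layer should own
+if the current PoC is organized as `P1-P9`.  It is not a mapping from arbitrary
+working branches to commit numbers, and it does not describe release cadence.
+
+The split should not be cut mechanically along file directories.  It should be
+understood along state-machine and owner boundaries: earlier layers establish
+the minimal recoverable mechanism, and later layers add checkpoint, mapwriter,
+compactor, and other engineering capabilities.
+
+The purpose is for each layer to state which correctness owner or engineering
+boundary it introduces:
+
+- earlier layers should mostly establish base mechanisms, without mixing in
+  later engineering follow-up
+- later layers gradually introduce WAL/redo, checkpoint, mapwriter, compactor,
+  recovery tests, and related capabilities
+- parts that are not finished yet should be written explicitly as follow-up
+  rather than implied to be complete
+
+As defined by the current PoC branch, the more natural split order converges on
+`P1-P9`:
+
+- `P1`: establish the `smgr` implementation boundary, add the `--with-umbra`
+  choice point, and keep the ordinary `md` path unchanged
+- `P2`: introduce the `umfile` physical file layer and metadata storage
+  primitives; first cover physical files, segments, create / unlink, read /
+  write / extend / truncate, and similar low-level capabilities
+- `P3`: introduce the metadata disk format and the identity-mapping bootstrap
+  path so that the metadata fork, superblock layout, and initial mapping state
+  stand on their own
+- `P4`: introduce the shared-memory MAP cache and checkpoint flush foundation,
+  so that MAP metadata caching, materialization, and dirty / flush semantics
+  stand on their own
+- `P5`: introduce MAP access policy, logical-to-physical translation, and the
+  materialization contract, so `MAIN/FSM/VM` keep using logical block numbers
+  above `smgr` while `lblk -> pblk` is resolved below it
+- `P6`: introduce WAL records, mapped birth, and the redo state machine; first
+  build the minimal WAL/redo owners for `MAP_SET`, truncate, metadata
+  lifecycle, and skip-WAL pending
+- `P7`: introduce ordinary remap, block-reference remap, and
+  checkpoint-boundary FPW replacement, closing the alternative representation
+  for the ordinary checkpoint-boundary image path
+- `P8`: add checkpoint / mapwriter writeback and physical preallocation, giving
+  clear owners to MAP metadata writeback, background preallocation, and
+  one-shot foreground preallocation under low-water pressure
+- `P9`: introduce the compactor framework and the foreground non-interference
+  policy, converging inflight / barrier, reclaim, delayed unlink, and compactor
+  relocation into a background organization framework
+
+The point of this order is that earlier layers build the correctness owner
+model first, and later layers gradually handle engineering pressure.  In
+particular, host-tree integration issues such as the `CREATE DATABASE` copy
+strategy, AIO, and primary/standby physical-page consistency should not be
+disguised as parts of a core mechanism that has already converged.  They are
+better discussed separately as explicit follow-up.
+
+Thus, `P1-P9` here expresses only the semantic boundary of the current PoC:
+which parts belong to the minimal core correctness loop, which parts are
+engineering enhancements, and which parts are still compatibility fallback or
+later integration items.  Tests and documentation should also belong to their
+related semantic layer, instead of being pulled out into a generic "test /
+documentation layer".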
+
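+## 16. Summary
+
+Umbra does not claim that "PostgreSQL no longer needs full-page images", and it
+should not be broadly described as a new "storage engine".  More accurately, it
+provides a remap-based way of expressing the recovery baseline for the ordinary
+checkpoint-boundary FPW path at the PostgreSQL `storage manager` / physical
+storage layer.  Its core is:
+
+- split logical page identity from physical placement in the storage layer
+- use map entries for per-page mapping truth
+- use the superblock for fork-level global truth
+- use WAL to publish these state changes atomically and recoverably
+
+On top of that:
+
+- the foreground always allocates a new physical page and does not dispose of
+  old pages in the hot path
+- mapwriter mainly smooths the MAP metadata side in the background, rather than
+  owning all expansion
+- compactor owns long-term space convergence
+- inflight / barrier keeps foreground/background migration and physical writes
+  concurrency-safe
+- file deletion and segment lifecycle are jointly constrained by the reclaim
+  boundary, live mappings, checkpoint, and redo semantics
+
+The value of this design is not a single local optimization, and it cannot be
+summed up as "turning FPW off".  What it really challenges is the cost model
+bound to ordinary checkpoint-boundary FPW, while trying to keep the
+crash-recovery semantics required by `md + fpw=on`.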
+
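+## Appendix: Implementation Transparency Statement
+
+The way this code came into being should be transparent.  Umbra's core
+architecture, boundary definitions, and key state machines come from the
+author's own design and prototyping work around PostgreSQL storage / WAL /
+recovery semantics.  The author also maintains the early validation prototype
+`shadow`:
+<https://github.com/nayishan/postgre_umbra/tree/shadow-pg12-archive>.
+
+To expand the prototype into a PoC of the current size, the author used AI
+coding assistants (for example Codex) extensively for concrete implementation,
+boilerplate expansion, and local refactoring.  That work relies heavily on the
+prior logic analysis, the `shadow` prototype, and the code shapes and call
+order of the existing PostgreSQL implementation.
+
+The responsibility boundary also needs to be stated clearly: core design,
+boundary definition, and key logic decisions are the author's responsibility;
+AI is mainly used to speed up tedious implementation details.  Current AI still
+cannot independently handle concurrency timing, the owner model, or
+crash-recovery semantics in a database kernel, so the code may still contain
+areas with inconsistent style or in need of later engineering convergence.  The
+current status remains a PoC, not a finished product with final host-tree
+polish.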
diff --git a/doc/umbra/WAL_AND_REDO.md b/doc/umbra/WAL_AND_REDO.md
new file mode 100644
index 0000000000..3feddd66a6
--- /dev/null
+++ b/doc/umbra/WAL_AND_REDO.md
@@ -0,0 +1,419 @@
+# Umbra WAL and Redo Semantics on PostgreSQL Master
+
+This document describes the current WAL payload and redo rules used by the
+PostgreSQL master Umbra PoC.
+
+The design has two WAL-visible pieces:
+
+- remap metadata attached to ordinary block references
+- Umbra rmgr records for MAP lifecycle operations
+
+Those two mechanisms are complementary.  They solve different problems and are
+replayed in different layers.
+
+## 1. Ordinary Block Records with Remap Metadata
+
+Umbra extends ordinary WAL block references with an extra block-header payload
+when `BKPBLOCK_HAS_REMAP` is set.
+
+The full payload is:
+
+- `old_pblkno`
+- `new_pblkno`
+- `logical_nblocks`
+- `next_free_pblkno`
+
+The meaning of those fields is:
+
+- `old_pblkno`
+  - the old published physical baseline for this logical block
+  - `InvalidBlockNumber` means first published mapping
+- `new_pblkno`
+  - the physical block that becomes the new published target
+- `logical_nblocks`
+  - logical frontier payload needed when redo is publishing a first-born page
+- `next_free_pblkno`
+  - allocator frontier payload that keeps replay-side physical allocation
+    deterministic
+
+The remap header does not try to encode every superblock fact.  It only carries
+the block-local transition plus the frontier state redo needs to keep the MAP
+view deterministic.
+
+The current record-level remap format is encoded in `xl_info`:
+
+- full remap:
+  - `old_pblkno`
+  - `new_pblkno`
+  - `logical_nblocks`
+  - `next_free_pblkno`
+- compact birth:
+  - `new_pblkno`
+  - `logical_nblocks`
+  - `next_free_pblkno`
+- ordinary slim:
+  - `old_pblkno`
+  - `new_pblkno`
+  - `next_free_pblkno`
+
+Compact birth records omit `old_pblkno`, because first-born remaps always use
+`InvalidBlockNumber` for the old physical block.
+
+Ordinary slim records omit `logical_nblocks`, because ordinary remap has a
+valid old physical baseline and does not publish a first-born logical EOF.
+
+The branch deliberately does not use a "tiny birth" header that carries only
+`new_pblkno`.  Such a format is only safe when the replay-side frontier can be
+derived locally.  That condition is too narrow for the current owner model, so
+compact birth remains the conservative birth fallback.
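
The three layouts can be pictured as one logical payload from which each format drops the field that redo can infer. The sketch below is illustrative only: the `RemapFormat` enum, the function names, and the on-wire field order are assumptions, not the patch's actual encoding. It does reflect the two omission rules stated above: compact birth implies `old_pblkno = InvalidBlockNumber`, and ordinary slim carries no `logical_nblocks`.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;   /* PostgreSQL block numbers are 32-bit */
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Hypothetical tags for the three header layouts. */
typedef enum
{
    REMAP_FULL,
    REMAP_COMPACT_BIRTH,
    REMAP_ORDINARY_SLIM
} RemapFormat;

typedef struct RemapPayload
{
    BlockNumber old_pblkno;
    BlockNumber new_pblkno;
    BlockNumber logical_nblocks;
    BlockNumber next_free_pblkno;
} RemapPayload;

/* Serialize only the fields the format carries; returns the field count. */
static size_t
remap_encode(RemapFormat fmt, const RemapPayload *p, BlockNumber out[4])
{
    size_t n = 0;

    if (fmt != REMAP_COMPACT_BIRTH)
        out[n++] = p->old_pblkno;       /* birth implies InvalidBlockNumber */
    out[n++] = p->new_pblkno;
    if (fmt != REMAP_ORDINARY_SLIM)
        out[n++] = p->logical_nblocks;  /* slim publishes no logical EOF */
    out[n++] = p->next_free_pblkno;
    return n;
}

/* Rebuild the logical payload; omitted fields get their implied values. */
static void
remap_decode(RemapFormat fmt, const BlockNumber in[], RemapPayload *p)
{
    size_t n = 0;

    p->old_pblkno = (fmt == REMAP_COMPACT_BIRTH) ? InvalidBlockNumber : in[n++];
    p->new_pblkno = in[n++];
    /* InvalidBlockNumber is this sketch's marker for "not carried". */
    p->logical_nblocks =
        (fmt == REMAP_ORDINARY_SLIM) ? InvalidBlockNumber : in[n++];
    p->next_free_pblkno = in[n++];
}
```

With 4-byte block numbers, the full format costs 16 bytes per block reference and the two reduced formats cost 12 bytes each.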
+
+## 2. Producer-Side Decisions in `xloginsert.c`
+
+Producer-side logic lives in `XLogRecordAssembleUmbra()`.
+
+The code still starts from PostgreSQL's normal questions:
+
+- does this record need a backup image?
+- does it need data payload?
+
+Umbra then adds a second question:
+
+- does this block record need remap metadata?
+
+Those decisions are related, but they are not collapsed into one boolean.
+
+### 2.1 Automatic checkpoint-boundary remap
+
+For ordinary data-bearing records:
+
+- `REGBUF_FORCE_IMAGE` means:
+  - backup image yes
+  - automatic remap no
+- `REGBUF_NO_IMAGE` means:
+  - backup image no
+  - automatic remap no
+- `!doPageWrites` means:
+  - backup image no
+  - automatic remap no
+- `RM_XLOG_ID / XLOG_FPI_FOR_HINT` keeps `md`'s hint-image rule:
+  - backup image if `page_lsn <= RedoRecPtr`
+  - no remap
+- ordinary checkpoint-boundary case means:
+  - no backup image
+  - remap if `page_lsn <= RedoRecPtr`
+
+That last rule is where Umbra replaces `md`'s ordinary checkpoint-boundary
+backup image path with remap-aware WAL.
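
The rule list above reads naturally as a small decision function. The sketch below assumes the flags are tested in the same order as stock `XLogRecordAssemble()` tests them (force-image first, then no-image, then `doPageWrites`). The struct and function names are hypothetical, and the real `XLogRecordAssembleUmbra()` may be structured differently.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical flattened view of one registered block at assembly time. */
typedef struct BlockDecisionInput
{
    bool       force_image;     /* REGBUF_FORCE_IMAGE */
    bool       no_image;        /* REGBUF_NO_IMAGE */
    bool       do_page_writes;  /* doPageWrites */
    bool       is_hint_fpi;     /* RM_XLOG_ID / XLOG_FPI_FOR_HINT */
    XLogRecPtr page_lsn;
    XLogRecPtr redo_rec_ptr;    /* RedoRecPtr */
} BlockDecisionInput;

typedef struct BlockDecision
{
    bool needs_image;
    bool needs_remap;
} BlockDecision;

static BlockDecision
decide_block(const BlockDecisionInput *in)
{
    BlockDecision d = {false, false};
    bool first_dirty = in->page_lsn <= in->redo_rec_ptr;

    if (in->force_image)
        d.needs_image = true;           /* image yes, automatic remap no */
    else if (in->no_image || !in->do_page_writes)
    {
        /* neither an image nor an automatic remap */
    }
    else if (in->is_hint_fpi)
        d.needs_image = first_dirty;    /* md's hint-image rule, no remap */
    else
        d.needs_remap = first_dirty;    /* remap replaces the default image */

    return d;
}
```

Only the last branch differs from the stock rule: where `md` would set `needs_image`, this sketch sets `needs_remap` instead.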
+
+### 2.2 `REGBUF_LOGICAL_BIRTH`
+
+Umbra also supports an explicit first-born owner path through
+`REGBUF_LOGICAL_BIRTH`.
+
+When a registered buffer is marked logical-birth and does not already carry
+remap metadata, WAL assembly:
+
+1. opens the relation
+2. tries to find an already-published mapping
+3. tries to find a pending reserved mapping
+4. only if both are absent, reserves a fresh physical block
+
+The outcome is then recorded as:
+
+- `old_pblkno = InvalidBlockNumber`
+- `new_pblkno = chosen physical block`
+- `has_remap = true`
+
+This is not "always allocate a new pblk immediately".  It is "WAL assembly owns
+the first-born publication if no prior mapping or reservation already exists".
+The chosen `pblk` may come from a runtime reservation frontier, but that
+reservation is not yet committed superblock state.
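
The lookup order in steps 2 to 4 can be sketched as follows. The `BirthState` view and `resolve_logical_birth()` are hypothetical stand-ins for the relation lookup, the published MAP entry, the pending reservation table, and the runtime reservation frontier; none of these names come from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Hypothetical view of the mapping state for one logical block. */
typedef struct BirthState
{
    BlockNumber  published_pblk;        /* committed mapping, or Invalid */
    BlockNumber  pending_pblk;          /* reserved, unpublished, or Invalid */
    BlockNumber *reservation_frontier;  /* runtime-only allocator frontier */
} BirthState;

typedef struct RemapChoice
{
    BlockNumber old_pblkno;
    BlockNumber new_pblkno;
    bool        has_remap;
} RemapChoice;

static RemapChoice
resolve_logical_birth(BirthState *st)
{
    /* A birth always records old_pblkno = InvalidBlockNumber. */
    RemapChoice c = {InvalidBlockNumber, InvalidBlockNumber, true};

    if (st->published_pblk != InvalidBlockNumber)
        c.new_pblkno = st->published_pblk;      /* already-published mapping */
    else if (st->pending_pblk != InvalidBlockNumber)
        c.new_pblkno = st->pending_pblk;        /* existing reservation */
    else
    {
        /* Reserve a fresh block; this is not yet committed superblock state. */
        c.new_pblkno = (*st->reservation_frontier)++;
        st->pending_pblk = c.new_pblkno;
    }
    return c;
}
```

Calling the function twice for the same block returns the same `new_pblkno` and advances the frontier only once, which is the property the "no prior mapping or reservation" wording is after.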
+
+### 2.3 When remap metadata is included
+
+The current inclusion rule is:
+
+- if the block already has remap metadata, include it
+- otherwise, if the automatic remap rule says this record needs remap, build
+  and include it
+
+So remap-bearing records come from two sources:
+
+- explicit logical-birth ownership
+- ordinary checkpoint-boundary remap
+
+### 2.4 Interaction with images
+
+When remap metadata is included:
+
+- `BKPBLOCK_HAS_REMAP` is set
+- the remap header is filled
+
+If the remap came from the ordinary checkpoint-boundary path, Umbra suppresses
+the ordinary backup image unless:
+
+- the caller explicitly forced an image, or
+- `XLR_CHECK_CONSISTENCY` requires one
+
+That means the current rule is not:
+
+- "full-page image and remap are always mutually exclusive"
+
+It is:
+
+- "automatic checkpoint-boundary remap replaces the default image path"
+- explicit image owners can still coexist with remap metadata
+
+## 3. Post-Insert Publication
+
+After WAL insertion succeeds, `XLogCommitBlockRemapsUmbra()` publishes the
+winner state for each block record that carried remap metadata.
+
+That commit step:
+
+- installs the new mapping with `UmMapSetMapping()`
+- bumps committed `next_free_pblkno` when needed
+- bumps `logical_nblocks` for first-born publication
+- updates cached relation size state for WAL-owned first-born pages
+- releases pending reservations
+
+This is an important owner boundary:
+
+- the block record is assembled before insert
+- publication becomes durable owner state only after insert succeeds
+- runtime reservation state may run ahead transiently, but checkpoint-visible
+  superblock state is published only at this boundary
+
+## 4. Umbra RMGR Records
+
+Umbra also has a small rmgr (`RM_UMBRA_ID`) for MAP lifecycle operations.
+
+Current records include:
+
+- `XLOG_UMBRA_MAP_SET`
+- `XLOG_UMBRA_RANGE_REMAP`
+- `XLOG_UMBRA_RANGE_REMAP_COMPACT`
+- `XLOG_UMBRA_SKIP_WAL_DENSE_MAP`
+- `XLOG_UMBRA_RECLAIM_UNLINK`
+
+`XLOG_UMBRA_SKIP_WAL_DENSE_MAP` is a redo anchor for skip-WAL relations.  It
+does not replace the skip-WAL sync protocol.  For each encoded fork it states:
+
+- `[0, nblocks)` is dense
+- `pblk == lblk` in that range
+- `logical_nblocks = nblocks`
+- `physical_nblocks = nblocks`
+- `next_free_pblkno = nblocks`
+
+The producer should not encode empty forks.  An `nblocks == 0` entry has no
+mapping work and does not advance any frontier.
+
+These records are not a replacement for block-header remap metadata.
+
+Their purpose is different:
+
+- block-header remap metadata is for ordinary WAL block replay
+- Umbra rmgr records are for explicit MAP lifecycle actions such as
+  compactor/reclaim/state maintenance
+
+### 4.1 Why this branch does not use an explicit RangeMap / range-born owner model
+
+The current branch deliberately does not make `range-born / batch mapping
+publish` a first-class upper-layer contract.
+
+The reason is not that range publication is impossible.  The problem is owner
+clarity.
+
+At the PostgreSQL call sites Umbra currently has explicit ownership for:
+
+- one logical block being born for the first time
+- one logical block being remapped at a checkpoint-boundary WAL site
+- explicit Umbra-internal lifecycle records such as compactor/reclaim work
+
+What is still missing is a concrete upper-layer use site that already owns
+range publication as one semantic unit.  A future example could be something
+like a hash-AM split/redistribution path where a well-defined logical range is
+materialized and published under one owner.  The current branch does not wire
+such a caller yet.
+
+The branch also does **not** yet have a generic upper-layer interface that says:
+
+- this WAL owner is publishing a whole logical range at once
+- this range has one well-defined ordering point
+- redo can treat that range as a single published unit
+
+Without that interface, a generic RangeMap-style contract would push too much
+ambiguity into WAL assembly and redo:
+
+- which layer owns range extent vs. per-block visibility
+- when logical EOF becomes durable for the whole range
+- how allocator frontier publication is synchronized with existing AM/WAL ordering
+- whether a later block in the same range may become visible before an earlier
+  block's WAL ownership is fully established
+
+The current branch therefore stays conservative:
+
+- ordinary upper-layer WAL continues to publish remap state per block
+- first-born publication remains explicit per block
+- range-shaped operations are limited to internal Umbra-controlled lifecycle
+  paths where ownership is already local and bounded
+
+That is why the branch has range remap records for internal lifecycle work, but
+does not yet describe a generic upper-layer RangeMap contract as a settled
+feature.
+
+## 5. Redo Entry in `xlogutils.c`
+
+Redo-side interpretation lives in `XLogReadBufferForRedoExtendedUmbra()`.
+
+The redo entry layer owns:
+
+- metadata bootstrap for mapped-fork redo
+- redo-time MAP state seeding
+- interpretation of `has_remap`
+- distinction between remap-with-image and remap-without-image
+
+It intentionally does not push those semantics down into a generic read helper,
+because the generic helper does not know:
+
+- whether the record has remap metadata
+- whether the record has a block image
+- what `old_pblkno` and `new_pblkno` are
+
+## 6. Redo Bootstrap Before Replay
+
+Before replaying mapped-fork data, redo first ensures:
+
+- the relation is open
+- MAP state is seeded on the `SMgrRelation`
+- the metadata fork exists when the fork requires mapping
+
+That work is currently done by:
+
+- `XLogUmbraMapStateForRedo()`
+- `XLogUmbraEnsureMetadataForRedo()`
+- `XLogUmbraEnsureMappedBlockForRedo()`
+
+This is redo-only bootstrap logic and intentionally belongs at the redo-entry
+layer.
+
+## 7. Redo Cases
+
+### 7.1 No remap metadata
+
+If `has_remap` is false, Umbra first ensures metadata for mapped forks and then
+falls back to the ordinary PostgreSQL-style block restore/read path:
+
+- image -> restore image
+- no image -> read current block view and compare LSN
+
+This is the least interesting Umbra case; it mostly behaves like md plus
+metadata availability checks.
+
+### 7.2 Remap with image
+
+If `has_remap` and `has_image` are both true, redo:
+
+1. installs the new mapping immediately
+2. bumps `next_free_pblkno` if provided
+3. bumps `logical_nblocks` for first-born publication
+4. ensures the mapped block exists
+5. restores the block image into that new mapping view
+
+This is the current "phase-1" remap replay path.
+
+### 7.3 Remap without image, zero/init mode
+
+If `has_remap` is true, `has_image` is false, and redo is in zero/init mode,
+redo:
+
+1. installs the new mapping immediately
+2. bumps frontier payload as needed
+3. ensures the mapped block exists
+4. reads the block in zero/init mode
+
+This covers first-born and initialization-style replay where no old physical
+baseline is needed.
+
+### 7.4 Remap without image, ordinary mode
+
+This is the most Umbra-specific case.
+
+Redo requires a valid old physical baseline.  A delta-only remap is therefore
+replayed as old physical page plus WAL delta, not as overwrite-in-place on the
+new physical page.  Redo does not publish the new mapping first; instead it:
+
+1. temporarily installs `old_pblkno` as the current mapping
+2. ensures the old mapped block is readable
+3. reads and locks the buffer through that old mapping view
+4. dirties and flushes that buffer state
+5. switches the mapping to `new_pblkno`
+6. bumps `next_free_pblkno` if carried in the record
+
+The important rule is:
+
+- remap-without-image replay first reads through the old mapping view
+- it applies the WAL delta against that old physical baseline
+- it publishes the new mapping only after that baseline has been consumed
+
+That is what makes delta replay deterministic without requiring a full-page
+image in the ordinary checkpoint-boundary case.
+
+## 8. First-Born Pages
+
+A first-born page is identified by:
+
+- `old_pblkno == InvalidBlockNumber`
+
+Current first-born handling is split:
+
+- producer side may reserve and publish WAL-owned first-born remap metadata
+- post-insert publication bumps logical frontier
+- redo side uses `logical_nblocks` payload to keep replay-side logical EOF in
+  sync
+
+This avoids depending on generic `smgrextend()` ownership for WAL-owned logical
+births.
+
+## 9. Metadata Fork and Redo
+
+Mapped-fork redo depends on metadata-fork availability.
+
+The current rule is:
+
+- redo creates metadata when mapped replay requires it
+- normal data paths should not repeatedly rediscover metadata existence
+
+This keeps redo-only bootstrap in redo owner code instead of leaking it into
+unrelated access paths.
+
+## 10. Current Conservative Choices
+
+The branch still chooses conservative rules in a few places:
+
+- explicit image owners keep their image semantics
+- first-born and initialization cases carry dedicated frontier payload when it
+  cannot be derived from a stronger WAL anchor
+- checksum-driven hint FPIs still use PostgreSQL's `XLOG_FPI_FOR_HINT` path
+- redo keeps a very explicit old-view/new-view split for remap-without-image
+- remap format is record-level, so mixed birth/ordinary remap records can fall
+  back to the full header rather than using per-block variant tags
+
+Those rules are deliberate.  The current branch favors deterministic ownership
+and clear replay state over collapsing every case into a smaller but
+harder-to-reason-about WAL contract.
+
+## 11. Summary
+
+The current Umbra WAL/redo design can be summarized as:
+
+- ordinary block records may carry remap metadata
+- remap metadata records physical transition plus frontier state
+- WAL publication of that state is committed only after insert succeeds
+- redo explicitly distinguishes no-remap, remap-with-image, and
+  remap-without-image
+- Umbra rmgr records remain available for MAP lifecycle operations outside the
+  ordinary block-header remap path
+
+That is the basis on which the current master PoC reduces ordinary
+checkpoint-boundary backup-image pressure while keeping replay deterministic.
diff --git a/doc/umbra/WAL_AND_REDO_ZH.md b/doc/umbra/WAL_AND_REDO_ZH.md
new file mode 100644
index 0000000000..f850142bcb
--- /dev/null
+++ b/doc/umbra/WAL_AND_REDO_ZH.md
@@ -0,0 +1,248 @@
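+# Umbra WAL and Redo Semantics
+
+This document is the Chinese-language companion to `WAL_AND_REDO.md`.  It
+describes the WAL contents and redo rules of the current Umbra prototype.
+
+## 1. Two Mechanisms That Appear in WAL
+
+Umbra has two mechanisms that appear in WAL:
+
+- remap metadata on ordinary WAL block references
+- Umbra's own rmgr records
+
+Neither replaces the other:
+
+- block-header remap metadata lets ordinary WAL block replay find the correct physical baseline
+- Umbra rmgr records express MAP lifecycle events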
+
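+## 2. Remap Header
+
+When an ordinary block reference sets `BKPBLOCK_HAS_REMAP`, it carries the remap fields.
+
+The full set of fields is:
+
+- `old_pblkno`
+- `new_pblkno`
+- `logical_nblocks`
+- `next_free_pblkno`
+
+Their meanings are:
+
+- `old_pblkno`
+  - the old published physical baseline of the current logical block
+  - `InvalidBlockNumber` means first-born
+- `new_pblkno`
+  - the new physical block about to be published
+- `logical_nblocks`
+  - the logical EOF to advance on first-born or range birth
+- `next_free_pblkno`
+  - the committed allocation frontier
+  - redo uses it to keep physical allocation deterministic
+
+Note that `next_free_pblkno` is not necessarily `new_pblkno + 1`.  It is the
+global committed allocation frontier.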
+
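+## 3. WAL Producer Rules
+
+The WAL producer logic lives in `XLogRecordAssembleUmbra()`.
+
+PostgreSQL already decides:
+
+- whether a backup image is needed
+- whether a data payload is needed
+
+Umbra adds one more decision:
+
+- whether remap metadata is needed
+
+These decisions cannot be collapsed into a single boolean.
+
+In the ordinary checkpoint-boundary case, if the page qualifies for automatic
+remap, Umbra replaces the default full-page image path with remap metadata.
+
+The conservative boundaries are:
+
+- `REGBUF_FORCE_IMAGE` keeps image semantics
+- `REGBUF_NO_IMAGE` does not remap automatically
+- `!doPageWrites` does not remap automatically
+- `XLOG_FPI_FOR_HINT` keeps following PostgreSQL's hint-image rule
+- `XLR_CHECK_CONSISTENCY` keeps the consistency-check image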
+
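+## 4. First-Born
+
+`REGBUF_LOGICAL_BIRTH` is the explicit first-born owner path.
+
+During WAL assembly, Umbra:
+
+1. opens the relation
+2. looks for an already-published mapping
+3. looks for a pending reserved mapping
+4. if neither exists, reserves a new physical block
+
+It then records:
+
+- `old_pblkno = InvalidBlockNumber`
+- `new_pblkno = chosen physical block`
+- `has_remap = true`
+
+The chosen physical block may come from the runtime reservation frontier, but
+until WAL insert succeeds it is not committed superblock state.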
+
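+## 5. Publication After WAL Insert
+
+`XLogCommitBlockRemapsUmbra()` publishes remap state after WAL insert succeeds.
+
+It is responsible for:
+
+- installing the new `lblk -> pblk` mapping
+- advancing the committed `next_free_pblkno` when needed
+- advancing `logical_nblocks` on first-born
+- updating the relation size cache for WAL-owned first-born pages
+- releasing pending reservations
+
+This boundary is very important:
+
+- WAL assembly may choose the physical block in advance
+- the committed mapping and frontier are published only after WAL insert succeeds
+- the runtime reservation frontier may run ahead
+- the committed frontier in the on-disk superblock must not run ahead of WAL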
+
+## 6. Umbra rmgr Records
+
+The current Umbra rmgr records are:
+
+- `XLOG_UMBRA_MAP_SET`
+- `XLOG_UMBRA_RANGE_REMAP`
+- `XLOG_UMBRA_RANGE_REMAP_COMPACT`
+- `XLOG_UMBRA_SKIP_WAL_DENSE_MAP`
+- `XLOG_UMBRA_RECLAIM_UNLINK`
+
+`XLOG_UMBRA_SKIP_WAL_DENSE_MAP` is the mapping anchor for skip-WAL relations.  It states:
+
+- `[0, nblocks)` is dense
+- `pblk == lblk`
+- `logical_nblocks = nblocks`
+- `physical_nblocks = nblocks`
+- `next_free_pblkno = nblocks`
+
+It is not a substitute for `fsync` of the data files; the skip-WAL sync protocol still exists independently.
+
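+### 6.1 Why RangeMap / range-born is not a formal mechanism yet
+
+The current branch deliberately does **not** make `range-born / batch mapping
+publish` a formal upper-layer contract.
+
+The reason is not that range publication is impossible; a sufficiently clear owner boundary is still missing.
+
+At PostgreSQL's existing call sites, the owners Umbra can clearly identify today are mainly:
+
+- first-born of a single logical block
+- remap of a single logical block at a checkpoint boundary
+- Umbra-internal lifecycle records such as `compactor` / `reclaim`
+
+What is really missing is an upper-layer use site that already owns publication
+semantics per range.  If a future path, such as a hash access method split /
+redistribution, can extend, WAL-log, and publish a logical range as a single owner
+unit, range remap would have a natural home.  The current branch does not wire such a caller yet.
+
+It also does not yet have a generic upper-layer interface that can state:
+
+- this WAL owner publishes a whole logical range at once
+- this range has one clear, unique ordering boundary
+- redo can treat the whole range as a single published unit
+
+Forcing in a generic RangeMap-style contract without such an interface would push
+too much ambiguity into WAL assembly and redo:
+
+- who owns the range extent, and who owns per-block visibility
+- when the logical EOF of the whole range counts as durable
+- how allocator frontier publication stays consistent with existing AM / WAL ordering
+- whether a later block in a range can become visible before an earlier block's WAL ownership is established
+
+The current branch therefore takes the more conservative approach:
+
+- ordinary upper-layer WAL still publishes remap state per block
+- first-born is still published explicitly per block
+- range-shaped operations are used only on internal lifecycle paths that Umbra
+  fully controls and whose owner boundary is already tight
+
+That is why the branch has range remap records but cannot yet describe a "generic
+upper-layer RangeMap contract" as a settled capability.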
+
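+## 7. Redo Entry
+
+The core redo-side entry point is `XLogReadBufferForRedoExtendedUmbra()`.
+
+The redo entry layer is responsible for:
+
+- metadata bootstrap for mapped forks
+- redo-time MAP state seeding
+- interpreting `has_remap`
+- distinguishing remap with an image from remap without an image
+
+This logic cannot be pushed down into the generic read helper, because that
+helper does not understand the ownership semantics of the remap header.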
+
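+## 8. Redo Cases
+
+### 8.1 No remap
+
+Without remap metadata, Umbra first ensures that the metadata exists and then
+follows a path close to ordinary PostgreSQL redo:
+
+- image: restore the image
+- no image: read the current block view and compare LSN
+
+### 8.2 Remap with image
+
+With both remap and an image, redo:
+
+1. installs the new mapping first
+2. advances the frontier
+3. advances the logical EOF on first-born
+4. ensures the mapped block exists
+5. restores the image into the new mapping view
+
+### 8.3 Remap without image (zero/init)
+
+With remap, no image, and redo in zero/init mode, redo:
+
+1. installs the new mapping
+2. advances the frontier
+3. ensures the mapped block exists
+4. reads in zero/init mode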
+
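+### 8.4 Remap without image (ordinary delta replay)
+
+This is the most critical Umbra case.
+
+An ordinary delta remap without an image needs the old physical baseline.  Its
+replay semantics are "old physical page + WAL delta", not overwrite-in-place on the
+new physical page.  Redo therefore cannot publish the new mapping first; it must:
+
+1. temporarily install `old_pblkno`
+2. read the page through the old mapping view
+3. lock and modify the buffer
+4. switch to `new_pblkno` only after the old baseline has been consumed
+5. advance `next_free_pblkno`
+
+The core rule is:
+
+- remap-without-image redo first reads the page through the old mapping view
+- the WAL delta is applied to that old physical baseline
+- redo publishes the new mapping only after the old baseline has been consumed
+
+This lets the ordinary checkpoint-boundary case keep redo deterministic without
+depending on a full-page image.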
+
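+## 9. Summary
+
+The current WAL / redo model can be summarized as:
+
+- ordinary block records may carry remap metadata
+- remap metadata expresses the physical transition and frontier information
+- the committed mapping and frontier are published only after WAL insert succeeds
+- redo explicitly distinguishes no remap, remap with image, and remap without image
+- Umbra rmgr records handle MAP lifecycle events outside the ordinary block header
+
+This model is the correctness basis on which Umbra reduces ordinary FPI pressure at checkpoint boundaries.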
-- 
2.50.1 (Apple Git-155)
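
For readers following section 2.1 of WAL_AND_REDO.md in the patch above, the four assembly-time rules it lists can be written as one small decision function. This is only a sketch under stated assumptions, not code from the patch: the `Sketch*` types, `sketch_decide()`, and the order in which the rules are checked are hypothetical. Interactions that the document covers elsewhere (forced images, consistency-check images, `REGBUF_LOGICAL_BIRTH`) are deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an LSN; not PostgreSQL's XLogRecPtr. */
typedef uint64_t SketchLsn;

/* Inputs the four rules look at.  All field names are illustrative. */
typedef struct SketchBlockInput
{
	bool		no_image;		/* models REGBUF_NO_IMAGE on the buffer */
	bool		do_page_writes; /* models doPageWrites */
	bool		is_hint_fpi;	/* models RM_XLOG_ID / XLOG_FPI_FOR_HINT */
	SketchLsn	page_lsn;
	SketchLsn	redo_rec_ptr;	/* models RedoRecPtr */
} SketchBlockInput;

/* The two independent outcomes: image and remap are separate answers. */
typedef struct SketchDecision
{
	bool		backup_image;
	bool		remap;
} SketchDecision;

static SketchDecision
sketch_decide(const SketchBlockInput *in)
{
	SketchDecision d = {false, false};
	bool		first_touch;

	assert(in != NULL);

	/* First modification of this page since the last checkpoint's redo point. */
	first_touch = in->page_lsn <= in->redo_rec_ptr;

	/* REGBUF_NO_IMAGE and !doPageWrites: no image, no automatic remap. */
	if (in->no_image || !in->do_page_writes)
		return d;

	if (in->is_hint_fpi)
		d.backup_image = first_touch;	/* hint rule: image, never remap */
	else
		d.remap = first_touch;			/* ordinary case: remap, no image */

	return d;
}
```

The point of returning two booleans is the one the design notes make: the image decision and the remap decision are separate outputs, and only the ordinary checkpoint-boundary row swaps one for the other.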

Mingwei Jia

i@nayishan.top

22 days ago

In reply to: Mingwei Jia (#2)

[RFC PATCH v2 RESEND 02/10] umbra: add patch 1 smgr implementation boundary

---
configure | 38 ++++++
configure.ac | 10 ++
meson.build | 1 +
meson_options.txt | 3 +
src/Makefile.global.in | 1 +
src/backend/storage/smgr/Makefile | 8 ++
src/backend/storage/smgr/meson.build | 6 +
src/backend/storage/smgr/smgr.c | 52 ++++++++-
src/backend/storage/smgr/umbra.c | 167 +++++++++++++++++++++++++++
src/include/pg_config.h.in | 3 +
src/include/storage/smgr.h | 11 +-
src/include/storage/umbra.h | 54 +++++++++
12 files changed, 345 insertions(+), 9 deletions(-)
create mode 100644 src/backend/storage/smgr/umbra.c
create mode 100644 src/include/storage/umbra.h

diff --git a/configure b/configure
index f66c1054a7..a63ecb3745 100755
--- a/configure
+++ b/configure
@@ -723,6 +723,7 @@ LIBURING_LIBS
 LIBURING_CFLAGS
 with_liburing
 with_readline
+with_umbra
 with_systemd
 with_selinux
 with_ldap
@@ -872,6 +873,7 @@ with_ldap
 with_bonjour
 with_selinux
 with_systemd
+with_umbra
 with_readline
 with_libedit_preferred
 with_liburing
@@ -1590,6 +1592,7 @@ Optional Packages:
   --with-bonjour          build with Bonjour support
   --with-selinux          build with SELinux support
   --with-systemd          build with systemd support
+  --with-umbra            build with Umbra storage manager (experimental)
   --without-readline      do not use GNU Readline nor BSD Libedit for editing
   --with-libedit-preferred
                           prefer BSD Libedit over GNU Readline
@@ -8635,6 +8638,41 @@ fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_systemd" >&5
 $as_echo "$with_systemd" >&6; }

+#
+# Umbra storage manager (experimental)
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with Umbra storage manager" >&5
+$as_echo_n "checking whether to build with Umbra storage manager... " >&6; }
+
+
+
+# Check whether --with-umbra was given.
+if test "${with_umbra+set}" = set; then :
+  withval=$with_umbra;
+  case $withval in
+    yes)
+
+$as_echo "#define USE_UMBRA 1" >>confdefs.h
+
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --with-umbra option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  with_umbra=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_umbra" >&5
+$as_echo "$with_umbra" >&6; }
+
+
 #
 # Readline
 #
diff --git a/configure.ac b/configure.ac
index 8d176bd346..43006a5284 100644
--- a/configure.ac
+++ b/configure.ac
@@ -992,6 +992,16 @@ PGAC_ARG_BOOL(with, systemd, no, [build with systemd support],
 AC_SUBST(with_systemd)
 AC_MSG_RESULT([$with_systemd])

+#
+# Umbra storage manager (experimental)
+#
+AC_MSG_CHECKING([whether to build with Umbra storage manager])
+PGAC_ARG_BOOL(with, umbra, no,
+              [build with Umbra storage manager (experimental)],
+              [AC_DEFINE([USE_UMBRA], 1, [Define to build with Umbra storage manager. (--with-umbra)])])
+AC_MSG_RESULT([$with_umbra])
+AC_SUBST(with_umbra)
+
 #
 # Readline
 #
diff --git a/meson.build b/meson.build
index be97e986e5..016ba9fc0c 100644
--- a/meson.build
+++ b/meson.build
@@ -505,6 +505,7 @@ meson_bin = find_program(meson_binpath, native: true)

 cdata.set('USE_ASSERT_CHECKING', get_option('cassert') ? 1 : false)
 cdata.set('USE_INJECTION_POINTS', get_option('injection_points') ? 1 : false)
+cdata.set('USE_UMBRA', get_option('umbra').enabled() ? 1 : false)

 blocksize = get_option('blocksize').to_int() * 1024

diff --git a/meson_options.txt b/meson_options.txt
index 6a793f3e47..469239543c 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -46,6 +46,9 @@ option('tap_tests', type: 'feature', value: 'auto',
 option('injection_points', type: 'boolean', value: false,
   description: 'Enable injection points')

+option('umbra', type: 'feature', value: 'disabled',
+  description: 'Enable experimental Umbra storage manager')
+
 option('PG_TEST_EXTRA', type: 'string', value: '',
   description: 'Enable selected extra tests. Overridden by PG_TEST_EXTRA environment variable.')

diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index a7699b026b..4e2815218a 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -201,6 +201,7 @@ with_liburing	= @with_liburing@
 with_libxml	= @with_libxml@
 with_libxslt	= @with_libxslt@
 with_llvm	= @with_llvm@
+with_umbra	= @with_umbra@
 with_system_tzdata = @with_system_tzdata@
 with_uuid	= @with_uuid@
 with_zlib	= @with_zlib@
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 1d0b98764f..537e7b65f4 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -17,4 +17,12 @@ OBJS = \
 	md.o \
 	smgr.o

+ifeq ($(with_umbra), yes)
+OBJS += \
+	umbra.o
+endif
+
+# Reconfiguration can change both OBJS and the default smgr implementation.
+objfiles.txt smgr.o: $(top_builddir)/src/Makefile.global
+
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/smgr/meson.build b/src/backend/storage/smgr/meson.build
index 3785c40385..ba28e59f2f 100644
--- a/src/backend/storage/smgr/meson.build
+++ b/src/backend/storage/smgr/meson.build
@@ -5,3 +5,9 @@ backend_sources += files(
   'md.c',
   'smgr.c',
 )
+
+if get_option('umbra').enabled()
+  backend_sources += files(
+    'umbra.c',
+  )
+endif
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5391640d86..a7b70d856c 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -4,8 +4,9 @@
  *	  public interface routines to storage manager switch.
  *
  * All file system operations on relations dispatch through these routines.
- * An SMgrRelation represents physical on-disk relation files that are open
- * for reading and writing.
+ * An SMgrRelation represents storage-manager state for a relation.  The
+ * selected storage manager implementation owns any implementation-specific
+ * state needed to service those operations.
  *
  * When a relation is first accessed through the relation cache, the
  * corresponding SMgrRelation entry is opened by calling smgropen(), and the
@@ -71,6 +72,9 @@
 #include "storage/ipc.h"
 #include "storage/md.h"
 #include "storage/smgr.h"
+#ifdef USE_UMBRA
+#include "storage/umbra.h"
+#endif
 #include "utils/hsearch.h"
 #include "utils/inval.h"

@@ -91,6 +95,7 @@ typedef struct f_smgr
 	void		(*smgr_shutdown) (void);	/* may be NULL */
 	void		(*smgr_open) (SMgrRelation reln);
 	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_destroy) (SMgrRelation reln);	/* may be NULL */
 	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
 								bool isRedo);
 	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
@@ -125,6 +130,14 @@ typedef struct f_smgr
 	int			(*smgr_fd) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
 } f_smgr;

+#define SMGR_MD		0
+#ifdef USE_UMBRA
+#define SMGR_UMBRA	1
+#define SMGR_DEFAULT	SMGR_UMBRA
+#else
+#define SMGR_DEFAULT	SMGR_MD
+#endif
+
 static const f_smgr smgrsw[] = {
 	/* magnetic disk */
 	{
@@ -132,6 +145,7 @@ static const f_smgr smgrsw[] = {
 		.smgr_shutdown = NULL,
 		.smgr_open = mdopen,
 		.smgr_close = mdclose,
+		.smgr_destroy = NULL,
 		.smgr_create = mdcreate,
 		.smgr_exists = mdexists,
 		.smgr_unlink = mdunlink,
@@ -148,7 +162,33 @@ static const f_smgr smgrsw[] = {
 		.smgr_immedsync = mdimmedsync,
 		.smgr_registersync = mdregistersync,
 		.smgr_fd = mdfd,
-	}
+	},
+#ifdef USE_UMBRA
+	/* Umbra storage manager */
+	{
+		.smgr_init = uminit,
+		.smgr_shutdown = NULL,
+		.smgr_open = umopen,
+		.smgr_close = umclose,
+		.smgr_destroy = umdestroy,
+		.smgr_create = umcreate,
+		.smgr_exists = umexists,
+		.smgr_unlink = umunlink,
+		.smgr_extend = umextend,
+		.smgr_zeroextend = umzeroextend,
+		.smgr_prefetch = umprefetch,
+		.smgr_maxcombine = ummaxcombine,
+		.smgr_readv = umreadv,
+		.smgr_startreadv = umstartreadv,
+		.smgr_writev = umwritev,
+		.smgr_writeback = umwriteback,
+		.smgr_nblocks = umnblocks,
+		.smgr_truncate = umtruncate,
+		.smgr_immedsync = umimmedsync,
+		.smgr_registersync = umregistersync,
+		.smgr_fd = umfd,
+	},
+#endif
 };

 static const int NSmgr = lengthof(smgrsw);
@@ -273,7 +313,8 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
 		reln->smgr_targblock = InvalidBlockNumber;
 		for (int i = 0; i <= MAX_FORKNUM; ++i)
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
-		reln->smgr_which = 0;	/* we only have md.c at present */
+		reln->smgr_which = SMGR_DEFAULT;
+		reln->smgr_private = NULL;

 		/* it is not pinned yet */
 		reln->pincount = 0;
@@ -331,6 +372,9 @@ smgrdestroy(SMgrRelation reln)
 	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 		smgrsw[reln->smgr_which].smgr_close(reln, forknum);

+	if (smgrsw[reln->smgr_which].smgr_destroy != NULL)
+		smgrsw[reln->smgr_which].smgr_destroy(reln);
+
 	dlist_delete(&reln->node);

 	if (hash_search(SMgrRelationHash,
diff --git a/src/backend/storage/smgr/umbra.c b/src/backend/storage/smgr/umbra.c
new file mode 100644
index 0000000000..4c4fd28dbf
--- /dev/null
+++ b/src/backend/storage/smgr/umbra.c
@@ -0,0 +1,167 @@
+/*-------------------------------------------------------------------------
+ *
+ * umbra.c
+ *	  Umbra storage manager skeleton.
+ *
+ * This file establishes Umbra as a separate smgr implementation from md.c.
+ * The initial implementation preserves md semantics by forwarding relation
+ * file operations to md.c.
+ *
+ * src/backend/storage/smgr/umbra.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "storage/md.h"
+#include "storage/smgr.h"
+#include "storage/umbra.h"
+#include "utils/memutils.h"
+
+typedef struct UmbraSmgrRelationState
+{
+	bool		initialized;
+} UmbraSmgrRelationState;
+
+void
+uminit(void)
+{
+}
+
+void
+umopen(SMgrRelation reln)
+{
+	UmbraSmgrRelationState *state;
+
+	Assert(reln->smgr_private == NULL);
+
+	state = MemoryContextAllocZero(TopMemoryContext,
+								   sizeof(UmbraSmgrRelationState));
+	state->initialized = true;
+	reln->smgr_private = state;
+
+	mdopen(reln);
+}
+
+void
+umclose(SMgrRelation reln, ForkNumber forknum)
+{
+	mdclose(reln, forknum);
+}
+
+void
+umdestroy(SMgrRelation reln)
+{
+	UmbraSmgrRelationState *state = reln->smgr_private;
+
+	if (state != NULL)
+	{
+		Assert(state->initialized);
+		pfree(state);
+		reln->smgr_private = NULL;
+	}
+}
+
+void
+umcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+	mdcreate(reln, forknum, isRedo);
+}
+
+bool
+umexists(SMgrRelation reln, ForkNumber forknum)
+{
+	return mdexists(reln, forknum);
+}
+
+void
+umunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
+{
+	mdunlink(rlocator, forknum, isRedo);
+}
+
+void
+umextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 const void *buffer, bool skipFsync)
+{
+	mdextend(reln, forknum, blocknum, buffer, skipFsync);
+}
+
+void
+umzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+			 int nblocks, bool skipFsync)
+{
+	mdzeroextend(reln, forknum, blocknum, nblocks, skipFsync);
+}
+
+bool
+umprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		   int nblocks)
+{
+	return mdprefetch(reln, forknum, blocknum, nblocks);
+}
+
+uint32
+ummaxcombine(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+	return mdmaxcombine(reln, forknum, blocknum);
+}
+
+void
+umreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		void **buffers, BlockNumber nblocks)
+{
+	mdreadv(reln, forknum, blocknum, buffers, nblocks);
+}
+
+void
+umstartreadv(PgAioHandle *ioh, SMgrRelation reln, ForkNumber forknum,
+			 BlockNumber blocknum, void **buffers, BlockNumber nblocks)
+{
+	mdstartreadv(ioh, reln, forknum, blocknum, buffers, nblocks);
+}
+
+void
+umwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+	mdwritev(reln, forknum, blocknum, buffers, nblocks, skipFsync);
+}
+
+void
+umwriteback(SMgrRelation reln, ForkNumber forknum,
+			BlockNumber blocknum, BlockNumber nblocks)
+{
+	mdwriteback(reln, forknum, blocknum, nblocks);
+}
+
+BlockNumber
+umnblocks(SMgrRelation reln, ForkNumber forknum)
+{
+	return mdnblocks(reln, forknum);
+}
+
+void
+umtruncate(SMgrRelation reln, ForkNumber forknum,
+		   BlockNumber old_blocks, BlockNumber nblocks)
+{
+	mdtruncate(reln, forknum, old_blocks, nblocks);
+}
+
+void
+umimmedsync(SMgrRelation reln, ForkNumber forknum)
+{
+	mdimmedsync(reln, forknum);
+}
+
+void
+umregistersync(SMgrRelation reln, ForkNumber forknum)
+{
+	mdregistersync(reln, forknum);
+}
+
+int
+umfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+	return mdfd(reln, forknum, blocknum, off);
+}
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 4f8113c144..2bd28af842 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -698,6 +698,9 @@
 /* Define to 1 to build with injection points. (--enable-injection-points) */
 #undef USE_INJECTION_POINTS

+/* Define to build with Umbra storage manager. (--with-umbra) */
+#undef USE_UMBRA
+
 /* Define to 1 to build with LDAP support. (--with-ldap) */
 #undef USE_LDAP

diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 09bd42fcf4..1076717b92 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -21,11 +21,11 @@

 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
- * cached file handles.  An SMgrRelation is created (if not already present)
- * by smgropen(), and destroyed by smgrdestroy().  Note that neither of these
- * operations imply I/O, they just create or destroy a hashtable entry.  (But
- * smgrdestroy() may release associated resources, such as OS-level file
- * descriptors.)
+ * cached storage-manager handles for a relation.  An SMgrRelation is created
+ * (if not already present) by smgropen(), and destroyed by smgrdestroy().
+ * Note that neither of these operations imply I/O, they just create or destroy
+ * a hashtable entry.  (But smgrdestroy() may release associated resources,
+ * such as OS-level file descriptors.)
  *
  * An SMgrRelation may be "pinned", to prevent it from being destroyed while
  * it's in use.  We use this to prevent pointers in relcache to smgr from being
@@ -53,6 +53,7 @@ typedef struct SMgrRelationData
 	 * submodules.  Do not touch them from elsewhere.
 	 */
 	int			smgr_which;		/* storage manager selector */
+	void	   *smgr_private;	/* implementation-private state */

 	/*
 	 * for md.c; per-fork arrays of the number of open segments
diff --git a/src/include/storage/umbra.h b/src/include/storage/umbra.h
new file mode 100644
index 0000000000..9a2873f96d
--- /dev/null
+++ b/src/include/storage/umbra.h
@@ -0,0 +1,54 @@
+/*-------------------------------------------------------------------------
+ *
+ * umbra.h
+ *	  Umbra storage manager public interface declarations.
+ *
+ * This header declares the Umbra smgr callback surface used by smgr.c when
+ * the build is configured with --with-umbra.
+ *
+ * src/include/storage/umbra.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UMBRA_H
+#define UMBRA_H
+
+#include "storage/aio_types.h"
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+#include "storage/smgr.h"
+
+extern void uminit(void);
+extern void umopen(SMgrRelation reln);
+extern void umclose(SMgrRelation reln, ForkNumber forknum);
+extern void umdestroy(SMgrRelation reln);
+extern void umcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern bool umexists(SMgrRelation reln, ForkNumber forknum);
+extern void umunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
+extern void umextend(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void umzeroextend(SMgrRelation reln, ForkNumber forknum,
+						 BlockNumber blocknum, int nblocks, bool skipFsync);
+extern bool umprefetch(SMgrRelation reln, ForkNumber forknum,
+					   BlockNumber blocknum, int nblocks);
+extern uint32 ummaxcombine(SMgrRelation reln, ForkNumber forknum,
+						   BlockNumber blocknum);
+extern void umreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					void **buffers, BlockNumber nblocks);
+extern void umstartreadv(PgAioHandle *ioh,
+						 SMgrRelation reln, ForkNumber forknum,
+						 BlockNumber blocknum,
+						 void **buffers, BlockNumber nblocks);
+extern void umwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					 const void **buffers, BlockNumber nblocks, bool skipFsync);
+extern void umwriteback(SMgrRelation reln, ForkNumber forknum,
+						BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber umnblocks(SMgrRelation reln, ForkNumber forknum);
+extern void umtruncate(SMgrRelation reln, ForkNumber forknum,
+					   BlockNumber old_blocks, BlockNumber nblocks);
+extern void umimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void umregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int	umfd(SMgrRelation reln, ForkNumber forknum,
+				 BlockNumber blocknum, uint32 *off);
+
+#endif							/* UMBRA_H */
-- 
2.50.1 (Apple Git-155)

Mingwei Jia

i@nayishan.top

22 days ago

In reply to: Mingwei Jia (#4)

[RFC PATCH v2 RESEND 03/10] umbra: add patch 2 umfile physical file manager and metadata storage primitives

---
src/backend/storage/smgr/Makefile | 1 +
src/backend/storage/smgr/meson.build | 1 +
src/backend/storage/smgr/umbra.c | 106 ++-
src/backend/storage/smgr/umfile.c | 1146 ++++++++++++++++++++++++++
src/include/storage/um_defs.h | 49 ++
src/include/storage/umbra.h | 12 +
src/include/storage/umfile.h | 101 +++
7 files changed, 1411 insertions(+), 5 deletions(-)
create mode 100644 src/backend/storage/smgr/umfile.c
create mode 100644 src/include/storage/um_defs.h
create mode 100644 src/include/storage/umfile.h
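
Reader's note on the segment arithmetic used throughout umfile.c below: the
file manager maps a fork-relative block number to a segment file and a byte
offset inside it, and derives the dense fork size from the last non-full
segment. The following is a minimal standalone sketch of that arithmetic, not
code from the patch. All `sketch_`/`SKETCH_` names are hypothetical, and the
constants assume the default build values (8 kB blocks, 131072 blocks per
segment); a real build takes BLCKSZ and RELSEG_SIZE from pg_config.h.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed defaults for illustration only; not taken from the patch. */
#define SKETCH_BLCKSZ		8192u
#define SKETCH_RELSEG_SIZE	131072u	/* blocks per segment file */

typedef uint32_t SketchBlockNumber;

/* Index of the segment file that holds block blkno. */
static inline SketchBlockNumber
sketch_segno(SketchBlockNumber blkno)
{
	return blkno / SKETCH_RELSEG_SIZE;
}

/* Byte offset of block blkno inside its segment file. */
static inline uint64_t
sketch_seg_offset(SketchBlockNumber blkno)
{
	return (uint64_t) SKETCH_BLCKSZ * (blkno % SKETCH_RELSEG_SIZE);
}

/*
 * Dense fork size: every segment before last_segno is assumed full, and the
 * last one contributes however many blocks it currently holds.
 */
static inline SketchBlockNumber
sketch_dense_nblocks(SketchBlockNumber last_segno,
					 SketchBlockNumber blocks_in_last)
{
	assert(blocks_in_last <= SKETCH_RELSEG_SIZE);
	return last_segno * SKETCH_RELSEG_SIZE + blocks_in_last;
}
```

Under these assumptions, block 131072 is the first block of segment 1 at
offset 0, which is the same split that umfile_getseg() and the
`BLCKSZ * (blkno % RELSEG_SIZE)` offset computations in the patch rely on.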

diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 537e7b65f4..32d72d8831 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -19,6 +19,7 @@ OBJS = \

 ifeq ($(with_umbra), yes)
 OBJS += \
+	umfile.o \
 	umbra.o
 endif

diff --git a/src/backend/storage/smgr/meson.build b/src/backend/storage/smgr/meson.build
index ba28e59f2f..00313617bf 100644
--- a/src/backend/storage/smgr/meson.build
+++ b/src/backend/storage/smgr/meson.build
@@ -8,6 +8,7 @@ backend_sources += files(

 if get_option('umbra').enabled()
   backend_sources += files(
+    'umfile.c',
     'umbra.c',
   )
 endif
diff --git a/src/backend/storage/smgr/umbra.c b/src/backend/storage/smgr/umbra.c
index 4c4fd28dbf..2c08231587 100644
--- a/src/backend/storage/smgr/umbra.c
+++ b/src/backend/storage/smgr/umbra.c
@@ -4,8 +4,8 @@
  *	  Umbra storage manager skeleton.
  *
  * This file establishes Umbra as a separate smgr implementation from md.c.
- * The initial implementation preserves md semantics by forwarding relation
- * file operations to md.c.
+ * Data-fork operations remain md-backed here, while relation-local metadata
+ * file operations go through umfile.
  *
  * src/backend/storage/smgr/umbra.c
  *
@@ -15,17 +15,87 @@

 #include "storage/md.h"
 #include "storage/smgr.h"
+#include "storage/umfile.h"
 #include "storage/umbra.h"
 #include "utils/memutils.h"

 typedef struct UmbraSmgrRelationState
 {
-	bool		initialized;
+	UmbraFileContext *filectx;
 } UmbraSmgrRelationState;

+static UmbraFileContext *um_relation_filectx(SMgrRelation reln);
+
+bool
+UmMetadataExists(SMgrRelation reln)
+{
+	return umfile_exists(um_relation_filectx(reln),
+						 UMBRA_METADATA_FORKNUM,
+						 UMFILE_EXISTS_DENSE);
+}
+
+bool
+UmMetadataOpenOrCreate(SMgrRelation reln, bool isRedo, bool *created)
+{
+	return umfile_open_or_create(um_relation_filectx(reln),
+								 UMBRA_METADATA_FORKNUM,
+								 isRedo,
+								 created);
+}
+
+BlockNumber
+UmMetadataNblocks(SMgrRelation reln)
+{
+	return umfile_nblocks(um_relation_filectx(reln),
+						  UMBRA_METADATA_FORKNUM,
+						  UMFILE_NBLOCKS_DENSE);
+}
+
+void
+UmMetadataRead(SMgrRelation reln, BlockNumber blkno, void *buffer)
+{
+	void	   *buffers[1];
+
+	buffers[0] = buffer;
+	umfile_readv(um_relation_filectx(reln), UMBRA_METADATA_FORKNUM, blkno,
+				 buffers, 1);
+}
+
+void
+UmMetadataWrite(SMgrRelation reln, BlockNumber blkno, const void *buffer,
+				bool skipFsync)
+{
+	const void *buffers[1];
+
+	buffers[0] = buffer;
+	umfile_writev(um_relation_filectx(reln), UMBRA_METADATA_FORKNUM, blkno,
+				  buffers, 1, skipFsync);
+}
+
+void
+UmMetadataExtend(SMgrRelation reln, BlockNumber blkno, const void *buffer,
+				 bool skipFsync)
+{
+	umfile_extend(um_relation_filectx(reln), UMBRA_METADATA_FORKNUM, blkno,
+				  buffer, skipFsync);
+}
+
+void
+UmMetadataImmediateSync(SMgrRelation reln)
+{
+	umfile_immedsync(um_relation_filectx(reln), UMBRA_METADATA_FORKNUM);
+}
+
+void
+UmMetadataUnlink(RelFileLocatorBackend rlocator, bool isRedo)
+{
+	umfile_unlink(rlocator, UMBRA_METADATA_FORKNUM, isRedo);
+}
+
 void
 uminit(void)
 {
+	umfile_init();
 }

 void
@@ -37,7 +107,7 @@ umopen(SMgrRelation reln)

 	state = MemoryContextAllocZero(TopMemoryContext,
 								   sizeof(UmbraSmgrRelationState));
-	state->initialized = true;
+	state->filectx = umfile_ctx_acquire(reln->smgr_rlocator);
 	reln->smgr_private = state;

 	mdopen(reln);
@@ -54,9 +124,10 @@ umdestroy(SMgrRelation reln)
 {
 	UmbraSmgrRelationState *state = reln->smgr_private;

+	umfile_ctx_forget(reln->smgr_rlocator);
+
 	if (state != NULL)
 	{
-		Assert(state->initialized);
 		pfree(state);
 		reln->smgr_private = NULL;
 	}
@@ -77,6 +148,17 @@ umexists(SMgrRelation reln, ForkNumber forknum)
 void
 umunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
 {
+	umfile_ctx_forget(rlocator);
+
+	if (forknum == UMBRA_METADATA_FORKNUM)
+	{
+		UmMetadataUnlink(rlocator, isRedo);
+		return;
+	}
+
+	if (forknum == MAIN_FORKNUM || forknum == InvalidForkNumber)
+		UmMetadataUnlink(rlocator, isRedo);
+
 	mdunlink(rlocator, forknum, isRedo);
 }

@@ -165,3 +247,17 @@ umfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
 {
 	return mdfd(reln, forknum, blocknum, off);
 }
+
+static UmbraFileContext *
+um_relation_filectx(SMgrRelation reln)
+{
+	UmbraSmgrRelationState *state = reln->smgr_private;
+
+	if (state == NULL)
+		return umfile_ctx_acquire(reln->smgr_rlocator);
+
+	if (state->filectx == NULL)
+		state->filectx = umfile_ctx_acquire(reln->smgr_rlocator);
+
+	return state->filectx;
+}
diff --git a/src/backend/storage/smgr/umfile.c b/src/backend/storage/smgr/umfile.c
new file mode 100644
index 0000000000..f8d1140840
--- /dev/null
+++ b/src/backend/storage/smgr/umfile.c
@@ -0,0 +1,1146 @@
+/*-------------------------------------------------------------------------
+ *
+ * umfile.c
+ *	  Umbra backend-local file/segment helpers.
+ *
+ * This layer owns backend-local file contexts keyed by RelFileLocatorBackend
+ * and provides physical fork/segment management beneath Umbra metadata and
+ * mapping code.
+ *
+ * src/backend/storage/smgr/umfile.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <fcntl.h>
+#include <unistd.h>
+
+#include "access/xlogutils.h"
+#include "commands/tablespace.h"
+#include "common/relpath.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "storage/um_defs.h"
+#include "storage/umfile.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/* Behavior flags for segment open helpers. */
+#define UM_EXTENSION_FAIL				(1 << 0)
+#define UM_EXTENSION_RETURN_NULL		(1 << 1)
+#define UM_EXTENSION_CREATE				(1 << 2)
+#define UM_EXTENSION_CREATE_RECOVERY	(1 << 3)
+#define UM_EXTENSION_DONT_OPEN			(1 << 5)
+
+typedef struct UmCtxRegistryEntry
+{
+	RelFileLocatorBackend rlocator;
+	UmbraFileContext *ctx;
+} UmCtxRegistryEntry;
+
+typedef struct UmfdVec
+{
+	File		umfd_vfd;
+	BlockNumber	umfd_segno;
+} UmfdVec;
+
+struct UmbraFileContext
+{
+	RelFileLocatorBackend rlocator;
+	int			num_open_segs[UMBRA_FORK_SLOTS];
+	UmfdVec    *seg_fds[UMBRA_FORK_SLOTS];
+	uint32		refcount;
+};
+
+static MemoryContext UmFileCxt = NULL;
+static HTAB *UmFileContextHash = NULL;
+
+static void umfile_ctx_registry_init(void);
+static UmbraFileContext *umfile_ctx_create(RelFileLocatorBackend rlocator);
+static void umfile_ctx_destroy(UmbraFileContext *ctx);
+static void umfile_close_open_segments(UmbraFileContext *ctx,
+									   ForkNumber forknum);
+static bool umfile_create(UmbraFileContext *ctx, ForkNumber forknum,
+						  bool isRedo);
+static int	umfile_open_flags(void);
+static void umfile_fdvec_resize(UmbraFileContext *ctx, ForkNumber forknum,
+								int nseg);
+static inline UmfdVec *umfile_v_get(UmbraFileContext *ctx,
+									ForkNumber forknum, int segindex);
+static BlockNumber umfile_nblocks_in_seg(File vfd);
+static RelPathStr umfile_segpath(RelFileLocatorBackend rlocator,
+								 ForkNumber forknum, BlockNumber segno);
+static UmfdVec *umfile_openseg(UmbraFileContext *ctx,
+							   RelFileLocatorBackend rlocator,
+							   ForkNumber forknum,
+							   BlockNumber segno, int oflags);
+static UmfdVec *umfile_openfork(UmbraFileContext *ctx,
+								RelFileLocatorBackend rlocator,
+								ForkNumber forknum, int behavior);
+static UmfdVec *umfile_getseg(UmbraFileContext *ctx,
+							  RelFileLocatorBackend rlocator,
+							  ForkNumber forknum, BlockNumber blkno,
+							  bool skipFsync, int behavior);
+static bool umfile_fork_has_open_segment(UmbraFileContext *ctx,
+										 ForkNumber forknum);
+static bool umfile_fork_has_open_segment_on_disk(UmbraFileContext *ctx,
+												 RelFileLocatorBackend rlocator,
+												 ForkNumber forknum);
+static inline bool umfile_seg_entry_is_open(const UmfdVec *seg);
+static inline void umfile_seg_entry_reset(UmfdVec *seg);
+
+void
+umfile_init(void)
+{
+	HASHCTL		ctl;
+
+	if (UmFileContextHash != NULL)
+		return;
+
+	UmFileCxt = AllocSetContextCreate(TopMemoryContext,
+									  "UmFile",
+									  ALLOCSET_DEFAULT_SIZES);
+	MemoryContextAllowInCriticalSection(UmFileCxt, true);
+
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(RelFileLocatorBackend);
+	ctl.entrysize = sizeof(UmCtxRegistryEntry);
+	ctl.hcxt = UmFileCxt;
+
+	UmFileContextHash = hash_create("Umbra file context registry",
+									256,
+									&ctl,
+									HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+}
+
+UmbraFileContext *
+umfile_ctx_lookup(RelFileLocatorBackend rlocator)
+{
+	UmCtxRegistryEntry *entry;
+
+	umfile_ctx_registry_init();
+	entry = hash_search(UmFileContextHash, &rlocator, HASH_FIND, NULL);
+	if (entry == NULL)
+		return NULL;
+
+	return entry->ctx;
+}
+
+UmbraFileContext *
+umfile_ctx_acquire(RelFileLocatorBackend rlocator)
+{
+	UmCtxRegistryEntry *entry;
+	bool		found;
+
+	umfile_ctx_registry_init();
+	entry = hash_search(UmFileContextHash, &rlocator, HASH_ENTER, &found);
+	if (!found)
+		entry->ctx = umfile_ctx_create(rlocator);
+	entry->ctx->refcount++;
+
+	return entry->ctx;
+}
+
+UmbraFileContext *
+umfile_ctx_create_temporary(RelFileLocatorBackend rlocator)
+{
+	umfile_ctx_registry_init();
+	return umfile_ctx_create(rlocator);
+}
+
+void
+umfile_ctx_destroy_temporary(UmbraFileContext *ctx)
+{
+	if (ctx == NULL)
+		return;
+
+	umfile_ctx_destroy(ctx);
+}
+
+void
+umfile_ctx_release(RelFileLocatorBackend rlocator)
+{
+	UmCtxRegistryEntry *entry;
+	UmbraFileContext *ctx;
+
+	if (UmFileContextHash == NULL)
+		return;
+
+	entry = hash_search(UmFileContextHash, &rlocator, HASH_FIND, NULL);
+	if (entry == NULL)
+		return;
+
+	ctx = entry->ctx;
+	Assert(ctx->refcount > 0);
+	ctx->refcount--;
+
+	if (ctx->refcount == 0)
+	{
+		umfile_ctx_destroy(ctx);
+		(void) hash_search(UmFileContextHash, &rlocator, HASH_REMOVE, NULL);
+	}
+}
+
+void
+umfile_ctx_forget(RelFileLocatorBackend rlocator)
+{
+	UmCtxRegistryEntry *entry;
+	UmbraFileContext *ctx;
+
+	if (UmFileContextHash == NULL)
+		return;
+
+	entry = hash_search(UmFileContextHash, &rlocator, HASH_FIND, NULL);
+	if (entry == NULL)
+		return;
+
+	ctx = entry->ctx;
+	for (ForkNumber forknum = 0; forknum <= UMBRA_METADATA_FORKNUM; forknum++)
+		umfile_close_open_segments(ctx, forknum);
+
+	if (ctx->refcount == 0)
+	{
+		umfile_ctx_destroy(ctx);
+		(void) hash_search(UmFileContextHash, &rlocator, HASH_REMOVE, NULL);
+	}
+}
+
+void
+umfile_ctx_close_fork(UmbraFileContext *ctx, ForkNumber forknum)
+{
+	if (ctx == NULL)
+		return;
+
+	umfile_close_open_segments(ctx, forknum);
+}
+
+bool
+umfile_ctx_fork_exists(UmbraFileContext *ctx, ForkNumber forknum,
+					   UmFileExistsMode mode)
+{
+	if (ctx == NULL)
+		return false;
+
+	return umfile_exists(ctx, forknum, mode);
+}
+
+BlockNumber
+umfile_ctx_get_nblocks(UmbraFileContext *ctx, ForkNumber forknum,
+					   UmFileNblocksMode mode)
+{
+	Assert(ctx != NULL);
+	return umfile_nblocks(ctx, forknum, mode);
+}
+
+void
+umfile_ctx_read(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blkno,
+				char *buffer, int nbytes)
+{
+	UmfdVec    *seg;
+	off_t		offset;
+	ssize_t		got;
+
+	Assert(ctx != NULL);
+	Assert(buffer != NULL);
+	Assert(nbytes > 0 && nbytes <= BLCKSZ);
+
+	seg = umfile_getseg(ctx, ctx->rlocator, forknum, blkno,
+						false,
+						UM_EXTENSION_FAIL | UM_EXTENSION_CREATE_RECOVERY);
+	offset = (off_t) BLCKSZ * (blkno % ((BlockNumber) RELSEG_SIZE));
+	got = FileRead(seg->umfd_vfd, buffer, nbytes, offset,
+				   WAIT_EVENT_DATA_FILE_READ);
+	if (got < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read block %u in file \"%s\": %m",
+						blkno, FilePathName(seg->umfd_vfd))));
+	if (got != nbytes)
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read block %u in file \"%s\"",
+						blkno, FilePathName(seg->umfd_vfd)),
+				 errdetail("Read only %zd of %d bytes.", got, nbytes)));
+}
+
+void
+umfile_ctx_write(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blkno,
+				 const char *buffer, int nbytes, bool skipFsync)
+{
+	UmfdVec    *seg;
+	BlockNumber	nblocks;
+	off_t		offset;
+	ssize_t		wrote;
+
+	Assert(ctx != NULL);
+	Assert(buffer != NULL);
+	Assert(nbytes > 0 && nbytes <= BLCKSZ);
+
+	nblocks = umfile_nblocks(ctx, forknum, UMFILE_NBLOCKS_DENSE);
+	if (blkno >= nblocks)
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("cannot overwrite block %u in relation %u/%u/%u fork %d",
+						blkno,
+						ctx->rlocator.locator.spcOid,
+						ctx->rlocator.locator.dbOid,
+						ctx->rlocator.locator.relNumber,
+						forknum),
+				 errdetail("Current fork size is %u blocks.", nblocks)));
+
+	seg = umfile_getseg(ctx, ctx->rlocator, forknum, blkno,
+						skipFsync,
+						UM_EXTENSION_FAIL | UM_EXTENSION_CREATE_RECOVERY);
+	offset = (off_t) BLCKSZ * (blkno % ((BlockNumber) RELSEG_SIZE));
+	wrote = FileWrite(seg->umfd_vfd, buffer, nbytes, offset,
+					  WAIT_EVENT_DATA_FILE_WRITE);
+	if (wrote < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write block %u in file \"%s\": %m",
+						blkno, FilePathName(seg->umfd_vfd))));
+	if (wrote != nbytes)
+		ereport(ERROR,
+				(errcode(ERRCODE_DISK_FULL),
+				 errmsg("could not write block %u in file \"%s\"",
+						blkno, FilePathName(seg->umfd_vfd)),
+				 errdetail("Wrote only %zd of %d bytes.", wrote, nbytes)));
+
+	/*
+	 * Sync policy is explicit at this layer: callers use
+	 * umfile_registersync()/umfile_immedsync() for durable requests.
+	 */
+	(void) skipFsync;
+}
+
+void
+umfile_ctx_extend(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blkno,
+				  const char *buffer)
+{
+	BlockNumber	nblocks;
+
+	Assert(ctx != NULL);
+	Assert(buffer != NULL);
+
+	(void) umfile_open_or_create(ctx, forknum, false, NULL);
+	nblocks = umfile_nblocks(ctx, forknum, UMFILE_NBLOCKS_DENSE);
+	if (blkno != nblocks)
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("cannot extend relation %u/%u/%u fork %d at block %u",
+						ctx->rlocator.locator.spcOid,
+						ctx->rlocator.locator.dbOid,
+						ctx->rlocator.locator.relNumber,
+						forknum, blkno),
+				 errdetail("Expected next block %u.", nblocks)));
+
+	umfile_extend(ctx, forknum, blkno, buffer, true);
+}
+
+void
+umfile_ctx_unlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum,
+					  bool isRedo)
+{
+	umfile_unlink(rlocator, forknum, isRedo);
+}
+
+bool
+umfile_metadata_exists(UmbraFileContext *ctx)
+{
+	return umfile_exists(ctx, UMBRA_METADATA_FORKNUM, UMFILE_EXISTS_DENSE);
+}
+
+bool
+umfile_metadata_open_or_create(UmbraFileContext *ctx, bool isRedo, bool *created)
+{
+	return umfile_open_or_create(ctx, UMBRA_METADATA_FORKNUM, isRedo, created);
+}
+
+BlockNumber
+umfile_metadata_nblocks(UmbraFileContext *ctx)
+{
+	return umfile_nblocks(ctx, UMBRA_METADATA_FORKNUM, UMFILE_NBLOCKS_DENSE);
+}
+
+void
+umfile_metadata_read(UmbraFileContext *ctx, BlockNumber blkno, void *buffer)
+{
+	void	   *buffers[1];
+
+	buffers[0] = buffer;
+	umfile_readv(ctx, UMBRA_METADATA_FORKNUM, blkno, buffers, 1);
+}
+
+void
+umfile_metadata_write(UmbraFileContext *ctx, BlockNumber blkno, const void *buffer)
+{
+	const void *buffers[1];
+
+	buffers[0] = buffer;
+	umfile_writev(ctx, UMBRA_METADATA_FORKNUM, blkno, buffers, 1, false);
+}
+
+void
+umfile_metadata_extend(UmbraFileContext *ctx, BlockNumber blkno, const void *buffer)
+{
+	umfile_extend(ctx, UMBRA_METADATA_FORKNUM, blkno, buffer, false);
+}
+
+void
+umfile_metadata_immedsync(UmbraFileContext *ctx)
+{
+	umfile_immedsync(ctx, UMBRA_METADATA_FORKNUM);
+}
+
+void
+umfile_metadata_unlink(RelFileLocatorBackend rlocator, bool isRedo)
+{
+	umfile_unlink(rlocator, UMBRA_METADATA_FORKNUM, isRedo);
+}
+
+bool
+umfile_exists(UmbraFileContext *ctx, ForkNumber forknum, UmFileExistsMode mode)
+{
+	Assert(ctx != NULL);
+	(void) mode;
+
+	if (umfile_fork_has_open_segment(ctx, forknum))
+	{
+		if (umfile_fork_has_open_segment_on_disk(ctx, ctx->rlocator, forknum))
+			return true;
+
+		umfile_close_open_segments(ctx, forknum);
+	}
+
+	return (umfile_openfork(ctx, ctx->rlocator, forknum,
+							UM_EXTENSION_RETURN_NULL) != NULL);
+}
+
+bool
+umfile_open_or_create(UmbraFileContext *ctx, ForkNumber forknum,
+					  bool isRedo, bool *created)
+{
+	UmfdVec    *seg;
+	bool		was_created;
+
+	Assert(ctx != NULL);
+
+	if (created != NULL)
+		*created = false;
+
+	seg = umfile_openfork(ctx, ctx->rlocator, forknum,
+						  UM_EXTENSION_RETURN_NULL);
+	if (seg != NULL)
+		return true;
+
+	was_created = umfile_create(ctx, forknum, isRedo);
+	if (created != NULL)
+		*created = was_created;
+
+	return true;
+}
+
+BlockNumber
+umfile_nblocks(UmbraFileContext *ctx, ForkNumber forknum, UmFileNblocksMode mode)
+{
+	UmfdVec    *seg;
+	BlockNumber	segno;
+	BlockNumber	nblocks;
+
+	Assert(ctx != NULL);
+	(void) mode;
+
+	if (umfile_openfork(ctx, ctx->rlocator, forknum,
+						UM_EXTENSION_RETURN_NULL) == NULL)
+		return 0;
+
+	Assert(ctx->num_open_segs[forknum] > 0);
+	segno = ctx->num_open_segs[forknum] - 1;
+	seg = umfile_v_get(ctx, forknum, segno);
+
+	for (;;)
+	{
+		nblocks = umfile_nblocks_in_seg(seg->umfd_vfd);
+		if (nblocks > (BlockNumber) RELSEG_SIZE)
+			elog(FATAL, "Umbra segment too big");
+		if (nblocks < (BlockNumber) RELSEG_SIZE)
+			return (segno * ((BlockNumber) RELSEG_SIZE)) + nblocks;
+
+		segno++;
+		seg = umfile_openseg(ctx, ctx->rlocator, forknum, segno, 0);
+		if (seg == NULL)
+			return segno * ((BlockNumber) RELSEG_SIZE);
+	}
+}
+
+void
+umfile_readv(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
+			 void **buffers, BlockNumber nblocks)
+{
+	for (BlockNumber i = 0; i < nblocks; i++)
+		umfile_ctx_read(ctx, forknum, blocknum + i, buffers[i], BLCKSZ);
+}
+
+void
+umfile_writev(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
+			  const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+	for (BlockNumber i = 0; i < nblocks; i++)
+		umfile_ctx_write(ctx, forknum, blocknum + i, buffers[i], BLCKSZ,
+						 skipFsync);
+}
+
+void
+umfile_extend(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
+			  const void *buffer, bool skipFsync)
+{
+	UmfdVec    *seg;
+	off_t		offset;
+	ssize_t		wrote;
+
+	Assert(ctx != NULL);
+	Assert(buffer != NULL);
+
+	seg = umfile_getseg(ctx, ctx->rlocator, forknum, blocknum,
+						skipFsync,
+						UM_EXTENSION_FAIL | UM_EXTENSION_CREATE);
+	offset = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	wrote = FileWrite(seg->umfd_vfd, buffer, BLCKSZ, offset,
+					  WAIT_EVENT_DATA_FILE_EXTEND);
+	if (wrote < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not extend file \"%s\": %m",
+						FilePathName(seg->umfd_vfd))));
+	if (wrote != BLCKSZ)
+		ereport(ERROR,
+				(errcode(ERRCODE_DISK_FULL),
+				 errmsg("could not extend file \"%s\" at block %u",
+						FilePathName(seg->umfd_vfd), blocknum),
+				 errdetail("Wrote only %zd of %d bytes.", wrote, BLCKSZ)));
+
+	(void) skipFsync;
+}
+
+void
+umfile_zeroextend(UmbraFileContext *ctx, ForkNumber forknum,
+				  BlockNumber blocknum, int nblocks, bool skipFsync)
+{
+	Assert(ctx != NULL);
+	Assert(nblocks >= 0);
+
+	while (nblocks > 0)
+	{
+		UmfdVec    *seg;
+		BlockNumber nblocks_this_segment;
+		off_t		offset;
+		int			ret;
+
+		seg = umfile_getseg(ctx, ctx->rlocator, forknum, blocknum,
+							skipFsync,
+							UM_EXTENSION_FAIL | UM_EXTENSION_CREATE);
+		offset = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		nblocks_this_segment =
+			Min((BlockNumber) nblocks,
+				((BlockNumber) RELSEG_SIZE) -
+				(blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+		ret = FileZero(seg->umfd_vfd,
+					   offset,
+					   (off_t) BLCKSZ * nblocks_this_segment,
+					   WAIT_EVENT_DATA_FILE_EXTEND);
+		if (ret < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not zero-extend file \"%s\": %m",
+							FilePathName(seg->umfd_vfd))));
+
+		nblocks -= nblocks_this_segment;
+		blocknum += nblocks_this_segment;
+	}
+}
+
+void
+umfile_truncate(UmbraFileContext *ctx, ForkNumber forknum,
+				BlockNumber old_blocks, BlockNumber nblocks)
+{
+	int			curopensegs;
+
+	Assert(ctx != NULL);
+
+	if (nblocks > old_blocks)
+	{
+		if (InRecovery)
+			return;
+
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("cannot truncate relation %u/%u/%u fork %d to %u blocks: current size is only %u blocks",
+						ctx->rlocator.locator.spcOid,
+						ctx->rlocator.locator.dbOid,
+						ctx->rlocator.locator.relNumber,
+						forknum,
+						nblocks,
+						old_blocks)));
+	}
+
+	if (nblocks == old_blocks)
+		return;
+
+	/*
+	 * Bring all dense segments into the local array first, then trim from the
+	 * tail.  This keeps the truncate contract local to the file manager.
+	 */
+	(void) umfile_nblocks(ctx, forknum, UMFILE_NBLOCKS_DENSE);
+	curopensegs = ctx->num_open_segs[forknum];
+
+	while (curopensegs > 0)
+	{
+		UmfdVec    *seg;
+		BlockNumber	priorblocks;
+
+		priorblocks = (curopensegs - 1) * ((BlockNumber) RELSEG_SIZE);
+		seg = umfile_v_get(ctx, forknum, curopensegs - 1);
+
+		if (priorblocks >= nblocks)
+		{
+			if (FileTruncate(seg->umfd_vfd, 0, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+							FilePathName(seg->umfd_vfd))));
+
+			if (seg != umfile_v_get(ctx, forknum, 0))
+			{
+				FileClose(seg->umfd_vfd);
+				umfile_fdvec_resize(ctx, forknum, curopensegs - 1);
+			}
+		}
+		else if (priorblocks + ((BlockNumber) RELSEG_SIZE) > nblocks)
+		{
+			BlockNumber	lastsegblocks;
+
+			lastsegblocks = nblocks - priorblocks;
+			if (FileTruncate(seg->umfd_vfd,
+							 (off_t) lastsegblocks * BLCKSZ,
+							 WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\" to %u blocks: %m",
+							FilePathName(seg->umfd_vfd),
+							nblocks)));
+		}
+		else
+			break;
+
+		curopensegs--;
+	}
+}
+
+void
+umfile_immedsync(UmbraFileContext *ctx, ForkNumber forknum)
+{
+	int			segno;
+	int			min_inactive_seg;
+
+	Assert(ctx != NULL);
+
+	(void) umfile_nblocks(ctx, forknum, UMFILE_NBLOCKS_DENSE);
+	min_inactive_seg = segno = ctx->num_open_segs[forknum];
+
+	while (umfile_openseg(ctx, ctx->rlocator, forknum, segno, 0) != NULL)
+		segno++;
+
+	while (segno > 0)
+	{
+		UmfdVec    *seg = umfile_v_get(ctx, forknum, segno - 1);
+
+		if (FileSync(seg->umfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(seg->umfd_vfd))));
+
+		if (segno > min_inactive_seg)
+		{
+			FileClose(seg->umfd_vfd);
+			umfile_fdvec_resize(ctx, forknum, segno - 1);
+		}
+
+		segno--;
+	}
+}
+
+void
+umfile_registersync(UmbraFileContext *ctx, ForkNumber forknum)
+{
+	/*
+	 * Registering durability at this boundary is implemented as an immediate
+	 * fsync.
+	 */
+	umfile_immedsync(ctx, forknum);
+}
+
+void
+umfile_unlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
+{
+	if (forknum == InvalidForkNumber)
+	{
+		for (forknum = 0; forknum <= UMBRA_METADATA_FORKNUM; forknum++)
+			umfile_unlink(rlocator, forknum, isRedo);
+		return;
+	}
+
+	for (BlockNumber segno = 0;; segno++)
+	{
+		RelPathStr	path;
+
+		path = umfile_segpath(rlocator, forknum, segno);
+		if (unlink(path.str) < 0)
+		{
+			if (FILE_POSSIBLY_DELETED(errno))
+			{
+				if (segno == 0 && isRedo)
+					return;
+				break;
+			}
+
+			ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path.str)));
+			break;
+		}
+	}
+}
+
+static void
+umfile_ctx_registry_init(void)
+{
+	if (UmFileContextHash == NULL)
+		umfile_init();
+
+	Assert(UmFileContextHash != NULL);
+}
+
+static UmbraFileContext *
+umfile_ctx_create(RelFileLocatorBackend rlocator)
+{
+	UmbraFileContext *ctx;
+
+	Assert(UmFileCxt != NULL);
+
+	ctx = MemoryContextAllocZero(UmFileCxt, sizeof(UmbraFileContext));
+	ctx->rlocator = rlocator;
+
+	for (ForkNumber forknum = 0; forknum <= UMBRA_METADATA_FORKNUM; forknum++)
+	{
+		ctx->num_open_segs[forknum] = 0;
+		ctx->seg_fds[forknum] = NULL;
+	}
+
+	return ctx;
+}
+
+static void
+umfile_ctx_destroy(UmbraFileContext *ctx)
+{
+	if (ctx == NULL)
+		return;
+
+	for (ForkNumber forknum = 0; forknum <= UMBRA_METADATA_FORKNUM; forknum++)
+		umfile_close_open_segments(ctx, forknum);
+
+	pfree(ctx);
+}
+
+static void
+umfile_close_open_segments(UmbraFileContext *ctx, ForkNumber forknum)
+{
+	int			nopensegs;
+
+	Assert(ctx != NULL);
+
+	nopensegs = ctx->num_open_segs[forknum];
+	while (nopensegs > 0)
+	{
+		UmfdVec    *seg = umfile_v_get(ctx, forknum, nopensegs - 1);
+
+		if (umfile_seg_entry_is_open(seg))
+			FileClose(seg->umfd_vfd);
+		umfile_fdvec_resize(ctx, forknum, nopensegs - 1);
+		nopensegs--;
+	}
+}
+
+static bool
+umfile_create(UmbraFileContext *ctx, ForkNumber forknum, bool isRedo)
+{
+	RelPathStr	path;
+	File		fd;
+	UmfdVec    *seg;
+	bool		created = false;
+
+	Assert(ctx != NULL);
+
+	if (isRedo && ctx->num_open_segs[forknum] > 0)
+		return false;
+
+	if (ctx->num_open_segs[forknum] > 0)
+		umfile_close_open_segments(ctx, forknum);
+
+	TablespaceCreateDbspace(ctx->rlocator.locator.spcOid,
+							ctx->rlocator.locator.dbOid,
+							isRedo);
+
+	path = umfile_segpath(ctx->rlocator, forknum, 0);
+	fd = PathNameOpenFile(path.str, umfile_open_flags() | O_CREAT | O_EXCL);
+	if (fd < 0)
+	{
+		int			save_errno = errno;
+
+		if (isRedo)
+			fd = PathNameOpenFile(path.str, umfile_open_flags());
+		if (fd < 0)
+		{
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create file \"%s\": %m", path.str)));
+		}
+	}
+	else
+		created = true;
+
+	umfile_fdvec_resize(ctx, forknum, 1);
+	seg = umfile_v_get(ctx, forknum, 0);
+	seg->umfd_vfd = fd;
+	seg->umfd_segno = 0;
+
+	return created;
+}
+
+static int
+umfile_open_flags(void)
+{
+	int			flags = O_RDWR | PG_BINARY;
+
+	if (io_direct_flags & IO_DIRECT_DATA)
+		flags |= PG_O_DIRECT;
+
+	return flags;
+}
+
+static void
+umfile_fdvec_resize(UmbraFileContext *ctx, ForkNumber forknum, int nseg)
+{
+	Assert(ctx != NULL);
+	Assert(nseg >= 0);
+
+	if (nseg == 0)
+	{
+		if (ctx->num_open_segs[forknum] > 0)
+			pfree(ctx->seg_fds[forknum]);
+		ctx->seg_fds[forknum] = NULL;
+		ctx->num_open_segs[forknum] = 0;
+		return;
+	}
+
+	if (ctx->num_open_segs[forknum] == 0)
+	{
+		ctx->seg_fds[forknum] =
+			MemoryContextAlloc(UmFileCxt, sizeof(UmfdVec) * nseg);
+	}
+	else if (nseg > ctx->num_open_segs[forknum])
+	{
+		ctx->seg_fds[forknum] =
+			repalloc(ctx->seg_fds[forknum], sizeof(UmfdVec) * nseg);
+	}
+
+	ctx->num_open_segs[forknum] = nseg;
+}
+
+static inline UmfdVec *
+umfile_v_get(UmbraFileContext *ctx, ForkNumber forknum, int segindex)
+{
+	Assert(ctx != NULL);
+	Assert(segindex >= 0);
+	Assert(segindex < ctx->num_open_segs[forknum]);
+	return &ctx->seg_fds[forknum][segindex];
+}
+
+static BlockNumber
+umfile_nblocks_in_seg(File vfd)
+{
+	pgoff_t		size;
+
+	size = FileSize(vfd);
+	if (size < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not determine size of file \"%s\": %m",
+						FilePathName(vfd))));
+	if ((size % BLCKSZ) != 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("file \"%s\" has partial block contents",
+						FilePathName(vfd)),
+				 errdetail("File size %lld is not a multiple of %d bytes.",
+						   (long long) size, BLCKSZ)));
+
+	return (BlockNumber) (size / BLCKSZ);
+}
+
+static RelPathStr
+umfile_segpath(RelFileLocatorBackend rlocator, ForkNumber forknum,
+			   BlockNumber segno)
+{
+	RelPathStr	base;
+	RelPathStr	fullpath;
+
+	if (forknum == UMBRA_METADATA_FORKNUM)
+		base = UmMetadataRelPathBackend(rlocator);
+	else
+		base = relpath(rlocator, forknum);
+
+	if (segno == 0)
+		return base;
+
+	snprintf(fullpath.str, sizeof(fullpath.str), "%s.%u", base.str, segno);
+	return fullpath;
+}
+
+static UmfdVec *
+umfile_openseg(UmbraFileContext *ctx, RelFileLocatorBackend rlocator,
+			   ForkNumber forknum, BlockNumber segno, int oflags)
+{
+	UmfdVec    *seg;
+	RelPathStr	path;
+	File		fd;
+	int			old_nseg;
+
+	Assert(ctx != NULL);
+
+	old_nseg = ctx->num_open_segs[forknum];
+	if (segno < (BlockNumber) old_nseg)
+	{
+		seg = umfile_v_get(ctx, forknum, (int) segno);
+		if (umfile_seg_entry_is_open(seg))
+			return seg;
+	}
+
+	path = umfile_segpath(rlocator, forknum, segno);
+	fd = PathNameOpenFile(path.str, umfile_open_flags() | oflags);
+	if (fd < 0)
+		return NULL;
+
+	if (segno >= (BlockNumber) old_nseg)
+	{
+		umfile_fdvec_resize(ctx, forknum, segno + 1);
+		for (int i = old_nseg; i < ctx->num_open_segs[forknum]; i++)
+			umfile_seg_entry_reset(umfile_v_get(ctx, forknum, i));
+	}
+
+	seg = umfile_v_get(ctx, forknum, (int) segno);
+	seg->umfd_vfd = fd;
+	seg->umfd_segno = segno;
+
+	Assert(umfile_nblocks_in_seg(seg->umfd_vfd) <= (BlockNumber) RELSEG_SIZE);
+	return seg;
+}
+
+static UmfdVec *
+umfile_openfork(UmbraFileContext *ctx, RelFileLocatorBackend rlocator,
+				ForkNumber forknum, int behavior)
+{
+	RelPathStr	path;
+	File		fd;
+	UmfdVec    *seg;
+
+	Assert(ctx != NULL);
+
+	if (ctx->num_open_segs[forknum] > 0)
+	{
+		seg = umfile_v_get(ctx, forknum, 0);
+		if (umfile_seg_entry_is_open(seg))
+			return seg;
+	}
+
+	path = umfile_segpath(rlocator, forknum, 0);
+	fd = PathNameOpenFile(path.str, umfile_open_flags());
+	if (fd < 0)
+	{
+		if ((behavior & UM_EXTENSION_RETURN_NULL) &&
+			FILE_POSSIBLY_DELETED(errno))
+			return NULL;
+
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m", path.str)));
+	}
+
+	if (ctx->num_open_segs[forknum] == 0)
+		umfile_fdvec_resize(ctx, forknum, 1);
+	seg = umfile_v_get(ctx, forknum, 0);
+	seg->umfd_vfd = fd;
+	seg->umfd_segno = 0;
+
+	Assert(umfile_nblocks_in_seg(seg->umfd_vfd) <= (BlockNumber) RELSEG_SIZE);
+	return seg;
+}
+
+static UmfdVec *
+umfile_getseg(UmbraFileContext *ctx, RelFileLocatorBackend rlocator,
+			  ForkNumber forknum, BlockNumber blkno,
+			  bool skipFsync, int behavior)
+{
+	UmfdVec    *seg;
+	BlockNumber	targetseg;
+	BlockNumber	nextsegno;
+
+	Assert(ctx != NULL);
+	Assert(behavior &
+		   (UM_EXTENSION_FAIL | UM_EXTENSION_CREATE |
+			UM_EXTENSION_RETURN_NULL | UM_EXTENSION_DONT_OPEN));
+
+	targetseg = blkno / ((BlockNumber) RELSEG_SIZE);
+
+	if (targetseg < (BlockNumber) ctx->num_open_segs[forknum])
+	{
+		seg = umfile_v_get(ctx, forknum, (int) targetseg);
+		if (umfile_seg_entry_is_open(seg))
+			return seg;
+	}
+
+	if (behavior & UM_EXTENSION_DONT_OPEN)
+		return NULL;
+
+	if (ctx->num_open_segs[forknum] > 0)
+		seg = umfile_v_get(ctx, forknum, ctx->num_open_segs[forknum] - 1);
+	else
+	{
+		seg = umfile_openfork(ctx, rlocator, forknum, behavior);
+		if (seg == NULL)
+			return NULL;
+	}
+
+	for (nextsegno = ctx->num_open_segs[forknum];
+		 nextsegno <= targetseg;
+		 nextsegno++)
+	{
+		BlockNumber	nblocks;
+		int			flags = 0;
+
+		Assert(nextsegno == seg->umfd_segno + 1);
+
+		nblocks = umfile_nblocks_in_seg(seg->umfd_vfd);
+		if (nblocks > (BlockNumber) RELSEG_SIZE)
+			elog(FATAL, "Umbra segment too big");
+
+		if ((behavior & UM_EXTENSION_CREATE) ||
+			(InRecovery && (behavior & UM_EXTENSION_CREATE_RECOVERY)))
+		{
+			if (nblocks < (BlockNumber) RELSEG_SIZE)
+			{
+				char	   *zerobuf;
+
+				zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE,
+										 MCXT_ALLOC_ZERO);
+				umfile_extend(ctx, forknum,
+							  nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
+							  zerobuf, skipFsync);
+				pfree(zerobuf);
+			}
+			flags = O_CREAT;
+		}
+		else if (nblocks < (BlockNumber) RELSEG_SIZE)
+		{
+			if (behavior & UM_EXTENSION_RETURN_NULL)
+			{
+				errno = ENOENT;
+				return NULL;
+			}
+
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\" (target block %u): previous segment is only %u blocks",
+							umfile_segpath(rlocator, forknum, nextsegno).str,
+							blkno, nblocks)));
+		}
+
+		seg = umfile_openseg(ctx, rlocator, forknum, nextsegno, flags);
+		if (seg == NULL)
+		{
+			if ((behavior & UM_EXTENSION_RETURN_NULL) &&
+				FILE_POSSIBLY_DELETED(errno))
+				return NULL;
+
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\" (target block %u): %m",
+							umfile_segpath(rlocator, forknum, nextsegno).str,
+							blkno)));
+		}
+	}
+
+	return seg;
+}
+
+static bool
+umfile_fork_has_open_segment(UmbraFileContext *ctx, ForkNumber forknum)
+{
+	Assert(ctx != NULL);
+
+	for (int i = 0; i < ctx->num_open_segs[forknum]; i++)
+	{
+		if (umfile_seg_entry_is_open(umfile_v_get(ctx, forknum, i)))
+			return true;
+	}
+
+	return false;
+}
+
+static bool
+umfile_fork_has_open_segment_on_disk(UmbraFileContext *ctx,
+									 RelFileLocatorBackend rlocator,
+									 ForkNumber forknum)
+{
+	bool		have_live = false;
+
+	Assert(ctx != NULL);
+
+	for (int i = 0; i < ctx->num_open_segs[forknum]; i++)
+	{
+		UmfdVec    *seg = umfile_v_get(ctx, forknum, i);
+		RelPathStr	path;
+
+		if (!umfile_seg_entry_is_open(seg))
+			continue;
+
+		path = umfile_segpath(rlocator, forknum, seg->umfd_segno);
+		if (access(path.str, F_OK) == 0)
+		{
+			have_live = true;
+			continue;
+		}
+
+		FileClose(seg->umfd_vfd);
+		umfile_seg_entry_reset(seg);
+	}
+
+	return have_live;
+}
+
+static inline bool
+umfile_seg_entry_is_open(const UmfdVec *seg)
+{
+	return (seg != NULL && seg->umfd_vfd >= 0);
+}
+
+static inline void
+umfile_seg_entry_reset(UmfdVec *seg)
+{
+	seg->umfd_vfd = -1;
+	seg->umfd_segno = InvalidBlockNumber;
+}
diff --git a/src/include/storage/um_defs.h b/src/include/storage/um_defs.h
new file mode 100644
index 0000000000..3b567a397e
--- /dev/null
+++ b/src/include/storage/um_defs.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * um_defs.h
+ *	  Umbra low-level fork and metadata path definitions.
+ *
+ * This header contains storage-layout facts shared by Umbra submodules.
+ *
+ * src/include/storage/um_defs.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UM_DEFS_H
+#define UM_DEFS_H
+
+#include <stdio.h>
+
+#include "common/relpath.h"
+#include "storage/relfilelocator.h"
+
+/*
+ * Umbra reserves an extra fork slot for relation-local metadata.  This lives
+ * outside PostgreSQL's built-in fork numbering so ordinary smgr loops do not
+ * try to process it implicitly.
+ */
+#define UMBRA_METADATA_FORKNUM	((ForkNumber) (INIT_FORKNUM + 1))
+#define UMBRA_FORK_SLOTS		(UMBRA_METADATA_FORKNUM + 1)
+
+static inline RelPathStr
+UmMetadataRelPathBackend(RelFileLocatorBackend rlocator)
+{
+	RelPathStr	base;
+	RelPathStr	path;
+
+	base = relpath(rlocator, MAIN_FORKNUM);
+	snprintf(path.str, sizeof(path.str), "%s_map", base.str);
+	return path;
+}
+
+static inline RelPathStr
+UmMetadataRelPathPerm(RelFileLocator rlocator)
+{
+	RelFileLocatorBackend backend_rlocator;
+
+	backend_rlocator.locator = rlocator;
+	backend_rlocator.backend = INVALID_PROC_NUMBER;
+	return UmMetadataRelPathBackend(backend_rlocator);
+}
+
+#endif							/* UM_DEFS_H */
diff --git a/src/include/storage/umbra.h b/src/include/storage/umbra.h
index 9a2873f96d..30e033fcf0 100644
--- a/src/include/storage/umbra.h
+++ b/src/include/storage/umbra.h
@@ -17,6 +17,18 @@
 #include "storage/block.h"
 #include "storage/relfilelocator.h"
 #include "storage/smgr.h"
+#include "storage/um_defs.h"
+
+extern bool UmMetadataExists(SMgrRelation reln);
+extern bool UmMetadataOpenOrCreate(SMgrRelation reln, bool isRedo, bool *created);
+extern BlockNumber UmMetadataNblocks(SMgrRelation reln);
+extern void UmMetadataRead(SMgrRelation reln, BlockNumber blkno, void *buffer);
+extern void UmMetadataWrite(SMgrRelation reln, BlockNumber blkno,
+							const void *buffer, bool skipFsync);
+extern void UmMetadataExtend(SMgrRelation reln, BlockNumber blkno,
+							 const void *buffer, bool skipFsync);
+extern void UmMetadataImmediateSync(SMgrRelation reln);
+extern void UmMetadataUnlink(RelFileLocatorBackend rlocator, bool isRedo);

 extern void uminit(void);
 extern void umopen(SMgrRelation reln);
diff --git a/src/include/storage/umfile.h b/src/include/storage/umfile.h
new file mode 100644
index 0000000000..56936aa697
--- /dev/null
+++ b/src/include/storage/umfile.h
@@ -0,0 +1,101 @@
+/*-------------------------------------------------------------------------
+ *
+ * umfile.h
+ *	  Umbra backend-local file/context helpers.
+ *
+ * This layer owns backend-local file contexts keyed by RelFileLocatorBackend.
+ * It is the low-level file access boundary beneath Umbra metadata and mapping
+ * code.
+ *
+ * src/include/storage/umfile.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UMFILE_H
+#define UMFILE_H
+
+#include "storage/aio_types.h"
+#include "storage/block.h"
+#include "storage/fd.h"
+#include "storage/relfilelocator.h"
+#include "storage/um_defs.h"
+
+typedef struct UmbraFileContext UmbraFileContext;
+
+typedef enum UmFileNblocksMode
+{
+	UMFILE_NBLOCKS_DENSE,
+	UMFILE_NBLOCKS_SPARSE
+} UmFileNblocksMode;
+
+typedef enum UmFileExistsMode
+{
+	UMFILE_EXISTS_DENSE,
+	UMFILE_EXISTS_SPARSE
+} UmFileExistsMode;
+
+extern void umfile_init(void);
+
+extern UmbraFileContext *umfile_ctx_lookup(RelFileLocatorBackend rlocator);
+extern UmbraFileContext *umfile_ctx_acquire(RelFileLocatorBackend rlocator);
+extern UmbraFileContext *umfile_ctx_create_temporary(RelFileLocatorBackend rlocator);
+extern void umfile_ctx_destroy_temporary(UmbraFileContext *ctx);
+extern void umfile_ctx_release(RelFileLocatorBackend rlocator);
+extern void umfile_ctx_forget(RelFileLocatorBackend rlocator);
+extern void umfile_ctx_close_fork(UmbraFileContext *ctx, ForkNumber forknum);
+
+extern bool umfile_ctx_fork_exists(UmbraFileContext *ctx, ForkNumber forknum,
+								   UmFileExistsMode mode);
+extern BlockNumber umfile_ctx_get_nblocks(UmbraFileContext *ctx,
+										  ForkNumber forknum,
+										  UmFileNblocksMode mode);
+extern void umfile_ctx_read(UmbraFileContext *ctx, ForkNumber forknum,
+							BlockNumber blkno, char *buffer, int nbytes);
+extern void umfile_ctx_write(UmbraFileContext *ctx, ForkNumber forknum,
+							 BlockNumber blkno, const char *buffer,
+							 int nbytes, bool skipFsync);
+extern void umfile_ctx_extend(UmbraFileContext *ctx, ForkNumber forknum,
+							  BlockNumber blkno, const char *buffer);
+extern void umfile_ctx_unlinkfork(RelFileLocatorBackend rlocator,
+								  ForkNumber forknum, bool isRedo);
+
+extern bool umfile_exists(UmbraFileContext *ctx, ForkNumber forknum,
+						  UmFileExistsMode mode);
+extern bool umfile_open_or_create(UmbraFileContext *ctx, ForkNumber forknum,
+								  bool isRedo, bool *created);
+extern BlockNumber umfile_nblocks(UmbraFileContext *ctx, ForkNumber forknum,
+								  UmFileNblocksMode mode);
+extern void umfile_readv(UmbraFileContext *ctx, ForkNumber forknum,
+						 BlockNumber blocknum, void **buffers,
+						 BlockNumber nblocks);
+extern void umfile_writev(UmbraFileContext *ctx, ForkNumber forknum,
+						  BlockNumber blocknum, const void **buffers,
+						  BlockNumber nblocks, bool skipFsync);
+extern void umfile_extend(UmbraFileContext *ctx, ForkNumber forknum,
+						  BlockNumber blocknum, const void *buffer,
+						  bool skipFsync);
+extern void umfile_zeroextend(UmbraFileContext *ctx, ForkNumber forknum,
+							  BlockNumber blocknum, int nblocks,
+							  bool skipFsync);
+extern void umfile_truncate(UmbraFileContext *ctx, ForkNumber forknum,
+							BlockNumber old_blocks, BlockNumber nblocks);
+extern void umfile_immedsync(UmbraFileContext *ctx, ForkNumber forknum);
+extern void umfile_registersync(UmbraFileContext *ctx, ForkNumber forknum);
+extern void umfile_unlink(RelFileLocatorBackend rlocator, ForkNumber forknum,
+						  bool isRedo);
+
+/* Metadata-only convenience wrappers over the generic umfile surface. */
+extern bool umfile_metadata_exists(UmbraFileContext *ctx);
+extern bool umfile_metadata_open_or_create(UmbraFileContext *ctx,
+										   bool isRedo, bool *created);
+extern BlockNumber umfile_metadata_nblocks(UmbraFileContext *ctx);
+extern void umfile_metadata_read(UmbraFileContext *ctx, BlockNumber blkno,
+								 void *buffer);
+extern void umfile_metadata_write(UmbraFileContext *ctx, BlockNumber blkno,
+								  const void *buffer);
+extern void umfile_metadata_extend(UmbraFileContext *ctx, BlockNumber blkno,
+								   const void *buffer);
+extern void umfile_metadata_immedsync(UmbraFileContext *ctx);
+extern void umfile_metadata_unlink(RelFileLocatorBackend rlocator, bool isRedo);
+
+#endif							/* UMFILE_H */
-- 
2.50.1 (Apple Git-155)
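
Note on the segment arithmetic in umfile_getseg() above: the following is a minimal standalone sketch, not code from the patch. It assumes the default build where RELSEG_SIZE is 131072 blocks (1 GB segments of 8 kB blocks). The sk_* names are hypothetical and exist only for illustration. sk_block_in_seg() follows the usual md.c convention and is not shown in the hunk itself.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed default: 1 GB segments of 8 kB blocks. */
#define SK_RELSEG_SIZE 131072u

/* Segment that holds a block, as in "targetseg = blkno / RELSEG_SIZE". */
uint32_t
sk_target_seg(uint32_t blkno)
{
	return blkno / SK_RELSEG_SIZE;
}

/* Block offset inside its segment (md.c convention, assumed here). */
uint32_t
sk_block_in_seg(uint32_t blkno)
{
	return blkno % SK_RELSEG_SIZE;
}

/*
 * Last block of the segment preceding nextsegno.  The open loop extends
 * to this block with a zero page when the previous segment is short, so
 * that every segment before the target one is full.
 */
uint32_t
sk_fill_block(uint32_t nextsegno)
{
	assert(nextsegno > 0);
	return nextsegno * SK_RELSEG_SIZE - 1;
}
```

Under these assumptions, block 131071 is the last block of segment 0 and block 131072 is the first block of segment 1, which is why the loop checks that the previous segment holds exactly RELSEG_SIZE blocks before opening the next one.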

Mingwei Jia

i@nayishan.top

22 days ago

In reply to: Mingwei Jia (#4)

[RFC PATCH v2 RESEND 04/10] umbra: add patch 3 metadata disk format and identity mapping bootstrap

---
src/backend/catalog/storage.c | 3 +
src/backend/storage/Makefile | 5 +
src/backend/storage/buffer/bufmgr.c | 2 +
src/backend/storage/map/Makefile | 19 ++
src/backend/storage/map/map.c | 162 +++++++++++++
src/backend/storage/map/mapsuper.c | 338 ++++++++++++++++++++++++++++
src/backend/storage/map/meson.build | 6 +
src/backend/storage/meson.build | 3 +
src/backend/storage/smgr/smgr.c | 47 ++++
src/backend/storage/smgr/umbra.c | 158 ++++++++++++-
src/include/storage/map.h | 53 +++++
src/include/storage/mapsuper.h | 100 ++++++++
src/include/storage/smgr.h | 6 +
src/include/storage/umbra.h | 7 +
14 files changed, 906 insertions(+), 3 deletions(-)
create mode 100644 src/backend/storage/map/Makefile
create mode 100644 src/backend/storage/map/map.c
create mode 100644 src/backend/storage/map/mapsuper.c
create mode 100644 src/backend/storage/map/meson.build
create mode 100644 src/include/storage/map.h
create mode 100644 src/include/storage/mapsuper.h
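
Note on the metadata-fork layout used by map.c in this patch: the sketch below illustrates the group arithmetic in MapForkPageIndexToMapBlkno() and MapDecodeMapBlkno(). It is not taken from the patch. The constants are hypothetical stand-ins (the real values are in map.h), and it assumes block 0 holds the superblock and that each group has exactly one FSM map page and one VM map page, which is what makes the round trip exact. All sk_* names are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout constants; the real ones are defined in map.h. */
#define SK_FIRST_GROUP	1u	/* assumes block 0 is the superblock */
#define SK_FSM_PAGES	1u
#define SK_VM_PAGES		1u
#define SK_MAIN_PAGES	8u
#define SK_TOTAL_PAGES	(SK_FSM_PAGES + SK_VM_PAGES + SK_MAIN_PAGES)

typedef enum { SK_MAIN, SK_FSM, SK_VM } SkFork;

/* Encode (fork, per-fork map page index) as a metadata-fork block number. */
uint64_t
sk_encode(SkFork fork, uint32_t idx)
{
	uint64_t	group;

	switch (fork)
	{
		case SK_FSM:			/* one FSM map page per group */
			return SK_FIRST_GROUP + (uint64_t) idx * SK_TOTAL_PAGES;
		case SK_VM:				/* one VM map page per group, after FSM */
			return SK_FIRST_GROUP + (uint64_t) idx * SK_TOTAL_PAGES +
				SK_FSM_PAGES;
		case SK_MAIN:			/* main pages fill the rest of the group */
			group = idx / SK_MAIN_PAGES;
			return SK_FIRST_GROUP + group * SK_TOTAL_PAGES +
				SK_FSM_PAGES + SK_VM_PAGES + idx % SK_MAIN_PAGES;
	}
	return 0;					/* not reached for valid forks */
}

/* Decode a metadata-fork block number; false for the superblock area. */
bool
sk_decode(uint64_t blk, SkFork *fork, uint32_t *idx)
{
	uint64_t	off, group, in_group;

	if (blk < SK_FIRST_GROUP)
		return false;
	off = blk - SK_FIRST_GROUP;
	group = off / SK_TOTAL_PAGES;
	in_group = off % SK_TOTAL_PAGES;

	if (in_group < SK_FSM_PAGES)
	{
		*fork = SK_FSM;
		*idx = (uint32_t) group;
		return true;
	}
	in_group -= SK_FSM_PAGES;
	if (in_group < SK_VM_PAGES)
	{
		*fork = SK_VM;
		*idx = (uint32_t) group;
		return true;
	}
	in_group -= SK_VM_PAGES;
	*fork = SK_MAIN;
	*idx = (uint32_t) (group * SK_MAIN_PAGES + in_group);
	return true;
}

/* True if decoding the encoded block returns the original pair. */
bool
sk_roundtrip_ok(SkFork fork, uint32_t idx)
{
	SkFork		f;
	uint32_t	i;

	return sk_decode(sk_encode(fork, idx), &f, &i) && f == fork && i == idx;
}
```

With these stand-in values, group 0 occupies blocks 1 through 10 (FSM, VM, then eight main map pages), and main map page 8 lands at block 13 in group 1. If the real map.h allows more than one FSM or VM page per group, the decode side as written in the patch would map several blocks to the same page index, so this sketch would no longer round-trip.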

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index e443a4993c..6b69329a52 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -150,6 +150,8 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
 	srel = smgropen(rlocator, procNumber);
 	smgrcreate(srel, MAIN_FORKNUM, false);

+	if (needs_wal)
+		smgrcreaterelationmetadata(srel);
 	if (needs_wal)
 		log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);

@@ -1014,6 +1016,7 @@ smgr_redo(XLogReaderState *record)
 		 * log as best we can until the drop is seen.
 		 */
 		smgrcreate(reln, MAIN_FORKNUM, true);
+		smgrcreaterelationmetadata(reln);

 		/*
 		 * Before we perform the truncation, update minimum recovery point to
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 2afb42ca96..b07ba46dbb 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -20,4 +20,9 @@ SUBDIRS = \
 	smgr \
 	sync

+ifeq ($(with_umbra), yes)
+SUBDIRS += \
+	map
+endif
+
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3cc0b0bdd9..540f346d53 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5505,6 +5505,8 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 										   permanent);
 		}
 	}
+
+	smgrcopyrelationmetadata(src_rel, dst_rel, relpersistence);
 }

 /* ---------------------------------------------------------------------
diff --git a/src/backend/storage/map/Makefile b/src/backend/storage/map/Makefile
new file mode 100644
index 0000000000..ee9603de14
--- /dev/null
+++ b/src/backend/storage/map/Makefile
@@ -0,0 +1,19 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for storage/map (Umbra mapping subsystem)
+#
+# IDENTIFICATION
+#    src/backend/storage/map/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/storage/map
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	map.o \
+	mapsuper.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/map/map.c b/src/backend/storage/map/map.c
new file mode 100644
index 0000000000..563f38b21a
--- /dev/null
+++ b/src/backend/storage/map/map.c
@@ -0,0 +1,162 @@
+/*-------------------------------------------------------------------------
+ *
+ * map.c
+ *	  Umbra metadata-fork disk layout helpers.
+ *
+ * This file contains address-translation and in-page access routines for the
+ * metadata fork disk layout.
+ *
+ * src/backend/storage/map/map.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "storage/map.h"
+#include "storage/um_defs.h"
+
+void
+MapPageInit(MapPage *page)
+{
+	Assert(page != NULL);
+
+	MemSet(page->pblknos, 0xFF, sizeof(page->pblknos));
+}
+
+BlockNumber
+MapPageGetEntry(const MapPage *page, int entry_idx)
+{
+	Assert(page != NULL);
+
+	if (entry_idx < 0 || entry_idx >= MAP_ENTRIES_PER_PAGE)
+		elog(ERROR, "map entry index %d is out of range", entry_idx);
+
+	return page->pblknos[entry_idx];
+}
+
+void
+MapPageSetEntry(MapPage *page, int entry_idx, BlockNumber pblkno)
+{
+	Assert(page != NULL);
+
+	if (entry_idx < 0 || entry_idx >= MAP_ENTRIES_PER_PAGE)
+		elog(ERROR, "map entry index %d is out of range", entry_idx);
+
+	page->pblknos[entry_idx] = pblkno;
+}
+
+BlockNumber
+MapForkPageIndexToMapBlkno(ForkNumber forknum, BlockNumber fork_page_idx)
+{
+	uint64		group_no;
+	uint64		blkno64;
+
+	if (forknum == UMBRA_METADATA_FORKNUM)
+		elog(ERROR, "Umbra metadata fork cannot be addressed as a map target");
+
+	switch (forknum)
+	{
+		case FSM_FORKNUM:
+			group_no = (uint64) fork_page_idx;
+			blkno64 = (uint64) MAP_BLOCK_FIRST_GROUP +
+				group_no * (uint64) MAP_GROUP_TOTAL_PAGES;
+			break;
+
+		case VISIBILITYMAP_FORKNUM:
+			group_no = (uint64) fork_page_idx;
+			blkno64 = (uint64) MAP_BLOCK_FIRST_GROUP +
+				group_no * (uint64) MAP_GROUP_TOTAL_PAGES +
+				(uint64) MAP_GROUP_FSM_PAGES;
+			break;
+
+		case MAIN_FORKNUM:
+		{
+			uint64		group_page_idx = (uint64) fork_page_idx;
+
+			group_no = group_page_idx / (uint64) MAP_GROUP_MAIN_PAGES;
+			blkno64 = (uint64) MAP_BLOCK_FIRST_GROUP +
+				group_no * (uint64) MAP_GROUP_TOTAL_PAGES +
+				(uint64) MAP_GROUP_FSM_PAGES +
+				(uint64) MAP_GROUP_VM_PAGES +
+				(group_page_idx % (uint64) MAP_GROUP_MAIN_PAGES);
+			break;
+		}
+
+		default:
+			elog(ERROR, "unsupported fork number %d in map layout", (int) forknum);
+			pg_unreachable();
+	}
+
+	if (blkno64 > (uint64) MaxBlockNumber)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("cannot address map page %u for fork %d",
+						fork_page_idx, forknum)));
+
+	return (BlockNumber) blkno64;
+}
+
+BlockNumber
+MapLblknoToMapBlkno(ForkNumber forknum, BlockNumber lblkno)
+{
+	BlockNumber	fork_page_idx;
+	uint64		entry64;
+
+	fork_page_idx = lblkno / MAP_ENTRIES_PER_PAGE;
+	entry64 = (uint64) MapForkPageIndexToMapBlkno(forknum, fork_page_idx) *
+		(uint64) MAP_ENTRIES_PER_PAGE +
+		(uint64) (lblkno % MAP_ENTRIES_PER_PAGE);
+
+	if (entry64 > (uint64) MaxBlockNumber)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("cannot address logical block %u for fork %d in map",
+						lblkno, forknum)));
+
+	return (BlockNumber) entry64;
+}
+
+bool
+MapDecodeMapBlkno(BlockNumber map_blkno, ForkNumber *forknum,
+				  BlockNumber *fork_page_idx)
+{
+	uint64		offset;
+	uint64		group_no;
+	uint64		in_group;
+
+	Assert(forknum != NULL);
+	Assert(fork_page_idx != NULL);
+
+	if (map_blkno == MAP_BLOCK_SUPER || map_blkno < MAP_BLOCK_FIRST_GROUP)
+		return false;
+
+	offset = (uint64) (map_blkno - MAP_BLOCK_FIRST_GROUP);
+	group_no = offset / (uint64) MAP_GROUP_TOTAL_PAGES;
+	in_group = offset % (uint64) MAP_GROUP_TOTAL_PAGES;
+
+	if (in_group < (uint64) MAP_GROUP_FSM_PAGES)
+	{
+		*forknum = FSM_FORKNUM;
+		*fork_page_idx = (BlockNumber) group_no;
+		return true;
+	}
+
+	in_group -= (uint64) MAP_GROUP_FSM_PAGES;
+	if (in_group < (uint64) MAP_GROUP_VM_PAGES)
+	{
+		*forknum = VISIBILITYMAP_FORKNUM;
+		*fork_page_idx = (BlockNumber) group_no;
+		return true;
+	}
+
+	in_group -= (uint64) MAP_GROUP_VM_PAGES;
+	if (in_group < (uint64) MAP_GROUP_MAIN_PAGES)
+	{
+		*forknum = MAIN_FORKNUM;
+		*fork_page_idx = (BlockNumber)
+			(group_no * (uint64) MAP_GROUP_MAIN_PAGES + in_group);
+		return true;
+	}
+
+	return false;
+}
diff --git a/src/backend/storage/map/mapsuper.c b/src/backend/storage/map/mapsuper.c
new file mode 100644
index 0000000000..b376d513fd
--- /dev/null
+++ b/src/backend/storage/map/mapsuper.c
@@ -0,0 +1,338 @@
+/*-------------------------------------------------------------------------
+ *
+ * mapsuper.c
+ *	  Umbra metadata superblock helpers.
+ *
+ * This file contains on-disk superblock encoding and direct metadata-file I/O
+ * helpers.
+ *
+ * src/backend/storage/map/mapsuper.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "storage/map.h"
+#include "storage/mapsuper.h"
+#include "storage/umbra.h"
+
+static void MapSBlockReportCorrupt(SMgrRelation reln, const char *reason);
+
+void
+MapSuperblockRefreshCRC(MapSuperblock *super)
+{
+	pg_crc32c	crc;
+
+	Assert(super != NULL);
+
+	INIT_CRC32C(crc);
+	COMP_CRC32C(crc, &super->data, offsetof(MapSuperblockData, crc));
+	FIN_CRC32C(crc);
+	super->data.crc = crc;
+}
+
+bool
+MapSuperblockCheckCRC(const MapSuperblock *super)
+{
+	pg_crc32c	crc;
+
+	Assert(super != NULL);
+
+	INIT_CRC32C(crc);
+	COMP_CRC32C(crc, &super->data, offsetof(MapSuperblockData, crc));
+	FIN_CRC32C(crc);
+
+	return crc == super->data.crc;
+}
+
+void
+MapSuperblockInit(MapSuperblock *super, uint32 flags)
+{
+	Assert(super != NULL);
+
+	MemSet(super, 0, sizeof(*super));
+
+	super->data.magic = MAP_SUPERBLOCK_MAGIC;
+	super->data.version = MAP_SUPERBLOCK_VERSION;
+	super->data.blcksz = BLCKSZ;
+	super->data.flags = flags;
+	super->data.next_free_phys_block_fsm = InvalidBlockNumber;
+	super->data.phys_capacity_fsm = InvalidBlockNumber;
+	super->data.next_free_phys_block_vm = InvalidBlockNumber;
+	super->data.phys_capacity_vm = InvalidBlockNumber;
+	super->data.logical_nblocks_fsm = InvalidBlockNumber;
+	super->data.logical_nblocks_vm = InvalidBlockNumber;
+	super->data.last_updated_lsn = InvalidXLogRecPtr;
+	super->data.crc = 0;
+}
+
+bool
+MapSuperblockHasValidIdentity(const MapSuperblock *super)
+{
+	Assert(super != NULL);
+
+	if (super->data.magic != MAP_SUPERBLOCK_MAGIC)
+		return false;
+	if (super->data.version != MAP_SUPERBLOCK_VERSION)
+		return false;
+	if (super->data.blcksz != BLCKSZ)
+		return false;
+
+	return true;
+}
+
+bool
+MapSuperblockIsValid(const MapSuperblock *super)
+{
+	Assert(super != NULL);
+
+	if (!MapSuperblockHasValidIdentity(super))
+		return false;
+
+	return MapSuperblockCheckCRC(super);
+}
+
+void
+MapSuperblockSetFlags(MapSuperblock *super, uint32 flags)
+{
+	Assert(super != NULL);
+
+	super->data.flags = flags;
+}
+
+uint32
+MapSuperblockGetFlags(const MapSuperblock *super)
+{
+	Assert(super != NULL);
+
+	return super->data.flags;
+}
+
+void
+MapSuperblockSetLastUpdatedLSN(MapSuperblock *super, XLogRecPtr lsn)
+{
+	Assert(super != NULL);
+
+	super->data.last_updated_lsn = lsn;
+}
+
+XLogRecPtr
+MapSuperblockGetLastUpdatedLSN(const MapSuperblock *super)
+{
+	Assert(super != NULL);
+
+	return super->data.last_updated_lsn;
+}
+
+BlockNumber
+MapSuperblockGetNextFreePhysBlock(const MapSuperblock *super, ForkNumber forknum)
+{
+	Assert(super != NULL);
+
+	switch (forknum)
+	{
+		case MAIN_FORKNUM:
+			return super->data.next_free_phys_block_main;
+		case FSM_FORKNUM:
+			return super->data.next_free_phys_block_fsm;
+		case VISIBILITYMAP_FORKNUM:
+			return super->data.next_free_phys_block_vm;
+		default:
+			elog(ERROR, "unsupported fork number for superblock: %d", forknum);
+	}
+
+	pg_unreachable();
+}
+
+void
+MapSuperblockSetNextFreePhysBlock(MapSuperblock *super, ForkNumber forknum,
+								  BlockNumber blkno)
+{
+	Assert(super != NULL);
+
+	switch (forknum)
+	{
+		case MAIN_FORKNUM:
+			super->data.next_free_phys_block_main = blkno;
+			break;
+		case FSM_FORKNUM:
+			super->data.next_free_phys_block_fsm = blkno;
+			break;
+		case VISIBILITYMAP_FORKNUM:
+			super->data.next_free_phys_block_vm = blkno;
+			break;
+		default:
+			elog(ERROR, "unsupported fork number for superblock: %d", forknum);
+	}
+}
+
+BlockNumber
+MapSuperblockGetPhysCapacity(const MapSuperblock *super, ForkNumber forknum)
+{
+	Assert(super != NULL);
+
+	switch (forknum)
+	{
+		case MAIN_FORKNUM:
+			return super->data.phys_capacity_main;
+		case FSM_FORKNUM:
+			return super->data.phys_capacity_fsm;
+		case VISIBILITYMAP_FORKNUM:
+			return super->data.phys_capacity_vm;
+		default:
+			elog(ERROR, "unsupported fork number for superblock: %d", forknum);
+	}
+
+	pg_unreachable();
+}
+
+void
+MapSuperblockSetPhysCapacity(MapSuperblock *super, ForkNumber forknum,
+							 BlockNumber blkno)
+{
+	Assert(super != NULL);
+
+	switch (forknum)
+	{
+		case MAIN_FORKNUM:
+			super->data.phys_capacity_main = blkno;
+			break;
+		case FSM_FORKNUM:
+			super->data.phys_capacity_fsm = blkno;
+			break;
+		case VISIBILITYMAP_FORKNUM:
+			super->data.phys_capacity_vm = blkno;
+			break;
+		default:
+			elog(ERROR, "unsupported fork number for superblock: %d", forknum);
+	}
+}
+
+BlockNumber
+MapSuperblockGetLogicalNblocks(const MapSuperblock *super, ForkNumber forknum)
+{
+	Assert(super != NULL);
+
+	switch (forknum)
+	{
+		case MAIN_FORKNUM:
+			return super->data.logical_nblocks_main;
+		case FSM_FORKNUM:
+			return super->data.logical_nblocks_fsm;
+		case VISIBILITYMAP_FORKNUM:
+			return super->data.logical_nblocks_vm;
+		default:
+			elog(ERROR, "unsupported fork number for superblock: %d", forknum);
+	}
+
+	pg_unreachable();
+}
+
+void
+MapSuperblockSetLogicalNblocks(MapSuperblock *super, ForkNumber forknum,
+							   BlockNumber nblocks)
+{
+	Assert(super != NULL);
+
+	switch (forknum)
+	{
+		case MAIN_FORKNUM:
+			super->data.logical_nblocks_main = nblocks;
+			break;
+		case FSM_FORKNUM:
+			super->data.logical_nblocks_fsm = nblocks;
+			break;
+		case VISIBILITYMAP_FORKNUM:
+			super->data.logical_nblocks_vm = nblocks;
+			break;
+		default:
+			elog(ERROR, "unsupported fork number for superblock: %d", forknum);
+	}
+}
+
+void
+MapSuperblockPackPage(const MapSuperblock *super, char page[BLCKSZ])
+{
+	Assert(super != NULL);
+	Assert(page != NULL);
+
+	MemSet(page, 0, BLCKSZ);
+	memcpy(page, super->padding, MAP_SUPERBLOCK_SIZE);
+}
+
+void
+MapSuperblockUnpackPage(MapSuperblock *super, const char page[BLCKSZ])
+{
+	Assert(super != NULL);
+	Assert(page != NULL);
+
+	memcpy(super->padding, page, MAP_SUPERBLOCK_SIZE);
+}
+
+bool
+MapSBlockRead(SMgrRelation reln, MapSuperblock *super)
+{
+	char		page[BLCKSZ];
+
+	Assert(reln != NULL);
+	Assert(super != NULL);
+
+	if (!UmMetadataExists(reln))
+		return false;
+
+	if (UmMetadataNblocks(reln) == 0)
+		return false;
+
+	UmMetadataRead(reln, MAP_BLOCK_SUPER, page);
+	MapSuperblockUnpackPage(super, page);
+
+	if (!MapSuperblockHasValidIdentity(super))
+		MapSBlockReportCorrupt(reln, "invalid identity");
+	if (!MapSuperblockCheckCRC(super))
+		MapSBlockReportCorrupt(reln, "CRC mismatch");
+
+	return true;
+}
+
+void
+MapSBlockWrite(SMgrRelation reln, const MapSuperblock *super, bool skipFsync)
+{
+	MapSuperblock write_super;
+	char		page[BLCKSZ];
+
+	Assert(reln != NULL);
+	Assert(super != NULL);
+
+	write_super = *super;
+	MapSuperblockRefreshCRC(&write_super);
+	MapSuperblockPackPage(&write_super, page);
+
+	if (!UmMetadataOpenOrCreate(reln, false, NULL))
+		elog(ERROR, "could not open Umbra metadata file for superblock write");
+
+	if (UmMetadataNblocks(reln) == 0)
+		UmMetadataExtend(reln, MAP_BLOCK_SUPER, page, skipFsync);
+	else
+		UmMetadataWrite(reln, MAP_BLOCK_SUPER, page, skipFsync);
+}
+
+void
+MapSBlockInitNew(SMgrRelation reln, uint32 flags, XLogRecPtr lsn, bool skipFsync)
+{
+	MapSuperblock super;
+
+	MapSuperblockInit(&super, flags);
+	MapSuperblockSetLastUpdatedLSN(&super, lsn);
+	MapSBlockWrite(reln, &super, skipFsync);
+}
+
+static void
+MapSBlockReportCorrupt(SMgrRelation reln, const char *reason)
+{
+	RelFileLocator rlocator = reln->smgr_rlocator.locator;
+
+	ereport(ERROR,
+			(errcode(ERRCODE_DATA_CORRUPTED),
+			 errmsg("Umbra metadata superblock is corrupted for relation %u/%u/%u: %s",
+					rlocator.spcOid, rlocator.dbOid, rlocator.relNumber, reason)));
+}
diff --git a/src/backend/storage/map/meson.build b/src/backend/storage/map/meson.build
new file mode 100644
index 0000000000..0f780fe522
--- /dev/null
+++ b/src/backend/storage/map/meson.build
@@ -0,0 +1,6 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'map.c',
+  'mapsuper.c',
+)
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 05637aa3a4..2f80f3f575 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -7,6 +7,9 @@ subdir('freespace')
 subdir('ipc')
 subdir('large_object')
 subdir('lmgr')
+if get_option('umbra').enabled()
+  subdir('map')
+endif
 subdir('page')
 subdir('smgr')
 subdir('sync')
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index a7b70d856c..c9a3ef6461 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -127,6 +127,13 @@ typedef struct f_smgr
 								  BlockNumber old_blocks, BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_create_relation_metadata) (SMgrRelation reln);
+	void		(*smgr_copy_relation_metadata) (SMgrRelation src,
+												SMgrRelation dst,
+												char relpersistence);
+	void		(*smgr_sync_relation_metadata) (SMgrRelation reln);
+	void		(*smgr_unlink_relation_metadata) (RelFileLocatorBackend rlocator,
+												  bool isRedo);
 	int			(*smgr_fd) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
 } f_smgr;

@@ -161,6 +168,10 @@ static const f_smgr smgrsw[] = {
 		.smgr_truncate = mdtruncate,
 		.smgr_immedsync = mdimmedsync,
 		.smgr_registersync = mdregistersync,
+		.smgr_create_relation_metadata = NULL,
+		.smgr_copy_relation_metadata = NULL,
+		.smgr_sync_relation_metadata = NULL,
+		.smgr_unlink_relation_metadata = NULL,
 		.smgr_fd = mdfd,
 	},
 #ifdef USE_UMBRA
@@ -186,6 +197,10 @@ static const f_smgr smgrsw[] = {
 		.smgr_truncate = umtruncate,
 		.smgr_immedsync = umimmedsync,
 		.smgr_registersync = umregistersync,
+		.smgr_create_relation_metadata = umcreaterelationmetadata,
+		.smgr_copy_relation_metadata = umcopyrelationmetadata,
+		.smgr_sync_relation_metadata = umsyncrelationmetadata,
+		.smgr_unlink_relation_metadata = umunlinkrelationmetadata,
 		.smgr_fd = umfd,
 	},
 #endif
@@ -529,6 +544,34 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 	RESUME_INTERRUPTS();
 }

+void
+smgrcreaterelationmetadata(SMgrRelation reln)
+{
+	if (smgrsw[reln->smgr_which].smgr_create_relation_metadata)
+		smgrsw[reln->smgr_which].smgr_create_relation_metadata(reln);
+}
+
+void
+smgrcopyrelationmetadata(SMgrRelation src, SMgrRelation dst, char relpersistence)
+{
+	if (smgrsw[dst->smgr_which].smgr_copy_relation_metadata)
+		smgrsw[dst->smgr_which].smgr_copy_relation_metadata(src, dst,
+															relpersistence);
+}
+
+void
+smgrsyncrelationmetadata(SMgrRelation reln)
+{
+	if (smgrsw[reln->smgr_which].smgr_sync_relation_metadata)
+		smgrsw[reln->smgr_which].smgr_sync_relation_metadata(reln);
+}
+
+void
+smgrunlinkrelationmetadata(RelFileLocatorBackend rlocator, bool isRedo)
+{
+	if (smgrsw[0].smgr_unlink_relation_metadata)
+		smgrsw[0].smgr_unlink_relation_metadata(rlocator, isRedo);
+}
 /*
  * smgrdosyncall() -- Immediately sync all forks of all given relations
  *
@@ -563,6 +606,8 @@ smgrdosyncall(SMgrRelation *rels, int nrels)
 			if (smgrsw[which].smgr_exists(rels[i], forknum))
 				smgrsw[which].smgr_immedsync(rels[i], forknum);
 		}
+
+		smgrsyncrelationmetadata(rels[i]);
 	}

 	RESUME_INTERRUPTS();
@@ -643,6 +688,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)

 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 			smgrsw[which].smgr_unlink(rlocators[i], forknum, isRedo);
+
+		smgrunlinkrelationmetadata(rlocators[i], isRedo);
 	}

 	pfree(rlocators);
diff --git a/src/backend/storage/smgr/umbra.c b/src/backend/storage/smgr/umbra.c
index 2c08231587..fc6e480276 100644
--- a/src/backend/storage/smgr/umbra.c
+++ b/src/backend/storage/smgr/umbra.c
@@ -4,8 +4,9 @@
  *	  Umbra storage manager skeleton.
  *
  * This file establishes Umbra as a separate smgr implementation from md.c.
- * Data-fork operations remain md-backed here, while relation-local metadata
- * file operations go through umfile.
+ * It maintains identity mapping state (logical block number == physical
+ * block number) in the relation-local metadata file, while using md.c for
+ * data-fork I/O and umfile for metadata-file I/O.
  *
  * src/backend/storage/smgr/umbra.c
  *
@@ -13,7 +14,9 @@
  */
 #include "postgres.h"

+#include "catalog/pg_class.h"
 #include "storage/md.h"
+#include "storage/mapsuper.h"
 #include "storage/smgr.h"
 #include "storage/umfile.h"
 #include "storage/umbra.h"
@@ -24,7 +27,11 @@ typedef struct UmbraSmgrRelationState
 	UmbraFileContext *filectx;
 } UmbraSmgrRelationState;

+static bool um_tracks_identity_metadata(ForkNumber forknum);
 static UmbraFileContext *um_relation_filectx(SMgrRelation reln);
+static void um_identity_update_metadata(SMgrRelation reln, ForkNumber forknum,
+										BlockNumber nblocks, bool fork_exists,
+										bool skipFsync);

 bool
 UmMetadataExists(SMgrRelation reln)
@@ -124,7 +131,7 @@ umdestroy(SMgrRelation reln)
 {
 	UmbraSmgrRelationState *state = reln->smgr_private;

-	umfile_ctx_forget(reln->smgr_rlocator);
+	umfile_ctx_release(reln->smgr_rlocator);

 	if (state != NULL)
 	{
@@ -133,15 +140,94 @@ umdestroy(SMgrRelation reln)
 	}
 }

+bool
+umisinternalfork(ForkNumber forknum)
+{
+	return forknum == UMBRA_METADATA_FORKNUM;
+}
+
+void
+umcreaterelationmetadata(SMgrRelation reln)
+{
+	bool		created = false;
+
+	if (!UmMetadataOpenOrCreate(reln, false, &created))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create Umbra metadata fork for relation %u/%u/%u",
+						reln->smgr_rlocator.locator.spcOid,
+						reln->smgr_rlocator.locator.dbOid,
+						reln->smgr_rlocator.locator.relNumber)));
+}
+
+void
+umcopyrelationmetadata(SMgrRelation src, SMgrRelation dst, char relpersistence)
+{
+	BlockNumber src_nblocks;
+	BlockNumber dst_nblocks;
+	PGIOAlignedBlock pagebuf;
+	bool		created = false;
+
+	if (relpersistence != RELPERSISTENCE_PERMANENT)
+		return;
+
+	if (!UmMetadataExists(src))
+		return;
+
+	if (!UmMetadataOpenOrCreate(dst, false, &created))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create Umbra metadata fork for relation %u/%u/%u",
+						dst->smgr_rlocator.locator.spcOid,
+						dst->smgr_rlocator.locator.dbOid,
+						dst->smgr_rlocator.locator.relNumber)));
+
+	src_nblocks = UmMetadataNblocks(src);
+	dst_nblocks = UmMetadataNblocks(dst);
+
+	for (BlockNumber blkno = 0; blkno < src_nblocks; blkno++)
+	{
+		UmMetadataRead(src, blkno, pagebuf.data);
+		if (blkno < dst_nblocks)
+			UmMetadataWrite(dst, blkno, pagebuf.data, true);
+		else
+			UmMetadataExtend(dst, blkno, pagebuf.data, true);
+	}
+
+	UmMetadataImmediateSync(dst);
+}
+
+void
+umsyncrelationmetadata(SMgrRelation reln)
+{
+	if (!UmMetadataExists(reln))
+		return;
+
+	UmMetadataImmediateSync(reln);
+}
+
+void
+umunlinkrelationmetadata(RelFileLocatorBackend rlocator, bool isRedo)
+{
+	umfile_ctx_forget(rlocator);
+	UmMetadataUnlink(rlocator, isRedo);
+}
+
 void
 umcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 {
 	mdcreate(reln, forknum, isRedo);
+
+	if (um_tracks_identity_metadata(forknum))
+		um_identity_update_metadata(reln, forknum, 0, true, true);
 }

 bool
 umexists(SMgrRelation reln, ForkNumber forknum)
 {
+	if (forknum == UMBRA_METADATA_FORKNUM)
+		return UmMetadataExists(reln);
+
 	return mdexists(reln, forknum);
 }

@@ -167,13 +253,30 @@ umextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 const void *buffer, bool skipFsync)
 {
 	mdextend(reln, forknum, blocknum, buffer, skipFsync);
+
+	if (um_tracks_identity_metadata(forknum))
+		um_identity_update_metadata(reln, forknum, blocknum + 1, true,
+									skipFsync);
 }

 void
 umzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			 int nblocks, bool skipFsync)
 {
+	BlockNumber	target_nblocks;
+
 	mdzeroextend(reln, forknum, blocknum, nblocks, skipFsync);
+
+	if (um_tracks_identity_metadata(forknum))
+	{
+		target_nblocks = blocknum + (BlockNumber) nblocks;
+		if (target_nblocks < blocknum)
+			ereport(ERROR,
+					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+					 errmsg("Umbra identity mapping block count overflow")));
+		um_identity_update_metadata(reln, forknum, target_nblocks, true,
+									skipFsync);
+	}
 }

 bool
@@ -220,6 +323,11 @@ umwriteback(SMgrRelation reln, ForkNumber forknum,
 BlockNumber
 umnblocks(SMgrRelation reln, ForkNumber forknum)
 {
+	/*
+	 * Keep md.c responsible for the physical fork size query. mdtruncate()
+	 * relies on a preceding mdnblocks() call to have opened all active
+	 * segments.
+	 */
 	return mdnblocks(reln, forknum);
 }

@@ -228,12 +336,18 @@ umtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber old_blocks, BlockNumber nblocks)
 {
 	mdtruncate(reln, forknum, old_blocks, nblocks);
+
+	if (um_tracks_identity_metadata(forknum))
+		um_identity_update_metadata(reln, forknum, nblocks, true, false);
 }

 void
 umimmedsync(SMgrRelation reln, ForkNumber forknum)
 {
 	mdimmedsync(reln, forknum);
+
+	if (um_tracks_identity_metadata(forknum) && UmMetadataExists(reln))
+		UmMetadataImmediateSync(reln);
 }

void
@@ -261,3 +375,41 @@ um_relation_filectx(SMgrRelation reln)

 	return state->filectx;
 }
+
+static bool
+um_tracks_identity_metadata(ForkNumber forknum)
+{
+	return forknum == MAIN_FORKNUM ||
+		forknum == FSM_FORKNUM ||
+		forknum == VISIBILITYMAP_FORKNUM;
+}
+
+static void
+um_identity_update_metadata(SMgrRelation reln, ForkNumber forknum,
+							BlockNumber nblocks, bool fork_exists,
+							bool skipFsync)
+{
+	MapSuperblock super;
+
+	Assert(reln != NULL);
+	Assert(um_tracks_identity_metadata(forknum));
+
+	if (!MapSBlockRead(reln, &super))
+		MapSuperblockInit(&super, 0);
+
+	if (!fork_exists && forknum != MAIN_FORKNUM)
+	{
+		MapSuperblockSetLogicalNblocks(&super, forknum, InvalidBlockNumber);
+		MapSuperblockSetNextFreePhysBlock(&super, forknum, InvalidBlockNumber);
+		MapSuperblockSetPhysCapacity(&super, forknum, InvalidBlockNumber);
+	}
+	else
+	{
+		MapSuperblockSetLogicalNblocks(&super, forknum, nblocks);
+		MapSuperblockSetNextFreePhysBlock(&super, forknum, nblocks);
+		MapSuperblockSetPhysCapacity(&super, forknum, nblocks);
+	}
+
+	MapSuperblockSetLastUpdatedLSN(&super, InvalidXLogRecPtr);
+	MapSBlockWrite(reln, &super, skipFsync);
+}
diff --git a/src/include/storage/map.h b/src/include/storage/map.h
new file mode 100644
index 0000000000..b0887794c3
--- /dev/null
+++ b/src/include/storage/map.h
@@ -0,0 +1,53 @@
+/*-------------------------------------------------------------------------
+ *
+ * map.h
+ *	  Umbra metadata-fork disk layout helpers.
+ *
+ * This header defines the stable on-disk page layout and address translation
+ * helpers for Umbra's relation-local metadata file.
+ *
+ * src/include/storage/map.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MAP_H
+#define MAP_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+#define MAP_ENTRIES_PER_PAGE (BLCKSZ / sizeof(uint32))
+
+/*
+ * Umbra metadata file page layout:
+ * - block 0: superblock payload
+ * - blocks 1..: repeated proportional groups
+ *
+ * Each group reserves one FSM map page, one VM map page, and 8192 MAIN map
+ * pages. That keeps the mapping formula stable while leaving room for the
+ * auxiliary forks to grow alongside MAIN.
+ */
+#define MAP_BLOCK_SUPER		0
+#define MAP_BLOCK_FIRST_GROUP	1
+#define MAP_GROUP_FSM_PAGES	1
+#define MAP_GROUP_VM_PAGES	1
+#define MAP_GROUP_MAIN_PAGES	8192
+#define MAP_GROUP_TOTAL_PAGES \
+	(MAP_GROUP_FSM_PAGES + MAP_GROUP_VM_PAGES + MAP_GROUP_MAIN_PAGES)
+
+typedef struct MapPage
+{
+	uint32		pblknos[MAP_ENTRIES_PER_PAGE];
+} MapPage;
+
+extern void MapPageInit(MapPage *page);
+extern BlockNumber MapPageGetEntry(const MapPage *page, int entry_idx);
+extern void MapPageSetEntry(MapPage *page, int entry_idx, BlockNumber pblkno);
+
+extern BlockNumber MapForkPageIndexToMapBlkno(ForkNumber forknum,
+											  BlockNumber fork_page_idx);
+extern BlockNumber MapLblknoToMapBlkno(ForkNumber forknum, BlockNumber lblkno);
+extern bool MapDecodeMapBlkno(BlockNumber map_blkno, ForkNumber *forknum,
+							  BlockNumber *fork_page_idx);
+
+#endif							/* MAP_H */
diff --git a/src/include/storage/mapsuper.h b/src/include/storage/mapsuper.h
new file mode 100644
index 0000000000..1f6a5dca5a
--- /dev/null
+++ b/src/include/storage/mapsuper.h
@@ -0,0 +1,100 @@
+/*-------------------------------------------------------------------------
+ *
+ * mapsuper.h
+ *	  Umbra metadata superblock helpers.
+ *
+ * The superblock is stored in metadata block 0. Its first 512 bytes contain a
+ * versioned payload plus CRC, and the remainder of the block is reserved.
+ *
+ * src/include/storage/mapsuper.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MAPSUPER_H
+#define MAPSUPER_H
+
+#include "access/xlogdefs.h"
+#include "port/pg_crc32c.h"
+#include "storage/block.h"
+#include "storage/smgr.h"
+
+#define MAP_SUPERBLOCK_MAGIC		0x554D4252U	/* "UMBR" */
+#define MAP_SUPERBLOCK_VERSION		1U
+#define MAP_SUPERBLOCK_SIZE			512
+#define MAP_SUPERBLOCK_PAYLOAD_SIZE 64
+
+#define MAP_SUPERBLOCK_FLAG_SKIP_WAL_PENDING 0x00000001U
+
+typedef struct pg_attribute_packed() MapSuperblockData
+{
+	uint32		magic;
+	uint32		version;
+	uint32		blcksz;
+	uint32		flags;
+
+	BlockNumber next_free_phys_block_main;
+	BlockNumber phys_capacity_main;
+	BlockNumber next_free_phys_block_fsm;
+	BlockNumber phys_capacity_fsm;
+	BlockNumber next_free_phys_block_vm;
+	BlockNumber phys_capacity_vm;
+
+	BlockNumber logical_nblocks_main;
+	BlockNumber logical_nblocks_fsm;
+	BlockNumber logical_nblocks_vm;
+
+	XLogRecPtr	last_updated_lsn;
+	pg_crc32c	crc;
+} MapSuperblockData;
+
+typedef union MapSuperblock
+{
+	MapSuperblockData data;
+	char		padding[MAP_SUPERBLOCK_SIZE];
+} MapSuperblock;
+
+typedef char MapSuperblockDataSizeCheck
+[(sizeof(MapSuperblockData) == MAP_SUPERBLOCK_PAYLOAD_SIZE) ? 1 : -1];
+typedef char MapSuperblockDataCRCOffsetCheck
+[(offsetof(MapSuperblockData, crc) == 60) ? 1 : -1];
+typedef char MapSuperblockSizeCheck
+[(sizeof(MapSuperblock) == MAP_SUPERBLOCK_SIZE) ? 1 : -1];
+
+extern void MapSuperblockInit(MapSuperblock *super, uint32 flags);
+extern bool MapSuperblockHasValidIdentity(const MapSuperblock *super);
+extern bool MapSuperblockIsValid(const MapSuperblock *super);
+extern bool MapSuperblockCheckCRC(const MapSuperblock *super);
+extern void MapSuperblockRefreshCRC(MapSuperblock *super);
+
+extern void MapSuperblockSetFlags(MapSuperblock *super, uint32 flags);
+extern uint32 MapSuperblockGetFlags(const MapSuperblock *super);
+
+extern void MapSuperblockSetLastUpdatedLSN(MapSuperblock *super, XLogRecPtr lsn);
+extern XLogRecPtr MapSuperblockGetLastUpdatedLSN(const MapSuperblock *super);
+
+extern BlockNumber MapSuperblockGetNextFreePhysBlock(const MapSuperblock *super,
+													 ForkNumber forknum);
+extern void MapSuperblockSetNextFreePhysBlock(MapSuperblock *super,
+											  ForkNumber forknum,
+											  BlockNumber blkno);
+
+extern BlockNumber MapSuperblockGetPhysCapacity(const MapSuperblock *super,
+												ForkNumber forknum);
+extern void MapSuperblockSetPhysCapacity(MapSuperblock *super, ForkNumber forknum,
+										 BlockNumber blkno);
+
+extern BlockNumber MapSuperblockGetLogicalNblocks(const MapSuperblock *super,
+												  ForkNumber forknum);
+extern void MapSuperblockSetLogicalNblocks(MapSuperblock *super, ForkNumber forknum,
+										   BlockNumber nblocks);
+
+extern void MapSuperblockPackPage(const MapSuperblock *super, char page[BLCKSZ]);
+extern void MapSuperblockUnpackPage(MapSuperblock *super, const char page[BLCKSZ]);
+
+extern bool MapSBlockRead(SMgrRelation reln, MapSuperblock *super);
+extern void MapSBlockWrite(SMgrRelation reln, const MapSuperblock *super,
+						   bool skipFsync);
+extern void MapSBlockInitNew(SMgrRelation reln, uint32 flags, XLogRecPtr lsn,
+							 bool skipFsync);
+
+#endif							/* MAPSUPER_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 1076717b92..8d06d69b51 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -113,6 +113,12 @@ extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
 						  BlockNumber blocknum, BlockNumber nblocks);
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
 extern BlockNumber smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum);
+extern void smgrcreaterelationmetadata(SMgrRelation reln);
+extern void smgrcopyrelationmetadata(SMgrRelation src, SMgrRelation dst,
+									 char relpersistence);
+extern void smgrsyncrelationmetadata(SMgrRelation reln);
+extern void smgrunlinkrelationmetadata(RelFileLocatorBackend rlocator,
+									   bool isRedo);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 						 BlockNumber *old_nblocks,
 						 BlockNumber *nblocks);
diff --git a/src/include/storage/umbra.h b/src/include/storage/umbra.h
index 30e033fcf0..2fb3c2f75e 100644
--- a/src/include/storage/umbra.h
+++ b/src/include/storage/umbra.h
@@ -34,6 +34,13 @@ extern void uminit(void);
 extern void umopen(SMgrRelation reln);
 extern void umclose(SMgrRelation reln, ForkNumber forknum);
 extern void umdestroy(SMgrRelation reln);
+extern bool umisinternalfork(ForkNumber forknum);
+extern void umcreaterelationmetadata(SMgrRelation reln);
+extern void umcopyrelationmetadata(SMgrRelation src, SMgrRelation dst,
+								   char relpersistence);
+extern void umsyncrelationmetadata(SMgrRelation reln);
+extern void umunlinkrelationmetadata(RelFileLocatorBackend rlocator,
+									 bool isRedo);
 extern void umcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern bool umexists(SMgrRelation reln, ForkNumber forknum);
 extern void umunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
-- 
2.50.1 (Apple Git-155)
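
Editorial note on the map.h layout comment in the patch above (block 0 holds the superblock, followed by repeated groups of one FSM map page, one VM map page, and 8192 MAIN map pages). The sketch below shows how a fork-relative map page index could be turned into a metadata-fork block number under that layout. It is an illustration only, not the patch's code: every `sketch_*` / `SKETCH_*` name is hypothetical, BLCKSZ is assumed to be the default 8192, and the in-group ordering [FSM][VM][MAIN...] together with the "one FSM/VM page per group" indexing is inferred from the layout comments, since the corresponding function body is not fully shown in the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirror map.h in the patch; BLCKSZ = 8192 is an assumption. */
#define SKETCH_BLCKSZ            8192u
#define SKETCH_ENTRIES_PER_PAGE  ((uint32_t) (SKETCH_BLCKSZ / sizeof(uint32_t)))
#define SKETCH_FIRST_GROUP       1u		/* block 0 is the superblock */
#define SKETCH_GROUP_FSM_PAGES   1u
#define SKETCH_GROUP_VM_PAGES    1u
#define SKETCH_GROUP_MAIN_PAGES  8192u
#define SKETCH_GROUP_TOTAL_PAGES \
	(SKETCH_GROUP_FSM_PAGES + SKETCH_GROUP_VM_PAGES + SKETCH_GROUP_MAIN_PAGES)

/* Hypothetical stand-in for ForkNumber, limited to the mapped forks. */
typedef enum { SKETCH_MAIN, SKETCH_FSM, SKETCH_VM } SketchFork;

/*
 * Translate (fork, fork-relative map page index) into a metadata block.
 * Assumed in-group order: [FSM page][VM page][8192 MAIN pages].
 */
uint64_t
sketch_map_page_blkno(SketchFork fork, uint32_t fork_page_idx)
{
	uint64_t	group_no;
	uint64_t	in_group;

	assert(fork == SKETCH_MAIN || fork == SKETCH_FSM || fork == SKETCH_VM);

	switch (fork)
	{
		case SKETCH_FSM:
			/* assumption: FSM map page N lives in group N */
			group_no = fork_page_idx;
			in_group = 0;
			break;
		case SKETCH_VM:
			/* assumption: VM map page N lives in group N, after the FSM page */
			group_no = fork_page_idx;
			in_group = SKETCH_GROUP_FSM_PAGES;
			break;
		default:
			/* MAIN pages fill the remaining 8192 slots of each group */
			group_no = fork_page_idx / SKETCH_GROUP_MAIN_PAGES;
			in_group = SKETCH_GROUP_FSM_PAGES + SKETCH_GROUP_VM_PAGES +
				fork_page_idx % SKETCH_GROUP_MAIN_PAGES;
			break;
	}

	return (uint64_t) SKETCH_FIRST_GROUP +
		group_no * (uint64_t) SKETCH_GROUP_TOTAL_PAGES + in_group;
}

/* Each map page covers SKETCH_ENTRIES_PER_PAGE logical blocks. */
uint64_t
sketch_map_page_for_lblkno(SketchFork fork, uint32_t lblkno)
{
	return sketch_map_page_blkno(fork, lblkno / SKETCH_ENTRIES_PER_PAGE);
}
```

Under these assumptions the first group occupies metadata blocks 1 through 8194 and the second group starts at block 8195, so the formula stays fixed however large MAIN grows.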

Mingwei Jia

i@nayishan.top

22 days ago

In reply to: Mingwei Jia (#4)

[RFC PATCH v2 RESEND 05/10] umbra: add patch 4 shared-memory MAP cache and checkpoint flush

---
src/backend/access/transam/xlog.c | 6 +
src/backend/commands/dbcommands.c | 19 +
src/backend/storage/map/Makefile | 4 +
src/backend/storage/map/map.c | 1047 ++++++-
src/backend/storage/map/mapbuf.c | 414 +++
src/backend/storage/map/mapclock.c | 457 +++
src/backend/storage/map/mapflush.c | 665 ++++
src/backend/storage/map/mapinit.c | 143 +
src/backend/storage/map/mapsuper.c | 1259 +++++++-
src/backend/storage/map/meson.build | 4 +
src/backend/storage/smgr/smgr.c | 52 +-
src/backend/storage/smgr/umbra.c | 339 ++-
src/backend/storage/smgr/umfile.c | 2700 ++++++++++++-----
src/backend/storage/sync/sync.c | 12 +-
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/init/postinit.c | 8 +-
src/include/storage/lwlocklist.h | 1 +
src/include/storage/map.h | 247 +-
src/include/storage/map_internal.h | 28 +
src/include/storage/mapsuper.h | 28 +-
src/include/storage/mapsuper_internal.h | 157 +
src/include/storage/smgr.h | 6 +
src/include/storage/subsystemlist.h | 3 +
src/include/storage/sync.h | 3 +
src/include/storage/umbra.h | 12 +
src/include/storage/umfile.h | 118 +-
src/test/recovery/meson.build | 3 +
.../t/053_umbra_map_superblock_watermark.pl | 104 +
.../recovery/t/054_umbra_map_fork_policy.pl | 62 +
...3_umbra_mainfork_head_unlink_checkpoint.pl | 60 +
30 files changed, 6991 insertions(+), 971 deletions(-)
create mode 100644 src/backend/storage/map/mapbuf.c
create mode 100644 src/backend/storage/map/mapclock.c
create mode 100644 src/backend/storage/map/mapflush.c
create mode 100644 src/backend/storage/map/mapinit.c
create mode 100644 src/include/storage/map_internal.h
create mode 100644 src/include/storage/mapsuper_internal.h
create mode 100644 src/test/recovery/t/053_umbra_map_superblock_watermark.pl
create mode 100644 src/test/recovery/t/054_umbra_map_fork_policy.pl
create mode 100644 src/test/recovery/t/063_umbra_mainfork_head_unlink_checkpoint.pl
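
Editorial note: MapTryLookupPblkRun(), which this patch adds to map.c, returns the length of the longest run of logical blocks whose mapped physical blocks are consecutive and stay within one relation segment. The sketch below isolates that run-detection rule over a plain in-memory array of entries. It is a simplified illustration under stated assumptions, not the patch's implementation: `sketch_pblk_run` and its parameters are hypothetical, the segment size is passed in rather than taken from RELSEG_SIZE, and buffer pinning, locking, and crossing from one MAP page to the next are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit pattern as PostgreSQL's InvalidBlockNumber. */
#define SKETCH_INVALID_BLOCK 0xFFFFFFFFu

/*
 * Count how many entries, starting at entries[first], map to consecutive
 * physical blocks.  Stops at an unmapped entry, at a gap, at maxblocks, or
 * when the run would cross a segment boundary of seg_blocks blocks.
 * Returns 0 if the first entry is unmapped; otherwise *start_pblkno is set.
 */
uint32_t
sketch_pblk_run(const uint32_t *entries, size_t nentries, size_t first,
				uint32_t maxblocks, uint32_t seg_blocks,
				uint32_t *start_pblkno)
{
	uint32_t	run = 0;

	assert(entries != NULL && start_pblkno != NULL && seg_blocks > 0);

	for (size_t i = first; i < nentries && run < maxblocks; i++)
	{
		uint32_t	pblkno = entries[i];

		if (pblkno == SKETCH_INVALID_BLOCK)
			break;				/* unmapped entry ends the run */

		if (run == 0)
		{
			*start_pblkno = pblkno;
			run = 1;
			continue;
		}

		if (pblkno != *start_pblkno + run)
			break;				/* not physically contiguous */

		if ((*start_pblkno % seg_blocks) + run >= seg_blocks)
			break;				/* next block would be in another segment */

		run++;
	}

	return run;
}
```

A caller could then issue one I/O for the whole run instead of translating and reading each block separately, which is the batching the function comment in the diff describes.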

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f85b528608..d1bf13b951 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -98,6 +98,9 @@
 #include "storage/spin.h"
 #include "storage/subsystems.h"
 #include "storage/sync.h"
+#ifdef USE_UMBRA
+#include "storage/map.h"
+#endif
 #include "utils/guc_hooks.h"
 #include "utils/guc_tables.h"
 #include "utils/injection_point.h"
@@ -8062,6 +8065,9 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointSUBTRANS();
 	CheckPointMultiXact();
 	CheckPointPredicate();
+#ifdef USE_UMBRA
+	MapCheckpoint();
+#endif
 	CheckPointBuffers(flags);

 	/* Perform all queued up fsyncs */
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f0819d15ab..8751886bb6 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1059,6 +1059,17 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					 errhint("Valid strategies are \"wal_log\" and \"file_copy\".")));
 	}

+	/*
+	 * Umbra currently supports only the legacy file-copy semantics for CREATE
+	 * DATABASE.
+	 *
+	 * Accept STRATEGY = WAL_LOG for compatibility, but route through the
+	 * file-copy path until Umbra owns a WAL_LOG implementation.
+	 */
+	if (dbstrategy == CREATEDB_WAL_LOG &&
+		!smgrcreatedballowswallog())
+		dbstrategy = CREATEDB_FILE_COPY;
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
@@ -1873,6 +1884,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 * dirty buffer to the dead database later...
 	 */
 	DropDatabaseBuffers(db_id);
+	smgrinvalidatedatabase(db_id);

 	/*
 	 * Tell checkpointer to forget any pending fsync and unlink requests for
@@ -2164,6 +2176,7 @@ movedb(const char *dbname, const char *tblspcname)
 	 * src_tblspcoid, but bufmgr.c presently provides no API for that.
 	 */
 	DropDatabaseBuffers(db_id);
+	smgrinvalidatedatabasetablespaces(db_id, 1, &src_tblspcoid);

 	/*
 	 * Check for existence of files in the target directory, i.e., objects of
@@ -3370,9 +3383,12 @@ dbase_redo(XLogReaderState *record)
 		 * up-to-date for the copy.
 		 */
 		FlushDatabaseBuffers(xlrec->src_db_id);
+		smgrcheckpointdatabasetablespaces(xlrec->src_db_id, 1,
+										  &xlrec->src_tablespace_id);

 		/* Close all smgr fds in all backends. */
 		WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
+		smgrreleaseall();

 		/*
 		 * Copy this subdirectory to the new location
@@ -3431,6 +3447,8 @@ dbase_redo(XLogReaderState *record)

 		/* Drop pages for this database that are in the shared buffer cache */
 		DropDatabaseBuffers(xlrec->db_id);
+		smgrinvalidatedatabasetablespaces(xlrec->db_id, xlrec->ntablespaces,
+										  xlrec->tablespace_ids);

 		/* Also, clean out any fsync requests that might be pending in md.c */
 		ForgetDatabaseSyncRequests(xlrec->db_id);
@@ -3440,6 +3458,7 @@ dbase_redo(XLogReaderState *record)

 		/* Close all smgr fds in all backends. */
 		WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
+		smgrreleaseall();

 		for (i = 0; i < xlrec->ntablespaces; i++)
 		{
diff --git a/src/backend/storage/map/Makefile b/src/backend/storage/map/Makefile
index ee9603de14..08c3b69679 100644
--- a/src/backend/storage/map/Makefile
+++ b/src/backend/storage/map/Makefile
@@ -14,6 +14,10 @@ include $(top_builddir)/src/Makefile.global

 OBJS = \
 	map.o \
+	mapinit.o \
+	mapbuf.o \
+	mapflush.o \
+	mapclock.o \
 	mapsuper.o

 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/map/map.c b/src/backend/storage/map/map.c
index 563f38b21a..1c74aa94ef 100644
--- a/src/backend/storage/map/map.c
+++ b/src/backend/storage/map/map.c
@@ -1,10 +1,10 @@
 /*-------------------------------------------------------------------------
  *
  * map.c
- *	  Umbra metadata-fork disk layout helpers.
+ *	  physical map layer implementation
  *
- * This file contains address-translation and in-page access routines for the
- * metadata fork disk layout.
+ * This module owns MAP metadata layout helpers and shared cache/checkpoint
+ * support for MAP pages.
  *
  * src/backend/storage/map/map.c
  *
@@ -12,47 +12,155 @@
  */
 #include "postgres.h"

+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "common/hashfn.h"
+#include "common/relpath.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "storage/freespace.h"
 #include "storage/map.h"
-#include "storage/um_defs.h"
+#include "storage/map_internal.h"
+#include "storage/mapsuper.h"
+#include "storage/mapsuper_internal.h"
+#include "storage/procnumber.h"
+#include "storage/shmem.h"
+#include "storage/sync.h"
+#include "storage/umfile.h"
+#include "utils/memutils.h"

-void
-MapPageInit(MapPage *page)
+typedef struct MapTruncatePreloadState
+{
+	bool			active;
+	RelFileLocator	rnode;
+	ForkNumber		forknum;
+	int				nslots;
+	int				capacity;
+	int			   *slots;
+} MapTruncatePreloadState;
+
+static MapTruncatePreloadState MapTruncatePreload[MAX_FORKNUM + 1];
+
+
+typedef enum MapCachedLookupResult
 {
-	Assert(page != NULL);
+	MAP_CACHED_LOOKUP_MISS,
+	MAP_CACHED_LOOKUP_UNMAPPED,
+	MAP_CACHED_LOOKUP_MAPPED
+} MapCachedLookupResult;
+
+/* Internal functions */
+static bool MapTablespaceSelected(Oid spcOid, int ntablespaces,
+								  const Oid *tablespace_ids);
+static bool MapTruncateEntryRange(ForkNumber forknum, BlockNumber n_lblknos,
+								  BlockNumber old_n_lblknos,
+								  BlockNumber *start_map_page,
+								  int *start_entry_idx,
+								  BlockNumber *end_map_page,
+								  int *end_entry_idx);
+
+static void
+MapTruncatePreloadResetEntry(MapTruncatePreloadState *state)
+{
+	int i;
+
+	if (!state->active)
+		return;
+
+	for (i = 0; i < state->nslots; i++)
+		MapUnpinBuffer(state->slots[i]);

-	MemSet(page->pblknos, 0xFF, sizeof(page->pblknos));
+	state->active = false;
+	state->nslots = 0;
+	state->forknum = InvalidForkNumber;
+	memset(&state->rnode, 0, sizeof(state->rnode));
 }

-BlockNumber
-MapPageGetEntry(const MapPage *page, int entry_idx)
+static MapTruncatePreloadState *
+MapTruncatePreloadEntry(RelFileLocator rnode, ForkNumber forknum)
 {
-	Assert(page != NULL);
+	MapTruncatePreloadState *state;
+
+	Assert(forknum >= 0 && forknum <= MAX_FORKNUM);
+	state = &MapTruncatePreload[forknum];

-	if (entry_idx < 0 || entry_idx >= MAP_ENTRIES_PER_PAGE)
-		elog(ERROR, "map entry index %d is out of range", entry_idx);
+	if (state->active &&
+		(!RelFileLocatorEquals(state->rnode, rnode) ||
+		 state->forknum != forknum))
+		MapTruncatePreloadResetEntry(state);

-	return page->pblknos[entry_idx];
+	return state;
 }

+BlockNumber MapForkPageIndexToMapBlkno(ForkNumber forknum,
+									   BlockNumber fork_page_idx);
+BlockNumber MapLblknoToMapBlkno(ForkNumber forknum, BlockNumber lblkno);
+static bool MapDecodeMapBlkno(BlockNumber map_blkno, ForkNumber *forknum,
+							  BlockNumber *fork_page_idx);
+static bool MapMapPageWithinLogicalRange(UmbraFileContext *map_ctx,
+										 RelFileLocator rnode,
+										 ForkNumber forknum,
+										 BlockNumber map_blkno);
+static MapCachedLookupResult MapTryLookupCachedEntry(RelFileLocator rnode,
+													 ForkNumber forknum,
+													 BlockNumber map_blkno,
+													 int entry_idx,
+													 bool adjust_usage,
+													 BlockNumber *pblkno);
+static MapCachedLookupResult MapTryLookupCachedPblknoInternal(RelFileLocator rnode,
+															  ForkNumber forknum,
+															  BlockNumber lblkno,
+															  bool adjust_usage,
+															  BlockNumber *pblkno);
 void
-MapPageSetEntry(MapPage *page, int entry_idx, BlockNumber pblkno)
+MapResetAllTruncatePreloads(void)
+{
+	int slot_id;
+
+	for (slot_id = 0; slot_id <= MAX_FORKNUM; slot_id++)
+	{
+		MapTruncatePreload[slot_id].active = false;
+		MapTruncatePreload[slot_id].nslots = 0;
+	}
+}
+
+
+static bool
+MapTruncateEntryRange(ForkNumber forknum, BlockNumber n_lblknos,
+					  BlockNumber old_n_lblknos,
+					  BlockNumber *start_map_page,
+					  int *start_entry_idx,
+					  BlockNumber *end_map_page,
+					  int *end_entry_idx)
 {
-	Assert(page != NULL);
+	BlockNumber	start_page_idx;
+	BlockNumber	end_page_idx;

-	if (entry_idx < 0 || entry_idx >= MAP_ENTRIES_PER_PAGE)
-		elog(ERROR, "map entry index %d is out of range", entry_idx);
+	(void) forknum;

-	page->pblknos[entry_idx] = pblkno;
+	if (old_n_lblknos <= n_lblknos)
+		return false;
+
+	start_page_idx = n_lblknos / MAP_ENTRIES_PER_PAGE;
+	end_page_idx = (old_n_lblknos - 1) / MAP_ENTRIES_PER_PAGE;
+
+	*start_map_page = start_page_idx;
+	*start_entry_idx = n_lblknos % MAP_ENTRIES_PER_PAGE;
+	*end_map_page = end_page_idx;
+	*end_entry_idx = (old_n_lblknos - 1) % MAP_ENTRIES_PER_PAGE;
+	return true;
 }

 BlockNumber
 MapForkPageIndexToMapBlkno(ForkNumber forknum, BlockNumber fork_page_idx)
 {
-	uint64		group_no;
-	uint64		blkno64;
+	uint64 group_no;
+	uint64 blkno64;

 	if (forknum == UMBRA_METADATA_FORKNUM)
-		elog(ERROR, "Umbra metadata fork cannot be addressed as a map target");
+		elog(ERROR, "Umbra metadata fork should not call MapForkPageIndexToMapBlkno");

switch (forknum)
{
@@ -71,8 +179,7 @@ MapForkPageIndexToMapBlkno(ForkNumber forknum, BlockNumber fork_page_idx)

 		case MAIN_FORKNUM:
 		{
-			uint64		group_page_idx = (uint64) fork_page_idx;
-
+			uint64 group_page_idx = (uint64) fork_page_idx;
 			group_no = group_page_idx / (uint64) MAP_GROUP_MAIN_PAGES;
 			blkno64 = (uint64) MAP_BLOCK_FIRST_GROUP +
 				group_no * (uint64) MAP_GROUP_TOTAL_PAGES +
@@ -83,19 +190,26 @@ MapForkPageIndexToMapBlkno(ForkNumber forknum, BlockNumber fork_page_idx)
 		}

 		default:
-			elog(ERROR, "unsupported fork number %d in map layout", (int) forknum);
-			pg_unreachable();
+			elog(ERROR, "unsupported fork number %d in map lookup", (int) forknum);
+			return 0;
 	}

 	if (blkno64 > (uint64) MaxBlockNumber)
 		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("cannot address map page %u for fork %d",
+				 errmsg("cannot address map page %u for fork %d in MAP",
 						fork_page_idx, forknum)));

return (BlockNumber) blkno64;
}

+/*
+ * MapLblknoToMapBlkno - convert (forknum, lblkno) to a linear MAP entry index.
+ *
+ * The metadata fork stores repeated proportional groups:
+ * [FSM page][VM page][8192 MAIN pages].
+ * Each fork page still maps MAP_ENTRIES_PER_PAGE logical blocks.
+ */
 BlockNumber
 MapLblknoToMapBlkno(ForkNumber forknum, BlockNumber lblkno)
 {
@@ -110,22 +224,19 @@ MapLblknoToMapBlkno(ForkNumber forknum, BlockNumber lblkno)
 	if (entry64 > (uint64) MaxBlockNumber)
 		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("cannot address logical block %u for fork %d in map",
+				 errmsg("cannot address logical block %u for fork %d in MAP",
 						lblkno, forknum)));

return (BlockNumber) entry64;
}

-bool
+static bool
 MapDecodeMapBlkno(BlockNumber map_blkno, ForkNumber *forknum,
 				  BlockNumber *fork_page_idx)
 {
-	uint64		offset;
-	uint64		group_no;
-	uint64		in_group;
-
-	Assert(forknum != NULL);
-	Assert(fork_page_idx != NULL);
+	uint64 offset;
+	uint64 group_no;
+	uint64 in_group;

if (map_blkno == MAP_BLOCK_SUPER || map_blkno < MAP_BLOCK_FIRST_GROUP)
return false;
@@ -160,3 +271,869 @@ MapDecodeMapBlkno(BlockNumber map_blkno, ForkNumber *forknum,

 	return false;
 }
+
+/*
+ * MapMapPageWithinLogicalRange - whether a MAP page intersects the current
+ * logical mapping domain of the target fork.
+ *
+ * This check is superblock-driven and keeps sparse holes outside the logical
+ * range from being interpreted as real MAP pages.
+ */
+static bool
+MapMapPageWithinLogicalRange(UmbraFileContext *map_ctx, RelFileLocator rnode,
+							 ForkNumber forknum, BlockNumber map_blkno)
+{
+	BlockNumber n_lblknos;
+	ForkNumber	page_forknum;
+	BlockNumber	page_idx;
+	uint64		page_first_lblk;
+
+	if (!MapSBlockTryGetLogicalNblocks(map_ctx, rnode, forknum, &n_lblknos))
+		return true;
+
+	if (!MapDecodeMapBlkno(map_blkno, &page_forknum, &page_idx))
+		return false;
+
+	if (page_forknum != forknum)
+		return false;
+
+	page_first_lblk = (uint64) page_idx * (uint64) MAP_ENTRIES_PER_PAGE;
+	if (page_first_lblk >= (uint64) n_lblknos)
+		return false;
+
+	return true;
+}
+
+/*
+ * MapTryLookupCachedEntry - read a cached MAP entry without performing I/O.
+ *
+ * The caller supplies the decoded MAP page and entry index so cache hit
+ * handling stays in one place.  The result distinguishes between:
+ * - cache miss / stale slot
+ * - cached page with an unmapped entry
+ * - cached page with a valid mapping
+ */
+static MapCachedLookupResult
+MapTryLookupCachedEntry(RelFileLocator rnode, ForkNumber forknum,
+						  BlockNumber map_blkno, int entry_idx,
+						  bool adjust_usage, BlockNumber *pblkno)
+{
+	int				slot_id;
+	MapBufferDesc   *buf;
+	MapPage		   *page;
+	BlockNumber		value;
+
+	slot_id = MapCacheLookup(rnode, forknum, map_blkno);
+	if (slot_id < 0)
+		return MAP_CACHED_LOOKUP_MISS;
+
+	buf = &MapBuffers[slot_id];
+	MapPinBuffer(slot_id, adjust_usage);
+	LWLockAcquire(&buf->buffer_lock, LW_SHARED);
+
+	if (buf->page_number != map_blkno ||
+		buf->page_number < 0 ||
+		!RelFileLocatorEquals(buf->rnode, rnode) ||
+		buf->forknum != forknum)
+	{
+		LWLockRelease(&buf->buffer_lock);
+		MapUnpinBuffer(slot_id);
+		return MAP_CACHED_LOOKUP_MISS;
+	}
+
+	page = MapGetPage(slot_id);
+	value = page->pblknos[entry_idx];
+	LWLockRelease(&buf->buffer_lock);
+	MapUnpinBuffer(slot_id);
+
+	if (value == InvalidBlockNumber)
+		return MAP_CACHED_LOOKUP_UNMAPPED;
+
+	*pblkno = value;
+	return MAP_CACHED_LOOKUP_MAPPED;
+}
+
+static MapCachedLookupResult
+MapTryLookupCachedPblknoInternal(RelFileLocator rnode, ForkNumber forknum,
+								   BlockNumber lblkno, bool adjust_usage,
+								   BlockNumber *pblkno)
+{
+	BlockNumber		map_blkno;
+	int				entry_idx;
+
+	Assert(pblkno != NULL);
+
+	if (forknum == UMBRA_METADATA_FORKNUM)
+		return MAP_CACHED_LOOKUP_MISS;
+
+	map_blkno = MapLblknoToMapBlkno(forknum, lblkno);
+	entry_idx = map_blkno % MAP_ENTRIES_PER_PAGE;
+	map_blkno = map_blkno / MAP_ENTRIES_PER_PAGE;
+
+	return MapTryLookupCachedEntry(rnode, forknum, map_blkno, entry_idx,
+								   adjust_usage, pblkno);
+}
+
+/*
+ * MapTryLookup - try to find physical block number for a logical block.
+ *
+ * Returns true and sets *pblkno when a valid mapping exists.
+ * Returns false if the MAP fork is absent or the entry is still unmapped.
+ */
+bool
+MapTryLookup(UmbraFileContext *map_ctx, RelFileLocator rnode, ForkNumber forknum,
+			 BlockNumber lblkno, BlockNumber *pblkno)
+{
+	BlockNumber	map_blkno;
+	int			slot_id;
+	uint32_t	state;
+	MapPage    *page;
+	MapBufferDesc *buf;
+	int			entry_idx;
+	MapCachedLookupResult cache_result;
+
+	Assert(pblkno != NULL);
+
+	if (forknum == UMBRA_METADATA_FORKNUM)
+		elog(ERROR, "MapTryLookup does not accept Umbra metadata fork");
+
+	cache_result = MapTryLookupCachedPblknoInternal(rnode, forknum, lblkno,
+													 true, pblkno);
+	if (cache_result != MAP_CACHED_LOOKUP_MISS)
+		return cache_result == MAP_CACHED_LOOKUP_MAPPED;
+
+	/* Convert (forknum, lblkno) to MAP page and entry index */
+	map_blkno = MapLblknoToMapBlkno(forknum, lblkno);
+	entry_idx = map_blkno % MAP_ENTRIES_PER_PAGE;
+	map_blkno = map_blkno / MAP_ENTRIES_PER_PAGE;
+
+	/* Find or load the map page - returns with buffer pinned */
+	slot_id = MapReadBuffer(map_ctx, rnode, forknum, map_blkno);
+	buf = &MapBuffers[slot_id];
+	page = MapGetPage(slot_id);
+
+	/* Verify buffer is pinned */
+	state = pg_atomic_read_u32(&buf->state);
+	if (!(state & MAPBUF_VALID_MASK))
+		elog(ERROR, "map buffer not pinned");
+
+	LWLockAcquire(&buf->buffer_lock, LW_SHARED);
+	*pblkno = page->pblknos[entry_idx];
+	LWLockRelease(&buf->buffer_lock);
+	MapUnpinBuffer(slot_id);
+
+	return (*pblkno != InvalidBlockNumber);
+}
+
+/*
+ * MapTryLookupPblkRun - find the longest contiguous mapped pblk run.
+ *
+ * Returns the number of blocks in the run beginning at lblkno, up to
+ * maxblocks. Returns 0 if the first entry is unmapped.
+ *
+ * This batches translation by MAP page, so callers don't need a full
+ * MapTryLookup() round trip for every block in a contiguous run.
+ */
+BlockNumber
+MapTryLookupPblkRun(UmbraFileContext *map_ctx, RelFileLocator rnode,
+					ForkNumber forknum, BlockNumber lblkno,
+					BlockNumber maxblocks, BlockNumber *start_pblkno)
+{
+	BlockNumber current_lblk = lblkno;
+	BlockNumber remaining = maxblocks;
+	BlockNumber run_blocks = 0;
+	BlockNumber expected_next_pblk = InvalidBlockNumber;
+	BlockNumber current_map_blkno = InvalidBlockNumber;
+	int			current_slot = -1;
+
+	Assert(start_pblkno != NULL);
+	Assert(maxblocks > 0);
+
+	if (forknum == UMBRA_METADATA_FORKNUM)
+		elog(ERROR, "MapTryLookupPblkRun does not accept Umbra metadata fork");
+
+	if (!umfile_ctx_fork_exists(map_ctx, UMBRA_METADATA_FORKNUM,
+								UMFILE_EXISTS_DENSE))
+		return 0;
+
+	while (remaining > 0)
+	{
+		BlockNumber	map_entry_no;
+		BlockNumber	map_blkno;
+		int			entry_idx;
+		int			entries_this_page;
+		MapBufferDesc *buf;
+		MapPage	   *page;
+
+		map_entry_no = MapLblknoToMapBlkno(forknum, current_lblk);
+		entry_idx = map_entry_no % MAP_ENTRIES_PER_PAGE;
+		map_blkno = map_entry_no / MAP_ENTRIES_PER_PAGE;
+		entries_this_page = Min((BlockNumber) (MAP_ENTRIES_PER_PAGE - entry_idx),
+								remaining);
+
+		if (current_slot < 0 || current_map_blkno != map_blkno)
+		{
+			if (current_slot >= 0)
+				MapUnpinBuffer(current_slot);
+			current_slot = MapReadBuffer(map_ctx, rnode, forknum, map_blkno);
+			current_map_blkno = map_blkno;
+		}
+
+		buf = &MapBuffers[current_slot];
+		page = MapGetPage(current_slot);
+
+		LWLockAcquire(&buf->buffer_lock, LW_SHARED);
+		for (int i = 0; i < entries_this_page; i++)
+		{
+			BlockNumber	pblkno = page->pblknos[entry_idx + i];
+
+			if (pblkno == InvalidBlockNumber)
+			{
+				LWLockRelease(&buf->buffer_lock);
+				goto done;
+			}
+
+			if (run_blocks == 0)
+			{
+				*start_pblkno = pblkno;
+				expected_next_pblk = pblkno + 1;
+				run_blocks = 1;
+				current_lblk++;
+				remaining--;
+				continue;
+			}
+
+			if (pblkno != expected_next_pblk)
+			{
+				LWLockRelease(&buf->buffer_lock);
+				goto done;
+			}
+
+			if (((*start_pblkno % ((BlockNumber) RELSEG_SIZE)) + run_blocks) >=
+				((BlockNumber) RELSEG_SIZE))
+			{
+				LWLockRelease(&buf->buffer_lock);
+				goto done;
+			}
+
+			expected_next_pblk++;
+			run_blocks++;
+			current_lblk++;
+			remaining--;
+		}
+		LWLockRelease(&buf->buffer_lock);
+	}
+
+done:
+	if (current_slot >= 0)
+		MapUnpinBuffer(current_slot);
+
+	return run_blocks;
+}
+
+
+/*
+ * MapReadBuffer - read a map page into buffer
+ *
+ * Returns the slot_id of the buffer, with the buffer pinned.
+ *
+ * The caller owns the returned pin and must release it with MapUnpinBuffer().
+ */
+int
+MapReadBuffer(UmbraFileContext *map_ctx, RelFileLocator rnode,
+			  ForkNumber forknum, BlockNumber map_blkno)
+{
+	int			slot_id;
+	uint32_t	state;
+	MapPage    *page;
+	MapBufferDesc *buf;
+	BlockNumber map_nblocks;
+	int			old_page_number;
+	ForkNumber	old_forknum;
+	RelFileLocator old_rnode;
+
+	if (map_blkno == MAP_BLOCK_SUPER)
+		elog(ERROR, "MapReadBuffer cannot be used for MAP superblock");
+
+	for (;;)
+	{
+		int			existing_slot_id;
+		bool		retry = false;
+
+		slot_id = MapCacheLookup(rnode, forknum, map_blkno);
+		if (slot_id >= 0)
+		{
+			buf = &MapBuffers[slot_id];
+
+			MapPinBuffer(slot_id, true);
+			LWLockAcquire(&buf->buffer_lock, LW_SHARED);
+
+			if (buf->page_number == map_blkno &&
+				buf->page_number >= 0 &&
+				RelFileLocatorEquals(buf->rnode, rnode) &&
+				buf->forknum == forknum)
+			{
+				LWLockRelease(&buf->buffer_lock);
+				return slot_id;
+			}
+
+			LWLockRelease(&buf->buffer_lock);
+			MapUnpinBuffer(slot_id);
+			continue;
+		}
+
+		slot_id = MapClockGetBuffer();
+		buf = &MapBuffers[slot_id];
+		MapPinBuffer(slot_id, false);
+
+		LWLockAcquire(&buf->buffer_lock, LW_EXCLUSIVE);
+
+		if (buf->page_number == map_blkno &&
+			buf->page_number >= 0 &&
+			RelFileLocatorEquals(buf->rnode, rnode) &&
+			buf->forknum == forknum)
+		{
+			LWLockRelease(&buf->buffer_lock);
+			return slot_id;
+		}
+
+		state = pg_atomic_read_u32(&buf->state);
+		if (MAPBUF_GET_REFCOUNT(state) != 1)
+		{
+			LWLockRelease(&buf->buffer_lock);
+			MapUnpinBuffer(slot_id);
+			continue;
+		}
+
+		if (state & MAPBUF_DIRTY)
+		{
+			LWLockRelease(&buf->buffer_lock);
+			MapFlushBuffer(slot_id);
+
+			LWLockAcquire(&buf->buffer_lock, LW_EXCLUSIVE);
+			if (buf->page_number == map_blkno &&
+				buf->page_number >= 0 &&
+				RelFileLocatorEquals(buf->rnode, rnode) &&
+				buf->forknum == forknum)
+			{
+				LWLockRelease(&buf->buffer_lock);
+				return slot_id;
+			}
+
+			state = pg_atomic_read_u32(&buf->state);
+			if (MAPBUF_GET_REFCOUNT(state) != 1 ||
+				(state & MAPBUF_DIRTY))
+			{
+				LWLockRelease(&buf->buffer_lock);
+				MapUnpinBuffer(slot_id);
+				continue;
+			}
+		}
+		old_page_number = buf->page_number;
+		old_forknum = buf->forknum;
+		old_rnode = buf->rnode;
+		existing_slot_id = MapCacheInsert(rnode, forknum, map_blkno, slot_id);
+		if (existing_slot_id >= 0 && existing_slot_id != slot_id)
+			retry = true;
+		if (retry)
+		{
+			LWLockRelease(&buf->buffer_lock);
+			MapUnpinBuffer(slot_id);
+			continue;
+		}
+
+		if (old_page_number >= 0)
+			MapCacheDelete(old_rnode, old_forknum,
+						   (BlockNumber) old_page_number, slot_id);
+
+		buf->page_number = map_blkno;
+		buf->rnode = rnode;
+		buf->forknum = forknum;
+		buf->page_lsn = 0;
+		MapBufferUpdateStateBits(buf, MAPBUF_USAGECOUNT_ONE, 0);
+
+		page = MapGetPage(slot_id);
+		if (umfile_ctx_fork_exists(map_ctx, UMBRA_METADATA_FORKNUM,
+								   UMFILE_EXISTS_DENSE))
+		{
+			map_nblocks = umfile_ctx_get_nblocks(map_ctx, UMBRA_METADATA_FORKNUM,
+												 UMFILE_NBLOCKS_DENSE);
+			if (map_blkno < map_nblocks &&
+				MapMapPageWithinLogicalRange(map_ctx, rnode, forknum, map_blkno))
+			{
+				umfile_ctx_read(map_ctx, UMBRA_METADATA_FORKNUM, map_blkno,
+								(char *) page, BLCKSZ);
+				MapBufferUpdateStateBits(buf, 0, MAPBUF_NOT_MATERIALIZED);
+
+				if (pg_memory_is_all_zeros(page, BLCKSZ))
+				{
+					BlockNumber	n_lblknos = 0;
+					ForkNumber	page_forknum;
+					BlockNumber	page_idx;
+					bool		need_this_page = false;
+
+					if (MapSBlockTryGetLogicalNblocks(map_ctx, rnode, forknum,
+													 &n_lblknos) &&
+						n_lblknos > 0 &&
+						MapDecodeMapBlkno(map_blkno, &page_forknum, &page_idx) &&
+						page_forknum == forknum)
+					{
+						uint64 page_first_lblk =
+							(uint64) page_idx * (uint64) MAP_ENTRIES_PER_PAGE;
+
+						need_this_page = page_first_lblk < (uint64) n_lblknos;
+					}
+
+					if (need_this_page)
+						ereport(ERROR,
+								(errcode(ERRCODE_DATA_CORRUPTED),
+								 errmsg("MAP page %u is all-zeros for relation %u/%u/%u fork %d",
+										map_blkno, rnode.spcOid, rnode.dbOid,
+										rnode.relNumber, forknum)));
+
+					MemSet(page, 0xFF, BLCKSZ);
+				}
+			}
+			else
+			{
+				MemSet(page, 0xFF, BLCKSZ);
+				if (map_blkno >= map_nblocks)
+					MapBufferUpdateStateBits(buf, MAPBUF_NOT_MATERIALIZED, 0);
+				else
+					MapBufferUpdateStateBits(buf, 0, MAPBUF_NOT_MATERIALIZED);
+			}
+		}
+		else
+		{
+			MemSet(page, 0xFF, BLCKSZ);
+			MapBufferUpdateStateBits(buf, MAPBUF_NOT_MATERIALIZED, 0);
+		}
+
+		LWLockRelease(&buf->buffer_lock);
+		return slot_id;
+	}
+}
+
+/*
+ * MapDrop - drop mapping for a relation
+ */
+void
+MapDrop(RelFileLocator rnode)
+{
+	RelFileLocatorBackend rnode_backend;
+
+	rnode_backend.locator = rnode;
+	rnode_backend.backend = INVALID_PROC_NUMBER;
+
+	MapInvalidateRelation(rnode);
+	umfile_ctx_unlinkfork(rnode_backend, UMBRA_METADATA_FORKNUM, false);
+}
+
+/*
+ * MapTruncate - truncate mapping when relation is truncated
+ */
+void
+MapTruncate(UmbraFileContext *map_ctx, RelFileLocator rnode,
+			ForkNumber forknum, BlockNumber n_lblknos,
+			XLogRecPtr map_lsn)
+{
+	BlockNumber old_n_lblknos = 0;
+	BlockNumber end_page_idx;
+	BlockNumber start_page_idx;
+	int         start_entry_idx;
+	int         end_entry_idx;
+	BlockNumber page_idx;
+
+	if (forknum == UMBRA_METADATA_FORKNUM)
+		return;
+
+	Assert(map_ctx != NULL);
+	Assert(map_lsn != InvalidXLogRecPtr);
+	if (map_lsn == InvalidXLogRecPtr)
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("invalid truncate WAL LSN for relation %u/%u/%u fork %d",
+						rnode.spcOid, rnode.dbOid, rnode.relNumber, forknum),
+				 errdetail("truncate target logical block count: %u", n_lblknos)));
+
+	if (!umfile_ctx_fork_exists(map_ctx, UMBRA_METADATA_FORKNUM,
+								UMFILE_EXISTS_DENSE))
+	{
+		Assert(false);
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("required MAP fork is missing during truncate for relation %u/%u/%u fork %d",
+						rnode.spcOid, rnode.dbOid, rnode.relNumber, forknum),
+				 errdetail("truncate target logical block count: %u", n_lblknos)));
+	}
+
+	if (!MapSBlockTryGetLogicalNblocks(map_ctx, rnode, forknum, &old_n_lblknos))
+		return;
+
+	if (!MapTruncateEntryRange(forknum, n_lblknos, old_n_lblknos,
+							   &start_page_idx, &start_entry_idx,
+							   &end_page_idx, &end_entry_idx))
+		return;
+
+	for (page_idx = start_page_idx; page_idx <= end_page_idx; page_idx++)
+	{
+		int          slot_id;
+		int          begin_idx;
+		int          last_idx;
+		Size         clear_bytes;
+		BlockNumber  map_blkno;
+		MapPage     *page;
+		MapBufferDesc *buf;
+
+		map_blkno = MapForkPageIndexToMapBlkno(forknum, page_idx);
+		if (map_blkno >= umfile_ctx_get_nblocks(map_ctx, UMBRA_METADATA_FORKNUM,
+												UMFILE_NBLOCKS_DENSE))
+			break;
+
+		slot_id = MapReadBuffer(map_ctx, rnode, forknum, map_blkno);
+		buf = &MapBuffers[slot_id];
+		page = MapGetPage(slot_id);
+
+		begin_idx = (page_idx == start_page_idx) ? start_entry_idx : 0;
+		last_idx = (page_idx == end_page_idx) ? end_entry_idx : (MAP_ENTRIES_PER_PAGE - 1);
+
+		LWLockAcquire(&buf->buffer_lock, LW_EXCLUSIVE);
+		if (begin_idx == 0 && last_idx == (MAP_ENTRIES_PER_PAGE - 1))
+		{
+			/* Fast path: the whole map page range is invalidated. */
+			MemSet(page->pblknos, 0xFF, MAP_ENTRIES_PER_PAGE * sizeof(uint32));
+		}
+		else
+		{
+			/* Boundary pages: invalidate only the requested subrange. */
+			clear_bytes = ((Size) (last_idx - begin_idx + 1)) * sizeof(uint32);
+			MemSet(&page->pblknos[begin_idx], 0xFF, clear_bytes);
+		}
+
+		/* Associate truncate-driven map rewrite with truncate WAL LSN. */
+		MapMarkBufferDirty(map_ctx, buf, map_lsn);
+
+		LWLockRelease(&buf->buffer_lock);
+
+		MapUnpinBuffer(slot_id);
+	}
+
+	/*
+	 * Keep dirty map pages in cache and let checkpoint/bgwriter flush them.
+	 * Invalidating relation slots here would clear dirty state before writeback.
+	 */
+}
+
+void
+MapPreloadTruncatePages(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						ForkNumber forknum, BlockNumber n_lblknos)
+{
+	MapTruncatePreloadState *state;
+	BlockNumber old_n_lblknos = 0;
+	BlockNumber start_page_idx;
+	BlockNumber end_page_idx;
+	int start_entry_idx;
+	int end_entry_idx;
+	BlockNumber page_idx;
+
+	if (forknum == UMBRA_METADATA_FORKNUM ||
+		!umfile_ctx_fork_exists(map_ctx, UMBRA_METADATA_FORKNUM,
+								UMFILE_EXISTS_DENSE))
+		return;
+
+	if (!MapSBlockTryGetLogicalNblocks(map_ctx, rnode, forknum, &old_n_lblknos))
+		return;
+
+	if (!MapTruncateEntryRange(forknum, n_lblknos, old_n_lblknos,
+							   &start_page_idx, &start_entry_idx,
+							   &end_page_idx, &end_entry_idx))
+		return;
+
+	state = MapTruncatePreloadEntry(rnode, forknum);
+	MapTruncatePreloadResetEntry(state);
+
+	state->active = true;
+	state->rnode = rnode;
+	state->forknum = forknum;
+
+	for (page_idx = start_page_idx; page_idx <= end_page_idx; page_idx++)
+	{
+		int slot_id;
+		BlockNumber map_blkno;
+
+		if (state->nslots == state->capacity)
+		{
+			int newcap = state->capacity == 0 ? 4 : state->capacity * 2;
+
+			if (state->slots == NULL)
+				state->slots = MemoryContextAlloc(TopMemoryContext,
+												 sizeof(int) * newcap);
+			else
+				state->slots = repalloc(state->slots, sizeof(int) * newcap);
+			state->capacity = newcap;
+		}
+
+		map_blkno = MapForkPageIndexToMapBlkno(forknum, page_idx);
+		if (map_blkno >= umfile_ctx_get_nblocks(map_ctx, UMBRA_METADATA_FORKNUM,
+												UMFILE_NBLOCKS_DENSE))
+			break;
+
+		slot_id = MapReadBuffer(map_ctx, rnode, forknum, map_blkno);
+		state->slots[state->nslots++] = slot_id;
+	}
+}
+
+void
+MapReleasePreloadedTruncatePages(RelFileLocator rnode, ForkNumber forknum)
+{
+	MapTruncatePreloadState *state;
+
+	Assert(forknum >= 0 && forknum <= MAX_FORKNUM);
+	state = &MapTruncatePreload[forknum];
+
+	if (!state->active)
+		return;
+
+	if (!RelFileLocatorEquals(state->rnode, rnode) || state->forknum != forknum)
+		return;
+
+	MapTruncatePreloadResetEntry(state);
+}
+
+/*
+ * MapInvalidateRelation - invalidate all map cache entries for one relation.
+ */
+void
+MapInvalidateRelation(RelFileLocator rnode)
+{
+	int			slot_id;
+
+	for (slot_id = 0; slot_id < map_buffers; slot_id++)
+	{
+		MapBufferDesc *buf = &MapBuffers[slot_id];
+		int			page_number;
+		ForkNumber	forknum;
+		RelFileLocator slot_rnode;
+
+		LWLockAcquire(&buf->buffer_lock, LW_SHARED);
+		page_number = buf->page_number;
+		forknum = buf->forknum;
+		slot_rnode = buf->rnode;
+		LWLockRelease(&buf->buffer_lock);
+
+		if (page_number < 0 || !RelFileLocatorEquals(slot_rnode, rnode))
+			continue;
+
+		MapCacheDelete(slot_rnode, forknum, (BlockNumber) page_number, slot_id);
+		MapInvalidateBuffer(slot_id, slot_rnode, forknum,
+							(BlockNumber) page_number);
+	}
+
+	/* Remove dedicated superblock cache entry for this relation. */
+	MapSuperDeleteEntry(rnode);
+}
+
+static bool
+MapTablespaceSelected(Oid spcOid, int ntablespaces, const Oid *tablespace_ids)
+{
+	int			i;
+
+	if (ntablespaces <= 0 || tablespace_ids == NULL)
+		return true;
+
+	for (i = 0; i < ntablespaces; i++)
+	{
+		if (tablespace_ids[i] == spcOid)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * MapInvalidateDatabaseTablespaces - invalidate MAP metadata/cache for a DB.
+ *
+ * If ntablespaces <= 0 (or tablespace_ids is NULL), invalidate all tablespaces
+ * of that DB. Otherwise only invalidate entries whose spcOid is in the list.
+ *
+ * This is needed because database OIDs and relfilenodes can be reused after
+ * DROP/CREATE churn. Without DB-scope invalidation, stale MAP buffer/cache/
+ * super entries can survive and be incorrectly reused by relations in the
+ * recreated DB.
+ */
+void
+MapInvalidateDatabaseTablespaces(Oid dbid, int ntablespaces,
+								 const Oid *tablespace_ids)
+{
+	int			slot_id;
+
+	/* Invalidate per-buffer cached pages */
+	for (slot_id = 0; slot_id < map_buffers; slot_id++)
+	{
+		MapBufferDesc *buf = &MapBuffers[slot_id];
+		int			page_number;
+		ForkNumber	forknum;
+		RelFileLocator slot_rnode;
+
+		LWLockAcquire(&buf->buffer_lock, LW_SHARED);
+		page_number = buf->page_number;
+		forknum = buf->forknum;
+		slot_rnode = buf->rnode;
+		LWLockRelease(&buf->buffer_lock);
+
+		if (page_number < 0 ||
+			slot_rnode.dbOid != dbid ||
+			!MapTablespaceSelected(slot_rnode.spcOid, ntablespaces, tablespace_ids))
+			continue;
+
+		MapCacheDelete(slot_rnode, forknum, (BlockNumber) page_number, slot_id);
+		MapInvalidateBuffer(slot_id, slot_rnode, forknum,
+							(BlockNumber) page_number);
+	}
+
+	/* Invalidate dedicated superblock cache entries for matching relations */
+	{
+		RelFileLocator *targets;
+		int			target_cap = 256;
+		int			target_count = 0;
+		int			i;
+
+		targets = palloc(sizeof(RelFileLocator) * target_cap);
+
+		for (slot_id = 0; slot_id < MapSuperCapacity; slot_id++)
+		{
+			MapSuperEntry *entry = MapSuperEntryBySlot(slot_id);
+			RelFileLocator rnode;
+
+			LWLockAcquire(&entry->lock, LW_SHARED);
+			if (!entry->in_use ||
+				entry->key.rnode.dbOid != dbid ||
+				!MapTablespaceSelected(entry->key.rnode.spcOid, ntablespaces,
+									   tablespace_ids))
+			{
+				LWLockRelease(&entry->lock);
+				continue;
+			}
+
+			rnode = entry->key.rnode;
+			LWLockRelease(&entry->lock);
+
+			if (target_count >= target_cap)
+			{
+				target_cap *= 2;
+				targets = repalloc(targets, sizeof(RelFileLocator) * target_cap);
+			}
+			targets[target_count++] = rnode;
+		}
+
+		for (i = 0; i < target_count; i++)
+			MapSuperDeleteEntry(targets[i]);
+
+		pfree(targets);
+	}
+}
+
+/*
+ * MapInvalidateDatabase - invalidate all MAP metadata/cache for one database.
+ */
+void
+MapInvalidateDatabase(Oid dbid)
+{
+	MapInvalidateDatabaseTablespaces(dbid, 0, NULL);
+}
+
+/*
+ * MapGetLogicalBlockCount - persisted logical block count, or 0 if unavailable.
+ */
+BlockNumber
+MapGetLogicalBlockCount(UmbraFileContext *map_ctx, RelFileLocator rnode, ForkNumber forknum)
+{
+	BlockNumber n_lblknos = 0;
+
+	if (!MapSBlockTryGetLogicalNblocks(map_ctx, rnode, forknum, &n_lblknos))
+		return 0;
+
+	return n_lblknos;
+}
+
+/*
+ * MapGetPhysicalBlockCount - physical block count needed for first n lblknos
+ *
+ * Returns max(mapped pblkno in [0, n_lblknos)) + 1, or 0 if nothing in that
+ * range is mapped (n_lblknos if there is no MAP fork). Truncate uses this to
+ * avoid cutting off still-referenced physical blocks under a non-identity map.
+ */
+BlockNumber
+MapGetPhysicalBlockCount(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						 ForkNumber forknum, BlockNumber n_lblknos)
+{
+	BlockNumber n_map_pages;
+	BlockNumber current_page = InvalidBlockNumber;
+	BlockNumber page_idx;
+	BlockNumber page_count;
+	BlockNumber max_pblkno = InvalidBlockNumber;
+	int         current_slot = -1;
+
+	if (n_lblknos == 0)
+		return 0;
+
+	if (!umfile_ctx_fork_exists(map_ctx, UMBRA_METADATA_FORKNUM,
+								UMFILE_EXISTS_DENSE))
+		return n_lblknos;
+
+	n_map_pages = umfile_ctx_get_nblocks(map_ctx, UMBRA_METADATA_FORKNUM,
+										 UMFILE_NBLOCKS_DENSE);
+	if (n_map_pages == 0)
+		return 0;
+	page_count = (n_lblknos + MAP_ENTRIES_PER_PAGE - 1) / MAP_ENTRIES_PER_PAGE;
+	for (page_idx = 0; page_idx < page_count; page_idx++)
+	{
+		BlockNumber	page_no = MapForkPageIndexToMapBlkno(forknum, page_idx);
+		int			entry_idx;
+		int			limit_idx;
+		MapPage	   *page;
+		MapBufferDesc *buf;
+
+		if (page_no >= n_map_pages)
+			break;
+
+		if (page_no != current_page)
+		{
+			if (current_slot >= 0)
+				MapUnpinBuffer(current_slot);
+			current_slot = MapReadBuffer(map_ctx, rnode, forknum, page_no);
+			current_page = page_no;
+		}
+
+		buf = &MapBuffers[current_slot];
+		page = MapGetPage(current_slot);
+		LWLockAcquire(&buf->buffer_lock, LW_SHARED);
+		limit_idx = MAP_ENTRIES_PER_PAGE;
+		if (page_idx == page_count - 1 && (n_lblknos % MAP_ENTRIES_PER_PAGE) != 0)
+			limit_idx = n_lblknos % MAP_ENTRIES_PER_PAGE;
+		for (entry_idx = 0; entry_idx < limit_idx; entry_idx++)
+		{
+			BlockNumber pblkno = page->pblknos[entry_idx];
+
+			if (pblkno == InvalidBlockNumber)
+				continue;
+
+			if (max_pblkno == InvalidBlockNumber || pblkno > max_pblkno)
+				max_pblkno = pblkno;
+		}
+		LWLockRelease(&buf->buffer_lock);
+	}
+
+	if (current_slot >= 0)
+		MapUnpinBuffer(current_slot);
+
+	if (max_pblkno == InvalidBlockNumber)
+		return 0;
+	if (max_pblkno == InvalidBlockNumber - 1)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("cannot represent physical block count beyond %u",
+						InvalidBlockNumber - 1)));
+
+	return max_pblkno + 1;
+}
diff --git a/src/backend/storage/map/mapbuf.c b/src/backend/storage/map/mapbuf.c
new file mode 100644
index 0000000000..cb8b59dfbc
--- /dev/null
+++ b/src/backend/storage/map/mapbuf.c
@@ -0,0 +1,414 @@
+/*-------------------------------------------------------------------------
+ *
+ * mapbuf.c
+ *	  MAP buffer state, pinning, and I/O helpers.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "storage/map.h"
+#include "storage/map_internal.h"
+#include "utils/memutils.h"
+
+/* backend-local state: in-progress I/O owner, private pin counts, context */
+static MapBufferDesc *InProgressMapBuf = NULL;
+static int		   *MapPrivateRefCount = NULL;
+static MemoryContext MapLocalCxt = NULL;
+
+static void MapWaitIO(MapBufferDesc *buf);
+static void MapEnsureBufferMaterialized(UmbraFileContext *map_ctx,
+										MapBufferDesc *buf);
+
+void
+MapEnsurePrivateRefCount(void)
+{
+	if (MapPrivateRefCount == NULL)
+	{
+		if (MapLocalCxt == NULL)
+		{
+			MapLocalCxt = AllocSetContextCreate(TopMemoryContext,
+												"MapLocal",
+												ALLOCSET_DEFAULT_SIZES);
+			MemoryContextAllowInCriticalSection(MapLocalCxt, true);
+		}
+		MapPrivateRefCount = MemoryContextAllocZero(MapLocalCxt,
+													map_buffers * sizeof(int));
+	}
+}
+
+void
+MapBufferUpdateStateBits(MapBufferDesc *buf, uint32 set_bits, uint32 clear_bits)
+{
+	for (;;)
+	{
+		uint32		old_state;
+		uint32		new_state;
+
+		old_state = pg_atomic_read_u32(&buf->state);
+		new_state = (old_state | set_bits) & ~clear_bits;
+		if (pg_atomic_compare_exchange_u32(&buf->state, &old_state, new_state))
+			return;
+	}
+}
+
+static void
+MapEnsureBufferMaterialized(UmbraFileContext *map_ctx, MapBufferDesc *buf)
+{
+	uint32		state;
+	BlockNumber map_nblocks;
+	BlockNumber map_blkno;
+
+	Assert(map_ctx != NULL);
+	Assert(buf != NULL);
+	Assert(LWLockHeldByMeInMode(&buf->buffer_lock, LW_EXCLUSIVE));
+	Assert(buf->page_number >= 0);
+	Assert(buf->page_number != MAP_BLOCK_SUPER);
+
+	state = pg_atomic_read_u32(&buf->state);
+	if ((state & MAPBUF_NOT_MATERIALIZED) == 0)
+		return;
+
+	if (!umfile_ctx_fork_exists(map_ctx, UMBRA_METADATA_FORKNUM,
+								UMFILE_EXISTS_DENSE))
+		elog(PANIC,
+			 "cannot materialize MAP page %d for relation %u/%u/%u without MAP fork",
+			 buf->page_number,
+			 buf->rnode.spcOid,
+			 buf->rnode.dbOid,
+			 buf->rnode.relNumber);
+
+	map_nblocks = umfile_ctx_get_nblocks(map_ctx, UMBRA_METADATA_FORKNUM,
+										 UMFILE_NBLOCKS_DENSE);
+	map_blkno = (BlockNumber) buf->page_number;
+
+	if (map_blkno >= map_nblocks)
+	{
+		/*
+		 * Mirror buffer-pool extension ownership: create the physical block
+		 * at first dirtying, not during checkpoint flush.
+		 */
+		umfile_zeroextend(map_ctx, UMBRA_METADATA_FORKNUM,
+						  map_nblocks,
+						  (int) (map_blkno + 1 - map_nblocks),
+						  false);
+	}
+
+	MapBufferUpdateStateBits(buf, 0, MAPBUF_NOT_MATERIALIZED);
+}
+
+void
+MapMarkBufferDirty(UmbraFileContext *map_ctx, MapBufferDesc *buf,
+				   XLogRecPtr page_lsn)
+{
+	Assert(buf != NULL);
+	Assert(LWLockHeldByMeInMode(&buf->buffer_lock, LW_EXCLUSIVE));
+
+	if (buf->page_number != MAP_BLOCK_SUPER)
+		MapEnsureBufferMaterialized(map_ctx, buf);
+
+	buf->page_lsn = page_lsn;
+	MapBufferUpdateStateBits(buf, MAPBUF_DIRTY | MAPBUF_JUST_DIRTIED, 0);
+}
+
+/*
+ * MapWaitIO -- Block until MAPBUF_IO_IN_PROGRESS is cleared.
+ */
+static void
+MapWaitIO(MapBufferDesc *buf)
+{
+	for (;;)
+	{
+		uint32		state;
+
+		state = pg_atomic_read_u32(&buf->state);
+		if (!(state & MAPBUF_IO_IN_PROGRESS))
+			break;
+
+		LWLockAcquire(&buf->io_in_progress_lock, LW_SHARED);
+		LWLockRelease(&buf->io_in_progress_lock);
+	}
+}
+
+/*
+ * MapStartBufferIO -- begin output I/O on this map buffer.
+ *
+ * Returns true if caller should perform I/O; false if page is already clean or
+ * no longer has the caller-required state bits.
+ */
+bool
+MapStartBufferIO(MapBufferDesc *buf, uint32 required_bits)
+{
+	uint32		state;
+
+	Assert(!InProgressMapBuf);
+
+	for (;;)
+	{
+		LWLockAcquire(&buf->io_in_progress_lock, LW_EXCLUSIVE);
+		state = pg_atomic_read_u32(&buf->state);
+
+		if (!(state & MAPBUF_IO_IN_PROGRESS))
+			break;
+
+		/*
+		 * Another backend is finishing I/O (or recovering from an error); wait
+		 * for the in-progress bit to clear before retrying.
+		 */
+		LWLockRelease(&buf->io_in_progress_lock);
+		MapWaitIO(buf);
+	}
+
+	if ((state & MAPBUF_DIRTY) == 0 ||
+		(state & required_bits) != required_bits)
+	{
+		LWLockRelease(&buf->io_in_progress_lock);
+		return false;
+	}
+
+	for (;;)
+	{
+		uint32		new_state;
+		uint32		expected;
+
+		if ((state & MAPBUF_DIRTY) == 0 ||
+			(state & required_bits) != required_bits)
+		{
+			LWLockRelease(&buf->io_in_progress_lock);
+			return false;
+		}
+
+		expected = state;
+		new_state = (state | MAPBUF_IO_IN_PROGRESS) &
+			~(MAPBUF_IO_ERROR | MAPBUF_JUST_DIRTIED);
+		if (pg_atomic_compare_exchange_u32(&buf->state, &expected, new_state))
+			break;
+		state = expected;
+	}
+
+	InProgressMapBuf = buf;
+	return true;
+}
+
+/*
+ * MapTerminateBufferIO -- complete output I/O state transition.
+ *
+ * Assumes this backend owns I/O on buf.
+ */
+void
+MapTerminateBufferIO(MapBufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
+{
+	for (;;)
+	{
+		uint32		old_state;
+		uint32		new_state;
+
+		old_state = pg_atomic_read_u32(&buf->state);
+		Assert(old_state & MAPBUF_IO_IN_PROGRESS);
+
+		new_state = old_state & ~(MAPBUF_IO_IN_PROGRESS | MAPBUF_IO_ERROR);
+		if (clear_dirty)
+		{
+			new_state &= ~MAPBUF_CHECKPOINT_NEEDED;
+			if (!(old_state & MAPBUF_JUST_DIRTIED))
+				new_state &= ~MAPBUF_DIRTY;
+		}
+		new_state |= set_flag_bits;
+
+		if (pg_atomic_compare_exchange_u32(&buf->state, &old_state, new_state))
+			break;
+	}
+
+	InProgressMapBuf = NULL;
+	LWLockRelease(&buf->io_in_progress_lock);
+}
+
+/*
+ * MapAbortBufferIO -- clean up map buffer I/O after an ERROR.
+ */
+void
+MapAbortBufferIO(void)
+{
+	MapBufferDesc *buf = InProgressMapBuf;
+	uint32		state;
+
+	if (buf == NULL)
+		return;
+
+	LWLockAcquire(&buf->io_in_progress_lock, LW_EXCLUSIVE);
+
+	state = pg_atomic_read_u32(&buf->state);
+	if (state & MAPBUF_IO_IN_PROGRESS)
+		MapTerminateBufferIO(buf, false, MAPBUF_IO_ERROR);
+	else
+	{
+		InProgressMapBuf = NULL;
+		LWLockRelease(&buf->io_in_progress_lock);
+	}
+}
+
+void
+MapBackendExitCleanup(void)
+{
+	int			slot_id;
+
+	/*
+	 * First clear in-progress map I/O ownership, so other waiters can make
+	 * progress even if the current backend is leaving via ERROR/abort.
+	 */
+	MapAbortBufferIO();
+	if (MapPrivateRefCount == NULL)
+		return;
+
+	/* Release all map pins held by this backend. */
+	for (slot_id = 0; slot_id < map_buffers; slot_id++)
+	{
+		while (MapPrivateRefCount[slot_id] > 0)
+			MapUnpinBuffer(slot_id);
+	}
+
+	MapResetAllTruncatePreloads();
+
+#ifdef USE_ASSERT_CHECKING
+	Assert(InProgressMapBuf == NULL);
+	for (slot_id = 0; slot_id < map_buffers; slot_id++)
+	{
+		Assert(MapPrivateRefCount[slot_id] == 0);
+		Assert(!LWLockHeldByMe(&MapBuffers[slot_id].buffer_lock));
+		Assert(!LWLockHeldByMe(&MapBuffers[slot_id].io_in_progress_lock));
+	}
+#endif
+}
+
+/*
+ * MapPinBuffer - pin a map buffer
+ *
+ * Increments the refcount for the buffer. If adjust_usage is true,
+ * also increments the usage_count (up to max 5).
+ */
+void
+MapPinBuffer(int slot_id, bool adjust_usage)
+{
+	uint32_t	state;
+
+	MapEnsurePrivateRefCount();
+
+	/* Increment shared refcount first. */
+	while (true)
+	{
+		uint32_t	old_state = pg_atomic_read_u32(&MapBuffers[slot_id].state);
+		uint32_t	new_state = old_state + 1;
+
+		if (MAPBUF_GET_REFCOUNT(old_state) >= MAPBUF_VALID_MASK)
+			elog(ERROR, "map buffer reference count overflow");
+
+		if (pg_atomic_compare_exchange_u32(&MapBuffers[slot_id].state,
+										   &old_state, new_state))
+		{
+			state = new_state;
+			break;
+		}
+	}
+
+	MapPrivateRefCount[slot_id]++;
+	Assert(MapPrivateRefCount[slot_id] > 0);
+
+	/*
+	 * Increment usage count if requested. The cap is re-checked on every
+	 * retry so that concurrent pinners cannot push the count past 5. A
+	 * failed compare-exchange refreshes "state" with the current value, so
+	 * the loop condition always tests up-to-date bits.
+	 */
+	while (adjust_usage && MAPBUF_GET_USAGECOUNT(state) < 5)
+	{
+		if (pg_atomic_compare_exchange_u32(&MapBuffers[slot_id].state,
+										   &state,
+										   state + MAPBUF_USAGECOUNT_ONE))
+			break;
+	}
+}
+
+/*
+ * MapUnpinBuffer - unpin a map buffer
+ *
+ * Decrements the refcount for the buffer.
+ */
+void
+MapUnpinBuffer(int slot_id)
+{
+	MapEnsurePrivateRefCount();
+
+	if (MapPrivateRefCount[slot_id] == 0)
+		elog(ERROR, "map buffer private refcount underflow");
+
+	while (true)
+	{
+		uint32_t	old_state = pg_atomic_read_u32(&MapBuffers[slot_id].state);
+		uint32_t	new_state = old_state - 1;
+
+		if (MAPBUF_GET_REFCOUNT(old_state) == 0)
+			elog(ERROR, "map buffer refcount underflow");
+
+		if (pg_atomic_compare_exchange_u32(&MapBuffers[slot_id].state,
+										   &old_state, new_state))
+			break;
+	}
+
+	MapPrivateRefCount[slot_id]--;
+}
+
+/*
+ * MapInvalidateBuffer - invalidate a buffer slot for a specific mapping tag.
+ *
+ * This follows buffer-pool invalidation semantics:
+ * - caller identifies expected tag and slot
+ * - if slot tag changed while waiting, do nothing
+ * - if slot is still pinned, wait/retry until safe to invalidate
+ */
+void
+MapInvalidateBuffer(int slot_id, RelFileLocator expected_rnode,
+					ForkNumber expected_forknum,
+					BlockNumber expected_map_blkno)
+{
+	MapBufferDesc *buf = &MapBuffers[slot_id];
+	uint32		state;
+
+retry:
+	LWLockAcquire(&buf->io_in_progress_lock, LW_EXCLUSIVE);
+
+	LWLockAcquire(&buf->buffer_lock, LW_EXCLUSIVE);
+	if (buf->page_number < 0 ||
+		buf->page_number != expected_map_blkno ||
+		buf->forknum != expected_forknum ||
+		!RelFileLocatorEquals(buf->rnode, expected_rnode))
+	{
+		LWLockRelease(&buf->buffer_lock);
+		LWLockRelease(&buf->io_in_progress_lock);
+		return;
+	}
+
+	state = pg_atomic_read_u32(&buf->state);
+	if (MAPBUF_GET_REFCOUNT(state) != 0)
+	{
+		LWLockRelease(&buf->buffer_lock);
+		LWLockRelease(&buf->io_in_progress_lock);
+
+		if (MapPrivateRefCount != NULL &&
+			MapPrivateRefCount[slot_id] > 0)
+			elog(ERROR, "map buffer is pinned in MapInvalidateBuffer");
+
+		MapWaitIO(buf);
+		goto retry;
+	}
+	buf->page_number = -1;
+	buf->forknum = InvalidForkNumber;
+	memset(&buf->rnode, 0, sizeof(RelFileLocator));
+	buf->page_lsn = 0;
+	LWLockRelease(&buf->buffer_lock);
+
+	/* Reset full state before returning slot to free list. */
+	pg_atomic_write_u32(&buf->state, 0);
+	MapClockFreeBuffer(slot_id);
+	LWLockRelease(&buf->io_in_progress_lock);
+}
diff --git a/src/backend/storage/map/mapclock.c b/src/backend/storage/map/mapclock.c
new file mode 100644
index 0000000000..6fa62e1c1a
--- /dev/null
+++ b/src/backend/storage/map/mapclock.c
@@ -0,0 +1,457 @@
+/*-------------------------------------------------------------------------
+ *
+ * mapclock.c
+ *	  clock sweep algorithm for map buffer replacement
+ *
+ * This implements a clock sweep algorithm similar to freelist.c,
+ * but for managing map buffers instead of data buffers.
+ *
+ * Also handles the map cache hash table, similar to buf_table.c.
+ *
+ * src/backend/storage/map/mapclock.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "storage/map.h"
+#include "storage/map_internal.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/hsearch.h"
+
+#define LOG2_NUM_MAP_CACHE_PARTITIONS 5
+#define NUM_MAP_CACHE_PARTITIONS (1 << LOG2_NUM_MAP_CACHE_PARTITIONS)
+
+typedef struct MapCacheTag
+{
+	RelFileLocator	rnode;
+	ForkNumber	forknum;
+	BlockNumber	map_blkno;
+} MapCacheTag;
+
+typedef struct MapCacheEntry
+{
+	MapCacheTag	key;
+	int			slot_id;
+} MapCacheEntry;
+
+static HTAB *MapCacheHash = NULL;
+static LWLockPadded *MapCachePartitionLocks = NULL;
+
+static inline uint32
+MapCacheHashCode(MapCacheTag *tag)
+{
+	Assert(MapCacheHash != NULL);
+	return get_hash_value(MapCacheHash, (void *) tag);
+}
+
+static inline LWLock *
+MapCachePartitionLock(uint32 hashcode)
+{
+	return &MapCachePartitionLocks[hashcode & (NUM_MAP_CACHE_PARTITIONS - 1)].lock;
+}
+
+void
+MapCacheTableShmemRequest(void)
+{
+	long		hash_size;
+
+	hash_size = Max((long) map_buffers, (long) map_buffers * 2L);
+
+	ShmemRequestStruct(.name = "Map Cache Partition Locks",
+					   .size = NUM_MAP_CACHE_PARTITIONS * sizeof(LWLockPadded),
+					   .ptr = (void **) &MapCachePartitionLocks,
+		);
+
+	ShmemRequestHash(.name = "Map Cache Lookup Table",
+					 .nelems = hash_size,
+					 .ptr = &MapCacheHash,
+					 .hash_info.keysize = sizeof(MapCacheTag),
+					 .hash_info.entrysize = sizeof(MapCacheEntry),
+					 .hash_info.num_partitions = NUM_MAP_CACHE_PARTITIONS,
+					 .hash_flags = HASH_ELEM | HASH_BLOBS | HASH_PARTITION,
+		);
+}
+
+void
+MapCacheTableShmemInit(void)
+{
+	int			i;
+
+	for (i = 0; i < NUM_MAP_CACHE_PARTITIONS; i++)
+		LWLockInitialize(&MapCachePartitionLocks[i].lock,
+						 LWTRANCHE_MAP_BUFFER_CONTENT);
+}
+
+/*
+ * MapCacheLookup - look up a buffer slot in the cache
+ * Returns slot_id if found, -1 otherwise
+ */
+int
+MapCacheLookup(RelFileLocator rnode, ForkNumber forknum, BlockNumber map_blkno)
+{
+	MapCacheTag	tag;
+	MapCacheEntry *entry;
+	uint32		hashcode;
+	int			slot_id = -1;
+	LWLock	   *partition_lock;
+
+	tag.rnode = rnode;
+	tag.forknum = forknum;
+	tag.map_blkno = map_blkno;
+	hashcode = MapCacheHashCode(&tag);
+	partition_lock = MapCachePartitionLock(hashcode);
+
+	LWLockAcquire(partition_lock, LW_SHARED);
+	entry = (MapCacheEntry *)
+		hash_search_with_hash_value(MapCacheHash,
+									(void *) &tag,
+									hashcode,
+									HASH_FIND,
+									NULL);
+	if (entry != NULL)
+		slot_id = entry->slot_id;
+	LWLockRelease(partition_lock);
+
+	return slot_id;
+}
+
+/*
+ * MapCacheInsert - insert a buffer slot into the cache.
+ *
+ * Returns -1 on successful insertion. If another slot already owns the tag,
+ * returns that slot id and leaves the existing entry unchanged.
+ */
+int
+MapCacheInsert(RelFileLocator rnode, ForkNumber forknum, BlockNumber map_blkno, int slot_id)
+{
+	MapCacheTag	tag;
+	MapCacheEntry *entry;
+	uint32		hashcode;
+	bool		found;
+	LWLock	   *partition_lock;
+
+	Assert(slot_id >= 0);
+
+	tag.rnode = rnode;
+	tag.forknum = forknum;
+	tag.map_blkno = map_blkno;
+	hashcode = MapCacheHashCode(&tag);
+	partition_lock = MapCachePartitionLock(hashcode);
+
+	LWLockAcquire(partition_lock, LW_EXCLUSIVE);
+	entry = (MapCacheEntry *)
+		hash_search_with_hash_value(MapCacheHash,
+									(void *) &tag,
+									hashcode,
+									HASH_ENTER,
+									&found);
+	if (found)
+	{
+		int			existing_slot_id = entry->slot_id;
+
+		LWLockRelease(partition_lock);
+		return existing_slot_id;
+	}
+
+	entry->slot_id = slot_id;
+	LWLockRelease(partition_lock);
+
+	return -1;
+}
+
+/*
+ * MapCacheDelete - remove a buffer slot from the cache
+ */
+void
+MapCacheDelete(RelFileLocator rnode, ForkNumber forknum, BlockNumber map_blkno,
+			   int slot_id)
+{
+	MapCacheTag	tag;
+	MapCacheEntry *entry;
+	uint32		hashcode;
+	LWLock	   *partition_lock;
+
+	Assert(slot_id >= 0);
+
+	tag.rnode = rnode;
+	tag.forknum = forknum;
+	tag.map_blkno = map_blkno;
+	hashcode = MapCacheHashCode(&tag);
+	partition_lock = MapCachePartitionLock(hashcode);
+
+	LWLockAcquire(partition_lock, LW_EXCLUSIVE);
+	entry = (MapCacheEntry *)
+		hash_search_with_hash_value(MapCacheHash,
+									(void *) &tag,
+									hashcode,
+									HASH_FIND,
+									NULL);
+	if (entry != NULL && entry->slot_id == slot_id)
+	{
+		(void) hash_search_with_hash_value(MapCacheHash,
+										   (void *) &tag,
+										   hashcode,
+										   HASH_REMOVE,
+										   NULL);
+	}
+	LWLockRelease(partition_lock);
+}
+
+/*
+ * ClockSweepTick - advance the clock hand
+ *
+ * Returns the next slot to examine.
+ */
+static inline uint32
+ClockSweepTick(void)
+{
+	uint32      victim;
+	int         num_slots;
+
+	num_slots = MapShared->num_slots;
+
+	/*
+	 * Atomically move hand ahead one slot.
+	 * Multiple processes can do this concurrently.
+	 */
+	victim = pg_atomic_fetch_add_u32(&MapShared->next_victim_buffer, 1);
+
+	/* Handle wraparound */
+	if (victim >= (uint32) num_slots)
+	{
+		uint32      originalVictim = victim;
+
+		/* What we actually look up in MapBuffers */
+		victim = victim % num_slots;
+
+		/*
+		 * If we're the one that just caused a wraparound, increment
+		 * complete_passes while holding clock_lock.
+		 */
+		if (victim == 0)
+		{
+			uint32      expected;
+			uint32      wrapped;
+			bool        success = false;
+
+			expected = originalVictim + 1;
+
+			while (!success)
+			{
+				SpinLockAcquire(&MapShared->clock_lock);
+
+				wrapped = expected % num_slots;
+
+				success = pg_atomic_compare_exchange_u32(
+					&MapShared->next_victim_buffer,
+					&expected, wrapped);
+				if (success)
+					MapShared->complete_passes++;
+
+				SpinLockRelease(&MapShared->clock_lock);
+			}
+		}
+	}
+
+	return victim;
+}
+
+/*
+ * MapClockGetBuffer - select a buffer slot using clock algorithm
+ *
+ * Returns a slot ID that is safe to use (not pinned).
+ * The caller is responsible for initializing the slot.
+ */
+int
+MapClockGetBuffer(void)
+{
+	MapBufferDesc *buf;
+	int         trycounter;
+	uint32      local_buf_state;
+	int         num_slots = MapShared->num_slots;
+
+	/*
+	 * First, check if there's a buffer on the free list.
+	 */
+	if (MapShared->first_free_buffer >= 0)
+	{
+		while (true)
+		{
+			int         slot_id;
+
+			SpinLockAcquire(&MapShared->clock_lock);
+
+			if (MapShared->first_free_buffer < 0)
+			{
+				SpinLockRelease(&MapShared->clock_lock);
+				break;
+			}
+
+			slot_id = MapShared->first_free_buffer;
+			buf = &MapBuffers[slot_id];
+
+			Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+			/* Remove from free list */
+			MapShared->first_free_buffer = buf->freeNext;
+			buf->freeNext = FREENEXT_NOT_IN_LIST;
+
+			SpinLockRelease(&MapShared->clock_lock);
+
+			/*
+			 * Check if the buffer is actually usable.
+			 * (It might have been used after being put on free list)
+			 */
+			local_buf_state = pg_atomic_read_u32(&buf->state);
+
+			if (MAPBUF_GET_REFCOUNT(local_buf_state) == 0 &&
+				MAPBUF_GET_USAGECOUNT(local_buf_state) == 0)
+			{
+				/* Found a usable buffer */
+				pg_atomic_fetch_add_u32(&MapShared->num_allocs, 1);
+				return slot_id;
+			}
+
+			/*
+			 * Buffer not usable (pinned or still has usage_count).
+			 *
+			 * Keep it off free list and let normal clock sweep handle it.
+			 * Re-queuing it at free-list head can livelock when the same
+			 * non-usable slot is popped repeatedly.
+			 */
+			continue;
+		}
+	}
+
+	/*
+	 * No free buffers, run the clock sweep algorithm.
+	 */
+	trycounter = num_slots;
+
+	for (;;)
+	{
+		uint32      victim_slot;
+
+		victim_slot = ClockSweepTick();
+		buf = &MapBuffers[victim_slot];
+
+		local_buf_state = pg_atomic_read_u32(&buf->state);
+
+		/*
+		 * If the buffer is pinned, we cannot use it.
+		 * If it has a non-zero usage_count, decrement it and continue.
+		 */
+		if (MAPBUF_GET_REFCOUNT(local_buf_state) == 0)
+		{
+			if (MAPBUF_GET_USAGECOUNT(local_buf_state) != 0)
+			{
+				/*
+				 * Decrement usage_count.  A failed CAS reloads old_state, so
+				 * recheck the count on each retry to avoid underflowing it.
+				 */
+				uint32		old_state = local_buf_state;
+
+				while (MAPBUF_GET_USAGECOUNT(old_state) != 0 &&
+					   !pg_atomic_compare_exchange_u32(&buf->state,
+													   &old_state,
+													   old_state - MAPBUF_USAGECOUNT_ONE))
+					;
+
+				/* Reset try counter since we made progress */
+				trycounter = num_slots;
+			}
+			else
+			{
+				/* Found a usable buffer */
+				pg_atomic_fetch_add_u32(&MapShared->num_allocs, 1);
+
+				/* Dirty-victim writeback is handled by caller (MapReadBuffer). */
+
+				return (int) victim_slot;
+			}
+		}
+		else if (--trycounter == 0)
+		{
+			/*
+			 * We've scanned all buffers and all are pinned.
+			 * This shouldn't happen with reasonable sizing.
+			 */
+			elog(ERROR, "no unpinned map buffers available");
+		}
+	}
+}
+
+/*
+ * MapClockFreeBuffer - return a buffer to the free list
+ *
+ * Low-level function that adds a buffer to the free list.
+ * The buffer's state should already be cleaned before calling this.
+ * This is called by MapInvalidateBuffer.
+ */
+void
+MapClockFreeBuffer(int slot_id)
+{
+	MapBufferDesc *buf;
+	uint32      state;
+
+	buf = &MapBuffers[slot_id];
+
+	/* Check if buffer is already on free list */
+	SpinLockAcquire(&MapShared->clock_lock);
+
+	if (buf->freeNext != FREENEXT_NOT_IN_LIST)
+	{
+		/* Already on free list, just return */
+		SpinLockRelease(&MapShared->clock_lock);
+		return;
+	}
+
+	/*
+	 * Free list must only contain fully reusable slots.
+	 * Caller is responsible for clearing refcount/usage first.
+	 */
+	state = pg_atomic_read_u32(&buf->state);
+	Assert(MAPBUF_GET_REFCOUNT(state) == 0);
+	Assert(MAPBUF_GET_USAGECOUNT(state) == 0);
+
+	/* Insert at head of free list */
+	buf->freeNext = MapShared->first_free_buffer;
+	MapShared->first_free_buffer = slot_id;
+
+	SpinLockRelease(&MapShared->clock_lock);
+}
+
+/*
+ * MapSyncStart - tell checkpoint where to start syncing
+ *
+ * Returns the starting slot ID for checkpoint sync.
+ */
+int
+MapSyncStart(uint32 *complete_passes, uint32 *num_allocs)
+{
+	uint32      next_victim;
+	int         result;
+
+	SpinLockAcquire(&MapShared->clock_lock);
+
+	next_victim = pg_atomic_read_u32(&MapShared->next_victim_buffer);
+	result = next_victim % MapShared->num_slots;
+
+	if (complete_passes)
+	{
+		*complete_passes = MapShared->complete_passes;
+		*complete_passes += next_victim / MapShared->num_slots;
+	}
+
+	if (num_allocs)
+	{
+		*num_allocs = pg_atomic_exchange_u32(&MapShared->num_allocs, 0);
+	}
+
+	SpinLockRelease(&MapShared->clock_lock);
+
+	return result;
+}
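For readers skimming the replacement policy in the file above, here is a minimal standalone sketch of the clock-sweep idea: the hand advances with an atomic fetch-add, a slot with a non-zero usage count gets a second chance, and the sweep gives up after one full pass of pinned slots. It is not the patch's code. `ToySlot`, `toy_tick`, `toy_get_victim` and the four-slot array are hypothetical names chosen for illustration, and slot state is updated without atomics here, whereas the patch packs refcount and usage count into one atomic state word.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_NSLOTS 4			/* hypothetical pool size */

typedef struct ToySlot
{
	int			refcount;		/* pins currently held */
	int			usage_count;	/* recent-use credit, decays once per visit */
} ToySlot;

static ToySlot toy_slots[TOY_NSLOTS];
static atomic_uint toy_hand;	/* monotonically increasing clock hand */

/* Advance the hand by one and map it onto the slot array. */
static uint32_t
toy_tick(void)
{
	return atomic_fetch_add(&toy_hand, 1) % TOY_NSLOTS;
}

/* Return an unpinned slot with zero usage_count, or -1 if all are pinned. */
static int
toy_get_victim(void)
{
	int			trycounter = TOY_NSLOTS;

	for (;;)
	{
		uint32_t	id = toy_tick();
		ToySlot    *slot = &toy_slots[id];

		if (slot->refcount == 0)
		{
			if (slot->usage_count == 0)
				return (int) id;
			slot->usage_count--;		/* second chance */
			trycounter = TOY_NSLOTS;	/* progress was made */
		}
		else if (--trycounter == 0)
			return -1;					/* one full pass saw only pins */
	}
}

/* Demo setup: slot 0 pinned, slots 1..3 with usage counts 2, 1, 0. */
static int
toy_demo_first_victim(void)
{
	atomic_store(&toy_hand, 0);
	toy_slots[0] = (ToySlot) {.refcount = 1, .usage_count = 0};
	toy_slots[1] = (ToySlot) {.refcount = 0, .usage_count = 2};
	toy_slots[2] = (ToySlot) {.refcount = 0, .usage_count = 1};
	toy_slots[3] = (ToySlot) {.refcount = 0, .usage_count = 0};
	return toy_get_victim();
}
```

In the demo the first sweep skips the pinned slot, decays slots 1 and 2, and selects slot 3. A second call continues from the same hand position and selects slot 2, because slot 1 still holds one unit of usage credit when the hand reaches it.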
diff --git a/src/backend/storage/map/mapflush.c b/src/backend/storage/map/mapflush.c
new file mode 100644
index 0000000000..def1943dee
--- /dev/null
+++ b/src/backend/storage/map/mapflush.c
@@ -0,0 +1,665 @@
+/*-------------------------------------------------------------------------
+ *
+ * mapflush.c
+ *	  MAP checkpoint and writeback implementation.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogutils.h"
+#include "storage/map.h"
+#include "storage/map_internal.h"
+#include "storage/mapsuper_internal.h"
+#include "storage/umbra.h"
+#include "storage/umfile.h"
+
+typedef struct MapFlushWriteCache
+{
+	bool				valid;
+	RelFileLocatorBackend rlocator;
+	UmbraFileContext   *ctx;
+} MapFlushWriteCache;
+
+typedef struct MapFlushBufferTarget
+{
+	int				slot_id;
+	RelFileLocator	rnode;
+	BlockNumber		map_blkno;
+} MapFlushBufferTarget;
+
+static void MapFlushWriteCacheReset(MapFlushWriteCache *cache);
+static UmbraFileContext *MapFlushContextFor(MapFlushWriteCache *cache,
+											RelFileLocator rnode);
+static void MapFlushWritePage(RelFileLocatorBackend rlocator,
+							  UmbraFileContext *ctx,
+							  BlockNumber map_blkno,
+							  const void *page,
+							  XLogRecPtr page_lsn);
+static void MapFlushWriteSuperblockEntry(RelFileLocator rnode,
+										 MapSuperEntry *entry);
+static int MapCollectDirtyBufferTargets(MapFlushBufferTarget **targets_out,
+										const RelFileLocator *filter_rnode,
+										bool mark_checkpoint_needed);
+static int MapFlushDirtyBuffers(int max_pages, bool checkpoint);
+static int MapFlushRelationBuffers(RelFileLocator rnode, bool checkpoint);
+static int MapFlushDirtySuperblocks(void);
+static int MapFlushRelationSuperblocks(RelFileLocator rnode);
+static void MapFlushBufferCached(int slot_id, MapFlushWriteCache *write_cache,
+								 bool checkpoint);
+static bool MapTablespaceSelected(Oid spcOid, int ntablespaces,
+								  const Oid *tablespace_ids);
+static inline int map_flush_buffer_target_comparator(
+	const MapFlushBufferTarget *a,
+	const MapFlushBufferTarget *b);
+
+#define ST_SORT sort_map_flush_buffer_targets
+#define ST_ELEMENT_TYPE MapFlushBufferTarget
+#define ST_COMPARE(a, b) map_flush_buffer_target_comparator(a, b)
+#define ST_SCOPE static
+#define ST_DEFINE
+#include "lib/sort_template.h"
+
+static void
+MapFlushWriteCacheReset(MapFlushWriteCache *cache)
+{
+	if (cache == NULL || !cache->valid)
+		return;
+
+	umfile_ctx_destroy_temporary(cache->ctx);
+	cache->ctx = NULL;
+	cache->valid = false;
+	memset(&cache->rlocator, 0, sizeof(cache->rlocator));
+}
+
+static UmbraFileContext *
+MapFlushContextFor(MapFlushWriteCache *cache, RelFileLocator rnode)
+{
+	Assert(cache != NULL);
+
+	if (cache->valid && RelFileLocatorEquals(cache->rlocator.locator, rnode))
+		return cache->ctx;
+
+	MapFlushWriteCacheReset(cache);
+
+	cache->rlocator.locator = rnode;
+	cache->rlocator.backend = INVALID_PROC_NUMBER;
+	cache->ctx = umfile_ctx_create_temporary(cache->rlocator);
+	cache->valid = true;
+	return cache->ctx;
+}
+
+static void
+MapFlushWritePage(RelFileLocatorBackend rlocator, UmbraFileContext *ctx,
+				  BlockNumber map_blkno, const void *page,
+				  XLogRecPtr page_lsn)
+{
+	Assert(ctx != NULL);
+	Assert(page != NULL);
+	Assert(map_blkno != MAP_BLOCK_SUPER);
+	Assert(umfile_ctx_fork_exists(ctx, UMBRA_METADATA_FORKNUM,
+								 UMFILE_EXISTS_DENSE));
+
+	if (!InRecovery && page_lsn != InvalidXLogRecPtr)
+		XLogFlush(page_lsn);
+
+	umfile_ctx_write(ctx, UMBRA_METADATA_FORKNUM, map_blkno,
+					 page, BLCKSZ, false);
+	umfile_ctx_register_dirty(ctx, UMBRA_METADATA_FORKNUM, map_blkno,
+							  false,
+							  RelFileLocatorBackendIsTemp(rlocator));
+}
+
+static void
+MapFlushWriteSuperblockEntry(RelFileLocator rnode, MapSuperEntry *entry)
+{
+	RelFileLocatorBackend rlocator = {0};
+	char			sector[MAP_SUPERBLOCK_SIZE];
+
+	Assert(entry != NULL);
+
+	if (!InRecovery && entry->page_lsn != InvalidXLogRecPtr)
+		XLogFlush(entry->page_lsn);
+
+	rlocator.locator = rnode;
+	rlocator.backend = INVALID_PROC_NUMBER;
+
+	MapSuperblockSetLastUpdatedLSN(&entry->super, entry->page_lsn);
+	MapSuperblockRefreshCRC(&entry->super);
+	MapSuperblockPackSector(&entry->super, sector);
+	UmMetadataWriteSuperblock(rlocator, sector, false);
+}
+
+static int
+MapCollectDirtyBufferTargets(MapFlushBufferTarget **targets_out,
+							 const RelFileLocator *filter_rnode,
+							 bool mark_checkpoint_needed)
+{
+	MapFlushBufferTarget *targets;
+	int			target_cap = 256;
+	int			target_count = 0;
+
+	Assert(targets_out != NULL);
+
+	targets = palloc(sizeof(MapFlushBufferTarget) * target_cap);
+
+	for (int i = 0; i < map_buffers; i++)
+	{
+		MapBufferDesc *buf = &MapBuffers[i];
+		uint32		state;
+		int			page_number;
+		RelFileLocator slot_rnode;
+
+		LWLockAcquire(&buf->buffer_lock, LW_SHARED);
+		state = pg_atomic_read_u32(&buf->state);
+		if ((state & MAPBUF_DIRTY) == 0)
+		{
+			LWLockRelease(&buf->buffer_lock);
+			continue;
+		}
+
+		page_number = buf->page_number;
+		slot_rnode = buf->rnode;
+
+		if (page_number < 0 || page_number == MAP_BLOCK_SUPER)
+		{
+			LWLockRelease(&buf->buffer_lock);
+			continue;
+		}
+		if (filter_rnode != NULL &&
+			!RelFileLocatorEquals(slot_rnode, *filter_rnode))
+		{
+			LWLockRelease(&buf->buffer_lock);
+			continue;
+		}
+
+		if (mark_checkpoint_needed)
+			MapBufferUpdateStateBits(buf, MAPBUF_CHECKPOINT_NEEDED, 0);
+
+		LWLockRelease(&buf->buffer_lock);
+
+		if (target_count >= target_cap)
+		{
+			target_cap *= 2;
+			targets = repalloc(targets,
+							   sizeof(MapFlushBufferTarget) * target_cap);
+		}
+
+		targets[target_count].slot_id = i;
+		targets[target_count].rnode = slot_rnode;
+		targets[target_count].map_blkno = (BlockNumber) page_number;
+		target_count++;
+	}
+
+	if (target_count > 1)
+		sort_map_flush_buffer_targets(targets, target_count);
+
+	*targets_out = targets;
+	return target_count;
+}
+
+static inline int
+map_flush_buffer_target_comparator(const MapFlushBufferTarget *a,
+								   const MapFlushBufferTarget *b)
+{
+	if (a->rnode.spcOid < b->rnode.spcOid)
+		return -1;
+	else if (a->rnode.spcOid > b->rnode.spcOid)
+		return 1;
+	else if (a->rnode.dbOid < b->rnode.dbOid)
+		return -1;
+	else if (a->rnode.dbOid > b->rnode.dbOid)
+		return 1;
+	else if (a->rnode.relNumber < b->rnode.relNumber)
+		return -1;
+	else if (a->rnode.relNumber > b->rnode.relNumber)
+		return 1;
+	else if (a->map_blkno < b->map_blkno)
+		return -1;
+	else if (a->map_blkno > b->map_blkno)
+		return 1;
+
+	return 0;
+}
+
+void
+MapPreCheckpoint(void)
+{
+	/* no-op: checkpoint work is handled by MapCheckpoint(). */
+}
+
+/*
+ * MapCheckpoint - sync dirty map pages during checkpoint
+ *
+ * Scans all buffer slots and writes dirty pages to disk.
+ * Must handle concurrent access from other backends.
+ */
+void
+MapCheckpoint(void)
+{
+	/*
+	 * Checkpoint ordering: persist regular MAP pages first, then superblocks.
+	 * This keeps on-disk superblock as a checkpoint-boundary snapshot and
+	 * avoids it getting ahead of mapping-page durability.
+	 */
+	(void) MapFlushDirtyBuffers(-1, true);
+	(void) MapFlushDirtySuperblocks();
+}
+
+void
+MapCheckpointRelation(RelFileLocator rnode)
+{
+	(void) MapFlushRelationBuffers(rnode, true);
+	(void) MapFlushRelationSuperblocks(rnode);
+}
+
+void
+MapCheckpointDatabaseTablespaces(Oid dbid, int ntablespaces,
+								 const Oid *tablespace_ids)
+{
+	RelFileLocator *targets;
+	int			target_cap = 256;
+	int			target_count = 0;
+	int			i;
+
+	targets = palloc(sizeof(RelFileLocator) * target_cap);
+
+	for (i = 0; i < map_buffers; i++)
+	{
+		MapBufferDesc *buf = &MapBuffers[i];
+		uint32		state_before;
+		int			page_number;
+		RelFileLocator slot_rnode;
+
+		state_before = pg_atomic_read_u32(&buf->state);
+		if ((state_before & MAPBUF_DIRTY) == 0)
+			continue;
+
+		LWLockAcquire(&buf->buffer_lock, LW_SHARED);
+		page_number = buf->page_number;
+		slot_rnode = buf->rnode;
+		LWLockRelease(&buf->buffer_lock);
+
+		if (page_number < 0 ||
+			slot_rnode.dbOid != dbid ||
+			!MapTablespaceSelected(slot_rnode.spcOid, ntablespaces,
+								   tablespace_ids))
+			continue;
+
+		if (target_count >= target_cap)
+		{
+			target_cap *= 2;
+			targets = repalloc(targets, sizeof(RelFileLocator) * target_cap);
+		}
+		targets[target_count++] = slot_rnode;
+	}
+
+	for (i = 0; i < MapSuperCapacity; i++)
+	{
+		MapSuperEntry *entry = MapSuperEntryBySlot(i);
+		RelFileLocator rnode;
+
+		LWLockAcquire(&entry->lock, LW_SHARED);
+		if (!entry->in_use ||
+			(entry->flags & MAPSUPER_FLAG_DIRTY) == 0 ||
+			entry->key.rnode.dbOid != dbid ||
+			!MapTablespaceSelected(entry->key.rnode.spcOid, ntablespaces,
+								   tablespace_ids))
+		{
+			LWLockRelease(&entry->lock);
+			continue;
+		}
+		rnode = entry->key.rnode;
+		LWLockRelease(&entry->lock);
+
+		if (target_count >= target_cap)
+		{
+			target_cap *= 2;
+			targets = repalloc(targets, sizeof(RelFileLocator) * target_cap);
+		}
+		targets[target_count++] = rnode;
+	}
+
+	for (i = 0; i < target_count; i++)
+	{
+		int j;
+		bool seen = false;
+
+		for (j = 0; j < i; j++)
+		{
+			if (RelFileLocatorEquals(targets[j], targets[i]))
+			{
+				seen = true;
+				break;
+			}
+		}
+		if (seen)
+			continue;
+
+		MapCheckpointRelation(targets[i]);
+	}
+
+	pfree(targets);
+}
+
+void
+MapPostCheckpoint(void)
+{
+	/* no-op: checkpoint work is handled by MapCheckpoint(). */
+}
+
+int
+MapBgWriterFlush(int max_pages)
+{
+	if (max_pages <= 0)
+		return 0;
+
+	/* Non-checkpoint flushes regular MAP pages only; superblock is checkpoint-owned. */
+	return MapFlushDirtyBuffers(max_pages, false);
+}
+
+static int
+MapFlushDirtyBuffers(int max_pages, bool checkpoint)
+{
+	MapFlushBufferTarget *targets;
+	int			ntargets;
+	int			cleaned = 0;
+	MapFlushWriteCache write_cache = {0};
+
+	ntargets = MapCollectDirtyBufferTargets(&targets, NULL, checkpoint);
+
+	for (int i = 0; i < ntargets; i++)
+	{
+		MapBufferDesc *buf = &MapBuffers[targets[i].slot_id];
+		uint32		state_before;
+		uint32		state_after;
+
+		if (max_pages >= 0 && cleaned >= max_pages)
+			break;
+
+		state_before = pg_atomic_read_u32(&buf->state);
+		if ((state_before & MAPBUF_DIRTY) == 0)
+			continue;
+		if (checkpoint &&
+			(state_before & MAPBUF_CHECKPOINT_NEEDED) == 0)
+			continue;
+
+		MapFlushBufferCached(targets[i].slot_id, &write_cache, checkpoint);
+
+		state_after = pg_atomic_read_u32(&buf->state);
+		if (checkpoint)
+		{
+			if ((state_before & MAPBUF_CHECKPOINT_NEEDED) != 0 &&
+				(state_after & MAPBUF_CHECKPOINT_NEEDED) == 0)
+				cleaned++;
+		}
+		else if ((state_before & MAPBUF_DIRTY) != 0 &&
+				 (state_after & MAPBUF_DIRTY) == 0)
+			cleaned++;
+	}
+
+	MapFlushWriteCacheReset(&write_cache);
+	pfree(targets);
+
+	return cleaned;
+}
+
+static int
+MapFlushRelationBuffers(RelFileLocator rnode, bool checkpoint)
+{
+	MapFlushBufferTarget *targets;
+	int			ntargets;
+	int			cleaned = 0;
+	MapFlushWriteCache write_cache = {0};
+
+	ntargets = MapCollectDirtyBufferTargets(&targets, &rnode, checkpoint);
+
+	for (int i = 0; i < ntargets; i++)
+	{
+		MapBufferDesc *buf = &MapBuffers[targets[i].slot_id];
+		uint32		state_before;
+		uint32		state_after;
+
+		state_before = pg_atomic_read_u32(&buf->state);
+		if ((state_before & MAPBUF_DIRTY) == 0)
+			continue;
+		if (checkpoint &&
+			(state_before & MAPBUF_CHECKPOINT_NEEDED) == 0)
+			continue;
+
+		MapFlushBufferCached(targets[i].slot_id, &write_cache, checkpoint);
+
+		state_after = pg_atomic_read_u32(&buf->state);
+		if (checkpoint)
+		{
+			if ((state_before & MAPBUF_CHECKPOINT_NEEDED) != 0 &&
+				(state_after & MAPBUF_CHECKPOINT_NEEDED) == 0)
+				cleaned++;
+		}
+		else if ((state_before & MAPBUF_DIRTY) != 0 &&
+				 (state_after & MAPBUF_DIRTY) == 0)
+			cleaned++;
+	}
+
+	MapFlushWriteCacheReset(&write_cache);
+	pfree(targets);
+
+	return cleaned;
+}
+
+static int
+MapFlushDirtySuperblocks(void)
+{
+	typedef struct MapSuperDirtyTarget
+	{
+		RelFileLocator	rnode;
+	} MapSuperDirtyTarget;
+
+	MapSuperEntry *entry;
+	MapSuperDirtyTarget *targets;
+	int			target_cap = 256;
+	int			target_count;
+	bool		need_rescan;
+	int			cleaned = 0;
+
+	targets = palloc(sizeof(MapSuperDirtyTarget) * target_cap);
+
+	do
+	{
+		int			i;
+		int			slot_id;
+
+		target_count = 0;
+		need_rescan = false;
+
+		for (slot_id = 0; slot_id < MapSuperCapacity; slot_id++)
+		{
+			entry = MapSuperEntryBySlot(slot_id);
+			LWLockAcquire(&entry->lock, LW_SHARED);
+			if (entry->in_use &&
+				(entry->flags & MAPSUPER_FLAG_DIRTY) != 0)
+			{
+				if (target_count >= target_cap)
+				{
+					need_rescan = true;
+					LWLockRelease(&entry->lock);
+					break;
+				}
+				targets[target_count].rnode = entry->key.rnode;
+				target_count++;
+			}
+			LWLockRelease(&entry->lock);
+		}
+
+		for (i = 0; i < target_count; i++)
+		{
+			RelFileLocator	rnode = targets[i].rnode;
+
+			if (!MapSuperFindEntryLocked(rnode, LW_EXCLUSIVE, &entry))
+				continue;
+
+			if ((entry->flags & MAPSUPER_FLAG_DIRTY) == 0)
+			{
+				LWLockRelease(&entry->lock);
+				continue;
+			}
+
+			if (!MapSuperblockHasValidIdentity(&entry->super))
+			{
+				LWLockRelease(&entry->lock);
+				/* Raises ERROR and does not return. */
+				MapSBlockReportCorrupt(rnode,
+									   "invalid identity while flushing");
+			}
+
+			MapFlushWriteSuperblockEntry(rnode, entry);
+
+			entry->flags &= ~MAPSUPER_FLAG_DIRTY;
+			cleaned++;
+			LWLockRelease(&entry->lock);
+		}
+
+		if (need_rescan)
+		{
+			target_cap *= 2;
+			targets = repalloc(targets,
+							   sizeof(MapSuperDirtyTarget) * target_cap);
+		}
+	}
+	while (need_rescan);
+
+	pfree(targets);
+
+	return cleaned;
+}
+
+static int
+MapFlushRelationSuperblocks(RelFileLocator rnode)
+{
+	MapSuperEntry *entry;
+	int			cleaned = 0;
+
+	if (!MapSuperFindEntryLocked(rnode, LW_EXCLUSIVE, &entry))
+		return 0;
+
+	if ((entry->flags & MAPSUPER_FLAG_DIRTY) == 0)
+	{
+		LWLockRelease(&entry->lock);
+		return 0;
+	}
+
+	if (!MapSuperblockHasValidIdentity(&entry->super))
+	{
+		LWLockRelease(&entry->lock);
+		/* Raises ERROR and does not return. */
+		MapSBlockReportCorrupt(rnode, "invalid identity while flushing");
+	}
+
+	MapFlushWriteSuperblockEntry(rnode, entry);
+
+	entry->flags &= ~MAPSUPER_FLAG_DIRTY;
+	cleaned++;
+	LWLockRelease(&entry->lock);
+
+	return cleaned;
+}
+
+void
+MapFlushBuffer(int slot_id)
+{
+	MapFlushBufferCached(slot_id, NULL, false);
+}
+
+static void
+MapFlushBufferCached(int slot_id, MapFlushWriteCache *write_cache,
+					 bool checkpoint)
+{
+	int				page_number;
+	BlockNumber		map_blkno;
+	RelFileLocator	rnode;
+	RelFileLocatorBackend rlocator;
+	XLogRecPtr		page_lsn;
+	MapBufferDesc   *buf;
+	MapPage		   *page;
+	UmbraFileContext *ctx;
+
+	buf = &MapBuffers[slot_id];
+	page = MapGetPage(slot_id);
+
+	/*
+	 * First lock I/O state so only one backend writes this slot. Hold content
+	 * lock exclusively while writing, so page content and page_lsn stay in
+	 * sync for writeback.
+	 */
+	if (!MapStartBufferIO(buf,
+						  checkpoint ? MAPBUF_CHECKPOINT_NEEDED : 0))
+		return;
+
+	LWLockAcquire(&buf->buffer_lock, LW_EXCLUSIVE);
+
+	page_number = buf->page_number;
+	rnode = buf->rnode;
+	page_lsn = buf->page_lsn;
+
+	if (page_number < 0)
+	{
+		/* Defensive cleanup: invalid slot must not stay dirty. */
+		MapTerminateBufferIO(buf, true, 0);
+		LWLockRelease(&buf->buffer_lock);
+		return;
+	}
+	map_blkno = (BlockNumber) page_number;
+
+	if (map_blkno == MAP_BLOCK_SUPER)
+	{
+		/*
+		 * Superblock is managed by the dedicated superblock table and must not
+		 * be present in the regular MAP buffer cache.
+		 */
+		MapTerminateBufferIO(buf, false, MAPBUF_IO_ERROR);
+		LWLockRelease(&buf->buffer_lock);
+		elog(ERROR, "MAP superblock cannot be flushed via map buffer cache");
+	}
+
+	/*
+	 * Flush by slot owner rnode without going through smgr/umopen again.
+	 * MapReadBuffer() can call this while a data-fork read already has an AIO
+	 * handle handed out, so reopening through smgr would recurse into Umbra
+	 * map-state lookup on the read path.
+	 */
+	if (write_cache != NULL)
+	{
+		ctx = MapFlushContextFor(write_cache, rnode);
+		rlocator = write_cache->rlocator;
+	}
+	else
+	{
+		rlocator.locator = rnode;
+		rlocator.backend = INVALID_PROC_NUMBER;
+		ctx = umfile_ctx_create_temporary(rlocator);
+	}
+
+	MapFlushWritePage(rlocator, ctx, map_blkno, (char *) page, page_lsn);
+
+	MapTerminateBufferIO(buf, true, 0);
+	LWLockRelease(&buf->buffer_lock);
+
+	if (write_cache == NULL)
+		umfile_ctx_destroy_temporary(ctx);
+}
+
+static bool
+MapTablespaceSelected(Oid spcOid, int ntablespaces, const Oid *tablespace_ids)
+{
+	int			i;
+
+	if (ntablespaces <= 0 || tablespace_ids == NULL)
+		return true;
+
+	for (i = 0; i < ntablespaces; i++)
+	{
+		if (tablespace_ids[i] == spcOid)
+			return true;
+	}
+
+	return false;
+}
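The writeback path in mapflush.c follows three steps: collect dirty targets, sort them by (tablespace, database, relation, block), then make WAL durable up to each page's LSN before writing the page. The sketch below models only that ordering, under stated assumptions. `ToyTarget`, `toy_wal_flushed` and `toy_write_page` are hypothetical stand-ins; in particular, the assignment to `toy_wal_flushed` takes the place of the patch's `XLogFlush(page_lsn)` call, and no real I/O happens.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct ToyTarget
{
	uint32_t	spc;			/* tablespace */
	uint32_t	db;				/* database */
	uint32_t	rel;			/* relation */
	uint32_t	blkno;			/* map block number */
	uint64_t	page_lsn;		/* LSN of the last change to the page */
} ToyTarget;

static uint64_t toy_wal_flushed = 0;	/* WAL is durable up to this LSN */

static int
toy_cmp_u32(uint32_t a, uint32_t b)
{
	return (a > b) - (a < b);
}

/* Order by tablespace, database, relation, then block number. */
static int
toy_target_cmp(const void *pa, const void *pb)
{
	const ToyTarget *a = pa;
	const ToyTarget *b = pb;
	int			c;

	if ((c = toy_cmp_u32(a->spc, b->spc)) != 0)
		return c;
	if ((c = toy_cmp_u32(a->db, b->db)) != 0)
		return c;
	if ((c = toy_cmp_u32(a->rel, b->rel)) != 0)
		return c;
	return toy_cmp_u32(a->blkno, b->blkno);
}

/* WAL-before-data: never write a page whose LSN is not yet durable. */
static void
toy_write_page(const ToyTarget *t)
{
	if (t->page_lsn != 0 && toy_wal_flushed < t->page_lsn)
		toy_wal_flushed = t->page_lsn;	/* stands in for XLogFlush() */
	assert(toy_wal_flushed >= t->page_lsn);
	/* the actual page write would go here */
}

/* Returns 1 if the targets were written in sorted order with WAL ahead. */
static int
toy_flush_demo(void)
{
	ToyTarget	t[] = {
		{1663, 5, 20, 7, 300},
		{1663, 5, 10, 2, 500},
		{1663, 5, 10, 1, 100},
		{1600, 9, 99, 4, 200},
	};
	int			n = (int) (sizeof(t) / sizeof(t[0]));

	toy_wal_flushed = 0;
	qsort(t, n, sizeof(t[0]), toy_target_cmp);
	for (int i = 0; i < n; i++)
		toy_write_page(&t[i]);

	return t[0].spc == 1600 && t[1].blkno == 1 && t[2].blkno == 2 &&
		t[3].rel == 20 && toy_wal_flushed == 500;
}
```

Sorting groups writes for the same relation together, which is also what lets the patch reuse one temporary file context across consecutive targets instead of recreating it per page.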
diff --git a/src/backend/storage/map/mapinit.c b/src/backend/storage/map/mapinit.c
new file mode 100644
index 0000000000..a0880113ed
--- /dev/null
+++ b/src/backend/storage/map/mapinit.c
@@ -0,0 +1,143 @@
+/*-------------------------------------------------------------------------
+ *
+ * mapinit.c
+ *	  shared-memory and backend initialization for the MAP layer
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "storage/bufmgr.h"
+#include "storage/map.h"
+#include "storage/map_internal.h"
+#include "storage/mapsuper.h"
+#include "storage/mapsuper_internal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+
+/* GUCs */
+int			map_buffers = 1024;	/* slot count; reset by MapRefreshBufferSlots() */
+/*
+ * Dedicated shared-memory slots for MAP superblocks.
+ *
+ * These entries back extremely hot runtime metadata.  They are not managed as
+ * an LRU-style cache; instead they remain resident until explicit relation or
+ * database invalidation releases the slot.  Keep the default large so hot
+ * relations do not churn through repeated ensure/load cycles.
+ */
+int			map_superblocks = 262144;
+
+/* Shared memory pointer */
+MapSharedData *MapShared = NULL;
+
+/* Buffer descriptors (in shared memory; this pointer is set per process) */
+MapBufferDesc *MapBuffers = NULL;
+
+/* Actual page data (contiguous block) */
+char	   *MapPageData = NULL;
+
+static void MapShmemRequest(void *arg);
+static void MapShmemInit(void *arg);
+static void MapShmemAttach(void *arg);
+
+const ShmemCallbacks MapShmemCallbacks = {
+	.request_fn = MapShmemRequest,
+	.init_fn = MapShmemInit,
+	.attach_fn = MapShmemAttach,
+};
+
+static void
+MapRefreshBufferSlots(void)
+{
+	int computed_slots = NBuffers >> 7;
+
+	if (computed_slots < 4096)
+		computed_slots = 4096;
+
+	map_buffers = computed_slots;
+}
+
+void
+MapBackendInit(void)
+{
+	static bool initialized = false;
+
+	if (initialized)
+		return;
+
+	MapRefreshBufferSlots();
+	MapEnsurePrivateRefCount();
+	initialized = true;
+}
+
+static void
+MapShmemRequest(void *arg)
+{
+	MapRefreshBufferSlots();
+
+	ShmemRequestStruct(.name = "Map Shared Data",
+					   .size = sizeof(MapSharedData),
+					   .ptr = (void **) &MapShared,
+		);
+
+	ShmemRequestStruct(.name = "Map Buffers",
+					   .size = map_buffers * sizeof(MapBufferDesc),
+					   .ptr = (void **) &MapBuffers,
+		);
+
+	ShmemRequestStruct(.name = "Map Page Data",
+					   .size = map_buffers * BLCKSZ,
+					   .ptr = (void **) &MapPageData,
+		);
+
+	MapCacheTableShmemRequest();
+	MapSuperTableShmemRequest();
+}
+
+/*
+ * Initialize shared memory for map layer during postmaster startup.
+ */
+static void
+MapShmemInit(void *arg)
+{
+	int			i;
+
+	MapShared->num_slots = map_buffers;
+	MapShared->first_free_buffer = 0;
+	pg_atomic_init_u32(&MapShared->next_victim_buffer, 0);
+	pg_atomic_init_u32(&MapShared->num_allocs, 0);
+	MapShared->complete_passes = 0;
+	SpinLockInit(&MapShared->clock_lock);
+
+	for (i = 0; i < map_buffers; i++)
+	{
+		MapBufferDesc *buf = &MapBuffers[i];
+
+		buf->id = i;
+		buf->freeNext = (i == map_buffers - 1) ? FREENEXT_END_OF_LIST : i + 1;
+		pg_atomic_init_u32(&buf->state, 0);
+		buf->wait_backend_pid = 0;
+
+		memset(&buf->rnode, 0, sizeof(RelFileLocator));
+		buf->forknum = InvalidForkNumber;
+		buf->page_number = -1;
+		buf->page_lsn = 0;
+		LWLockInitialize(&buf->buffer_lock, LWTRANCHE_MAP_BUFFER_CONTENT);
+		LWLockInitialize(&buf->io_in_progress_lock, LWTRANCHE_MAP_BUFFER_CONTENT);
+	}
+
+	memset(MapPageData, 0, map_buffers * BLCKSZ);
+
+	MapCacheTableShmemInit();
+	MapSuperTableShmemInit();
+}
+
+static void
+MapShmemAttach(void *arg)
+{
+	Assert(MapShared != NULL);
+	Assert(MapBuffers != NULL);
+	Assert(MapPageData != NULL);
+	Assert(MapShared->num_slots == map_buffers);
+
+	MapSuperTableShmemAttach();
+}
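As a worked example of the sizing and free-list setup in mapinit.c: the slot count is derived as NBuffers >> 7 with a floor of 4096, page storage is that count times the block size, and every slot starts on a singly linked free list headed at slot 0. The sketch below reproduces the arithmetic only. `TOY_BLCKSZ` (8192) and `TOY_FREENEXT_END` (-1) are assumed values for illustration; the patch's actual `BLCKSZ` depends on the build and the value of `FREENEXT_END_OF_LIST` is not shown in this part of the diff.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BLCKSZ			8192	/* assumed block size */
#define TOY_MIN_SLOTS		4096	/* floor used by MapRefreshBufferSlots() */
#define TOY_FREENEXT_END	(-1)	/* assumed end-of-list marker */

/* One map slot per 128 shared buffers, but never fewer than the floor. */
static int
toy_map_slots(int nbuffers)
{
	int			slots = nbuffers >> 7;

	return slots < TOY_MIN_SLOTS ? TOY_MIN_SLOTS : slots;
}

/* Bytes of page storage requested for a given slot count. */
static size_t
toy_page_bytes(int nslots)
{
	return (size_t) nslots * TOY_BLCKSZ;
}

/* Chain slot i to slot i + 1; the last slot terminates the list. */
static void
toy_init_free_list(int *free_next, int nslots)
{
	for (int i = 0; i < nslots; i++)
		free_next[i] = (i == nslots - 1) ? TOY_FREENEXT_END : i + 1;
}

/* Walk the list from head slot 0 and count its entries. */
static int
toy_free_list_len(int nslots)
{
	int			free_next[16];
	int			len = 0;

	assert(nslots > 0 && nslots <= 16);
	toy_init_free_list(free_next, nslots);
	for (int i = 0; i != TOY_FREENEXT_END; i = free_next[i])
		len++;
	return len;
}
```

With these assumptions, 16384 shared buffers (128 MB at 8 kB pages) yields 128 computed slots and is raised to the 4096 floor, which corresponds to 32 MB of map page storage; 1048576 shared buffers yields 8192 slots.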
diff --git a/src/backend/storage/map/mapsuper.c b/src/backend/storage/map/mapsuper.c
index b376d513fd..cf8bde182e 100644
--- a/src/backend/storage/map/mapsuper.c
+++ b/src/backend/storage/map/mapsuper.c
@@ -1,22 +1,98 @@
 /*-------------------------------------------------------------------------
  *
  * mapsuper.c
- *	  Umbra metadata superblock helpers.
- *
- * This file contains on-disk superblock encoding and direct metadata-file I/O
- * helpers.
- *
- * src/backend/storage/map/mapsuper.c
+ *	  MAP superblock metadata helpers.
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"

+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "common/hashfn.h"
+#include "miscadmin.h"
 #include "storage/map.h"
 #include "storage/mapsuper.h"
-#include "storage/umbra.h"
+#include "storage/mapsuper_internal.h"
+#include "storage/shmem.h"
+
+#define MAP_SUPER_NPARTITIONS		128
+#define MAP_SUPER_NPARTITION_BITS	7
+#define MAPSUPER_INDEX_EMPTY		(-1)
+#define MAPSUPER_INDEX_DELETED		(-2)
+#define MAPSUPER_FREENEXT_END		(-1)
+#define MAPSUPER_FREENEXT_NOT_IN_LIST (-2)
+
+#if MAP_SUPER_NPARTITIONS != (1 << MAP_SUPER_NPARTITION_BITS)
+#error "MAP_SUPER_NPARTITIONS must match MAP_SUPER_NPARTITION_BITS"
+#endif

-static void MapSBlockReportCorrupt(SMgrRelation reln, const char *reason);
+typedef struct MapSuperIndexSlot
+{
+	int			slot_id;
+} MapSuperIndexSlot;
+
+typedef struct MapSuperCtl
+{
+	int			free_head;
+	slock_t		free_list_lock;
+} MapSuperCtl;
+
+typedef enum MapSBlockReadStatus
+{
+	MAP_SBLOCK_READ_OK,
+	MAP_SBLOCK_READ_MISSING,
+	MAP_SBLOCK_READ_CORRUPT
+} MapSBlockReadStatus;
+
+MapSuperEntry *MapSuperEntries = NULL;
+int			MapSuperCapacity = 0;
+
+static MapSuperIndexSlot *MapSuperIndex = NULL;
+static MapSuperCtl *MapSuperCtlData = NULL;
+static LWLockPadded *MapSuperPartitionLocks = NULL;
+static int	MapSuperIndexCapacityPerPartition = 0;
+
+static void MapSuperTableRefreshDerivedState(void);
+static MapSBlockReadStatus MapSuperLoadFromDisk(UmbraFileContext *map_ctx,
+												RelFileLocator rnode,
+												MapSuperblock *super);
+static int	MapSuperIndexCapacityForPartition(int capacity);
+static uint32 MapSuperHashCode(RelFileLocator rnode);
+static int	MapSuperPartitionForHash(uint32 hashcode);
+static LWLock *MapSuperPartitionLock(uint32 hashcode);
+static int	MapSuperLookupSlotLocked(RelFileLocator rnode, uint32 hashcode,
+									 int partition, int *insert_bucket);
+static bool MapForkUsesAbsentSentinel(ForkNumber forknum);
+static uint32 MapSuperExtendingFlag(ForkNumber forknum);
+static BlockNumber MapSuperGetExtendingTarget(const MapSuperEntry *entry,
+											  ForkNumber forknum);
+static void MapSuperSetExtendingTarget(MapSuperEntry *entry,
+									   ForkNumber forknum,
+									   BlockNumber nblocks);
+static bool MapSuperPrepareEntryForUpdate(UmbraFileContext *map_ctx,
+										  RelFileLocator rnode,
+										  XLogRecPtr map_lsn,
+										  const char *missing_errmsg,
+										  MapSuperEntry **entry_p);
+static void MapSBlockUpdateLogicalNblocks(UmbraFileContext *map_ctx,
+										  RelFileLocator rnode,
+										  ForkNumber forknum,
+										  BlockNumber nblocks,
+										  XLogRecPtr map_lsn,
+										  bool bump_only);
+static void MapSBlockSetPendingFlag(UmbraFileContext *map_ctx,
+									RelFileLocator rnode,
+									bool pending,
+									XLogRecPtr map_lsn);
+void MapSBlockBumpPhysicalState(UmbraFileContext *map_ctx,
+								RelFileLocator rnode,
+								ForkNumber forknum,
+								BlockNumber nblocks,
+								bool bump_next_free,
+								bool bump_capacity,
+								XLogRecPtr map_lsn);

void
MapSuperblockRefreshCRC(MapSuperblock *super)
@@ -251,88 +327,1171 @@ MapSuperblockSetLogicalNblocks(MapSuperblock *super, ForkNumber forknum,
}

 void
-MapSuperblockPackPage(const MapSuperblock *super, char page[BLCKSZ])
+MapSuperblockPackSector(const MapSuperblock *super, char sector[MAP_SUPERBLOCK_SIZE])
 {
 	Assert(super != NULL);
-	Assert(page != NULL);
+	Assert(sector != NULL);

-	MemSet(page, 0, BLCKSZ);
-	memcpy(page, super->padding, MAP_SUPERBLOCK_SIZE);
+	memcpy(sector, super->padding, MAP_SUPERBLOCK_SIZE);
 }

 void
-MapSuperblockUnpackPage(MapSuperblock *super, const char page[BLCKSZ])
+MapSuperblockUnpackSector(MapSuperblock *super,
+						  const char sector[MAP_SUPERBLOCK_SIZE])
 {
 	Assert(super != NULL);
-	Assert(page != NULL);
+	Assert(sector != NULL);
+
+	memcpy(super->padding, sector, MAP_SUPERBLOCK_SIZE);
+}
+
+void
+MapSBlockReportCorrupt(RelFileLocator rnode, const char *reason)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_DATA_CORRUPTED),
+			 errmsg("map superblock is corrupted for relation %u/%u/%u: %s",
+					rnode.spcOid, rnode.dbOid, rnode.relNumber, reason)));
+}
+
+static MapSBlockReadStatus
+MapSuperLoadFromDisk(UmbraFileContext *map_ctx, RelFileLocator rnode,
+					 MapSuperblock *super)
+{
+	char		sector[MAP_SUPERBLOCK_SIZE];
+
+	if (!umfile_ctx_fork_exists(map_ctx, UMBRA_METADATA_FORKNUM,
+								UMFILE_EXISTS_DENSE))
+		return MAP_SBLOCK_READ_MISSING;
+
+	umfile_ctx_read(map_ctx, UMBRA_METADATA_FORKNUM, MAP_BLOCK_SUPER,
+					sector, MAP_SUPERBLOCK_SIZE);
+	MapSuperblockUnpackSector(super, sector);
+
+	if (!MapSuperblockHasValidIdentity(super) ||
+		!MapSuperblockCheckCRC(super))
+		return MAP_SBLOCK_READ_CORRUPT;
+
+	return MAP_SBLOCK_READ_OK;
+}
+
+static uint32
+MapSuperHashCode(RelFileLocator rnode)
+{
+	return DatumGetUInt32(hash_any((const unsigned char *) &rnode,
+								   sizeof(RelFileLocator)));
+}
+
+static int
+MapSuperIndexCapacityForPartition(int capacity)
+{
+	int			index_capacity = 1;
+	long		total_target = (long) capacity * 2L;
+	long		per_partition_target;
+
+	per_partition_target =
+		(total_target + MAP_SUPER_NPARTITIONS - 1) / MAP_SUPER_NPARTITIONS;
+	while ((long) index_capacity < per_partition_target)
+		index_capacity <<= 1;
+
+	return index_capacity;
+}
+
+static int
+MapSuperPartitionForHash(uint32 hashcode)
+{
+	return hashcode & (MAP_SUPER_NPARTITIONS - 1);
+}
+
+static LWLock *
+MapSuperPartitionLock(uint32 hashcode)
+{
+	return &MapSuperPartitionLocks[MapSuperPartitionForHash(hashcode)].lock;
+}
+
+static int
+MapSuperLookupSlotLocked(RelFileLocator rnode, uint32 hashcode, int partition,
+						 int *insert_bucket)
+{
+	int			mask = MapSuperIndexCapacityPerPartition - 1;
+	int			base = partition * MapSuperIndexCapacityPerPartition;
+	int			bucket = (hashcode >> MAP_SUPER_NPARTITION_BITS) & mask;
+	int			first_deleted = -1;
+	int			probes;
+	LWLock	   *partition_lock PG_USED_FOR_ASSERTS_ONLY = MapSuperPartitionLock(hashcode);
+
+	Assert(LWLockHeldByMe(partition_lock));
+
+	for (probes = 0; probes < MapSuperIndexCapacityPerPartition; probes++)
+	{
+		int			slot_id = MapSuperIndex[base + bucket].slot_id;
+
+		if (slot_id == MAPSUPER_INDEX_EMPTY)
+		{
+			if (insert_bucket != NULL)
+				*insert_bucket = (first_deleted >= 0) ?
+					(base + first_deleted) : (base + bucket);
+			return -1;
+		}
+
+		if (slot_id == MAPSUPER_INDEX_DELETED)
+		{
+			if (first_deleted < 0)
+				first_deleted = bucket;
+		}
+		else
+		{
+			MapSuperEntry *entry = MapSuperEntryBySlot(slot_id);
+
+			if (entry->in_use && RelFileLocatorEquals(entry->key.rnode, rnode))
+			{
+				if (insert_bucket != NULL)
+					*insert_bucket = base + bucket;
+				return slot_id;
+			}
+		}
+
+		bucket = (bucket + 1) & mask;
+	}
+
+	if (insert_bucket != NULL)
+		*insert_bucket = (first_deleted >= 0) ? (base + first_deleted) : -1;

-	memcpy(super->padding, page, MAP_SUPERBLOCK_SIZE);
+	return -1;
 }

 bool
-MapSBlockRead(SMgrRelation reln, MapSuperblock *super)
+MapSuperFindEntryLocked(RelFileLocator rnode, LWLockMode mode,
+						MapSuperEntry **entry)
+{
+	uint32		hashcode;
+	int			partition;
+	int			slot_id;
+	LWLock	   *partition_lock;
+
+	hashcode = MapSuperHashCode(rnode);
+	partition = MapSuperPartitionForHash(hashcode);
+	partition_lock = &MapSuperPartitionLocks[partition].lock;
+
+	LWLockAcquire(partition_lock, LW_SHARED);
+	slot_id = MapSuperLookupSlotLocked(rnode, hashcode, partition, NULL);
+	if (slot_id >= 0)
+	{
+		*entry = MapSuperEntryBySlot(slot_id);
+		LWLockAcquire(&(*entry)->lock, mode);
+		LWLockRelease(partition_lock);
+		return true;
+	}
+
+	LWLockRelease(partition_lock);
+	*entry = NULL;
+	return false;
+}
+
+bool
+MapSuperFindEntryTryLocked(RelFileLocator rnode, LWLockMode mode,
+						   MapSuperEntry **entry)
+{
+	uint32		hashcode;
+	int			partition;
+	int			slot_id;
+	LWLock	   *partition_lock;
+
+	hashcode = MapSuperHashCode(rnode);
+	partition = MapSuperPartitionForHash(hashcode);
+	partition_lock = &MapSuperPartitionLocks[partition].lock;
+
+	LWLockAcquire(partition_lock, LW_SHARED);
+	slot_id = MapSuperLookupSlotLocked(rnode, hashcode, partition, NULL);
+	if (slot_id >= 0)
+	{
+		*entry = MapSuperEntryBySlot(slot_id);
+		if (!LWLockConditionalAcquire(&(*entry)->lock, mode))
+		{
+			LWLockRelease(partition_lock);
+			*entry = NULL;
+			return false;
+		}
+		LWLockRelease(partition_lock);
+		return true;
+	}
+
+	LWLockRelease(partition_lock);
+	*entry = NULL;
+	return false;
+}
+
+MapSuperEntry *
+MapSuperEnsureEntryLocked(RelFileLocator rnode)
+{
+	MapSuperEntry *entry;
+	uint32		hashcode;
+	int			partition;
+	int			slot_id;
+	int			insert_bucket = -1;
+	LWLock	   *partition_lock;
+
+	hashcode = MapSuperHashCode(rnode);
+	partition = MapSuperPartitionForHash(hashcode);
+	partition_lock = &MapSuperPartitionLocks[partition].lock;
+
+	LWLockAcquire(partition_lock, LW_EXCLUSIVE);
+	slot_id = MapSuperLookupSlotLocked(rnode, hashcode, partition, &insert_bucket);
+	if (slot_id >= 0)
+	{
+		entry = MapSuperEntryBySlot(slot_id);
+		LWLockAcquire(&entry->lock, LW_EXCLUSIVE);
+		LWLockRelease(partition_lock);
+		return entry;
+	}
+
+	if (insert_bucket < 0)
+	{
+		LWLockRelease(partition_lock);
+		ereport(ERROR,
+				(errmsg("map superblock index table is full"),
+				 errhint("Increase map_superblocks and restart the server.")));
+	}
+
+	SpinLockAcquire(&MapSuperCtlData->free_list_lock);
+	slot_id = MapSuperCtlData->free_head;
+	if (slot_id == MAPSUPER_FREENEXT_END)
+	{
+		SpinLockRelease(&MapSuperCtlData->free_list_lock);
+		LWLockRelease(partition_lock);
+		ereport(ERROR,
+				(errmsg("map superblock slot table is full"),
+				 errhint("Increase map_superblocks and restart the server.")));
+	}
+
+	entry = MapSuperEntryBySlot(slot_id);
+	MapSuperCtlData->free_head = entry->next_free;
+	SpinLockRelease(&MapSuperCtlData->free_list_lock);
+
+	entry->next_free = MAPSUPER_FREENEXT_NOT_IN_LIST;
+	entry->in_use = true;
+	entry->key.rnode = rnode;
+	MemSet(&entry->super, 0, sizeof(entry->super));
+	entry->page_lsn = InvalidXLogRecPtr;
+	entry->flags = 0;
+	entry->runtime_flags = 0;
+	entry->reserved_next_free_main = 0;
+	entry->reserved_next_free_fsm = 0;
+	entry->reserved_next_free_vm = 0;
+	entry->extending_target_main = InvalidBlockNumber;
+	entry->extending_target_fsm = InvalidBlockNumber;
+	entry->extending_target_vm = InvalidBlockNumber;
+	MapSuperIndex[insert_bucket].slot_id = slot_id;
+
+	LWLockAcquire(&entry->lock, LW_EXCLUSIVE);
+	LWLockRelease(partition_lock);
+
+	return entry;
+}
+
+void
+MapSuperDeleteEntry(RelFileLocator rnode)
 {
-	char		page[BLCKSZ];
+	MapSuperEntry *entry = NULL;
+	uint32		hashcode;
+	int			partition;
+	int			slot_id;
+	int			bucket = -1;
+	LWLock	   *partition_lock;
+
+	hashcode = MapSuperHashCode(rnode);
+	partition = MapSuperPartitionForHash(hashcode);
+	partition_lock = &MapSuperPartitionLocks[partition].lock;

-	Assert(reln != NULL);
+	LWLockAcquire(partition_lock, LW_EXCLUSIVE);
+	slot_id = MapSuperLookupSlotLocked(rnode, hashcode, partition, &bucket);
+	if (slot_id >= 0)
+	{
+		entry = MapSuperEntryBySlot(slot_id);
+		LWLockAcquire(&entry->lock, LW_EXCLUSIVE);
+		entry->flags = 0;
+		entry->runtime_flags = 0;
+		entry->page_lsn = InvalidXLogRecPtr;
+		entry->reserved_next_free_main = 0;
+		entry->reserved_next_free_fsm = 0;
+		entry->reserved_next_free_vm = 0;
+		entry->extending_target_main = InvalidBlockNumber;
+		entry->extending_target_fsm = InvalidBlockNumber;
+		entry->extending_target_vm = InvalidBlockNumber;
+		entry->in_use = false;
+		SpinLockAcquire(&MapSuperCtlData->free_list_lock);
+		entry->next_free = MapSuperCtlData->free_head;
+		MapSuperCtlData->free_head = slot_id;
+		SpinLockRelease(&MapSuperCtlData->free_list_lock);
+		LWLockRelease(&entry->lock);
+		MapSuperIndex[bucket].slot_id = MAPSUPER_INDEX_DELETED;
+	}
+	LWLockRelease(partition_lock);
+}
+
+static MapSBlockReadStatus
+MapSBlockRead(UmbraFileContext *map_ctx, RelFileLocator rnode, MapSuperblock *super)
+{
+	MapSuperEntry *entry;
+	MapSBlockReadStatus status = MAP_SBLOCK_READ_OK;
+	MapSuperblock	disk_super;
+
+	Assert(map_ctx != NULL);
 	Assert(super != NULL);

-	if (!UmMetadataExists(reln))
-		return false;
+	if (!MapSuperFindEntryLocked(rnode, LW_SHARED, &entry))
+	{
+		status = MapSuperLoadFromDisk(map_ctx, rnode, &disk_super);
+		if (status == MAP_SBLOCK_READ_MISSING)
+			return MAP_SBLOCK_READ_MISSING;
+
+		entry = MapSuperEnsureEntryLocked(rnode);
+		if ((entry->flags & MAPSUPER_FLAG_VALID) == 0)
+		{
+			if (status == MAP_SBLOCK_READ_OK)
+			{
+				entry->super = disk_super;
+				entry->page_lsn = MapSuperblockGetLastUpdatedLSN(&disk_super);
+				entry->flags = MAPSUPER_FLAG_VALID;
+				MapSuperResetReservedNextFrees(entry);
+			}
+			else
+			{
+				MapSuperblockInit(&entry->super, 0);
+				entry->page_lsn = InvalidXLogRecPtr;
+				entry->flags = MAPSUPER_FLAG_VALID | MAPSUPER_FLAG_CORRUPT;
+				MapSuperResetReservedNextFrees(entry);
+			}
+		}
+		else if (entry->flags & MAPSUPER_FLAG_CORRUPT)
+			status = MAP_SBLOCK_READ_CORRUPT;
+		else
+			status = MAP_SBLOCK_READ_OK;
+	}
+	else if ((entry->flags & MAPSUPER_FLAG_VALID) == 0)
+	{
+		LWLockRelease(&entry->lock);
+		status = MapSuperLoadFromDisk(map_ctx, rnode, &disk_super);
+		if (status == MAP_SBLOCK_READ_MISSING)
+			return MAP_SBLOCK_READ_MISSING;
+
+		entry = MapSuperEnsureEntryLocked(rnode);
+		if ((entry->flags & MAPSUPER_FLAG_VALID) == 0)
+		{
+			if (status == MAP_SBLOCK_READ_OK)
+			{
+				entry->super = disk_super;
+				entry->page_lsn = MapSuperblockGetLastUpdatedLSN(&disk_super);
+				entry->flags = MAPSUPER_FLAG_VALID;
+				MapSuperResetReservedNextFrees(entry);
+			}
+			else
+			{
+				MapSuperblockInit(&entry->super, 0);
+				entry->page_lsn = InvalidXLogRecPtr;
+				entry->flags = MAPSUPER_FLAG_VALID | MAPSUPER_FLAG_CORRUPT;
+				MapSuperResetReservedNextFrees(entry);
+			}
+		}
+		else if (entry->flags & MAPSUPER_FLAG_CORRUPT)
+			status = MAP_SBLOCK_READ_CORRUPT;
+		else
+			status = MAP_SBLOCK_READ_OK;
+	}
+	else
+	{
+		/*
+		 * Once a superblock is loaded into a valid shared entry, hot reads
+		 * should consume that runtime state directly. Disk identity/CRC
+		 * validation belongs to the slow path that populates shared state.
+		 */
+		*super = entry->super;
+		status = (entry->flags & MAPSUPER_FLAG_CORRUPT) ?
+			MAP_SBLOCK_READ_CORRUPT : MAP_SBLOCK_READ_OK;
+		LWLockRelease(&entry->lock);
+		return status;
+	}
+
+	switch (status)
+	{
+		case MAP_SBLOCK_READ_OK:
+			break;
+		case MAP_SBLOCK_READ_MISSING:
+			LWLockRelease(&entry->lock);
+			return MAP_SBLOCK_READ_MISSING;
+		case MAP_SBLOCK_READ_CORRUPT:
+			LWLockRelease(&entry->lock);
+			return MAP_SBLOCK_READ_CORRUPT;
+	}
+
+	*super = entry->super;
+	LWLockRelease(&entry->lock);
+	return MAP_SBLOCK_READ_OK;
+}
+
+bool
+MapForkHasMappedState(ForkNumber forknum)
+{
+	switch (forknum)
+	{
+		case MAIN_FORKNUM:
+		case FSM_FORKNUM:
+		case VISIBILITYMAP_FORKNUM:
+			return true;
+		default:
+			return false;
+	}
+}
+
+static bool
+MapForkUsesAbsentSentinel(ForkNumber forknum)
+{
+	switch (forknum)
+	{
+		case FSM_FORKNUM:
+		case VISIBILITYMAP_FORKNUM:
+			return true;
+		default:
+			return false;
+	}
+}

-	if (UmMetadataNblocks(reln) == 0)
+BlockNumber
+MapNormalizeForkBlockCount(ForkNumber forknum, BlockNumber raw)
+{
+	if (MapForkUsesAbsentSentinel(forknum) &&
+		raw == InvalidBlockNumber)
+		return 0;
+
+	return raw;
+}
+
+bool
+MapSuperForkExists(const MapSuperblock *super, ForkNumber forknum)
+{
+	if (!MapForkHasMappedState(forknum))
 		return false;

-	UmMetadataRead(reln, MAP_BLOCK_SUPER, page);
-	MapSuperblockUnpackPage(super, page);
+	if (!MapForkUsesAbsentSentinel(forknum))
+		return true;

-	if (!MapSuperblockHasValidIdentity(super))
-		MapSBlockReportCorrupt(reln, "invalid identity");
-	if (!MapSuperblockCheckCRC(super))
-		MapSBlockReportCorrupt(reln, "CRC mismatch");
+	return MapSuperblockGetLogicalNblocks(super, forknum) != InvalidBlockNumber;
+}
+
+
+static uint32
+MapSuperExtendingFlag(ForkNumber forknum)
+{
+	switch (forknum)
+	{
+		case MAIN_FORKNUM:
+			return MAPSUPER_RUNTIME_FLAG_EXTENDING_MAIN;
+		case FSM_FORKNUM:
+			return MAPSUPER_RUNTIME_FLAG_EXTENDING_FSM;
+		case VISIBILITYMAP_FORKNUM:
+			return MAPSUPER_RUNTIME_FLAG_EXTENDING_VM;
+		default:
+			return 0;
+	}
+}

+static BlockNumber
+MapSuperGetExtendingTarget(const MapSuperEntry *entry, ForkNumber forknum)
+{
+	Assert(entry != NULL);
+
+	switch (forknum)
+	{
+		case MAIN_FORKNUM:
+			return entry->extending_target_main;
+		case FSM_FORKNUM:
+			return entry->extending_target_fsm;
+		case VISIBILITYMAP_FORKNUM:
+			return entry->extending_target_vm;
+		default:
+			return InvalidBlockNumber;
+	}
+}
+
+static void
+MapSuperSetExtendingTarget(MapSuperEntry *entry, ForkNumber forknum,
+						   BlockNumber nblocks)
+{
+	Assert(entry != NULL);
+
+	switch (forknum)
+	{
+		case MAIN_FORKNUM:
+			entry->extending_target_main = nblocks;
+			break;
+		case FSM_FORKNUM:
+			entry->extending_target_fsm = nblocks;
+			break;
+		case VISIBILITYMAP_FORKNUM:
+			entry->extending_target_vm = nblocks;
+			break;
+		default:
+			elog(ERROR, "unsupported fork number for extend target: %d", forknum);
+	}
+}
+
+
+
+
+
+static bool
+MapSuperPrepareEntryForUpdate(UmbraFileContext *map_ctx, RelFileLocator rnode,
+							  XLogRecPtr map_lsn, const char *missing_errmsg,
+							  MapSuperEntry **entry_p)
+{
+	MapSuperEntry *entry;
+	uint32			flags;
+
+	Assert(map_ctx != NULL);
+	Assert(entry_p != NULL);
+
+	if (!MapSuperFindEntryLocked(rnode, LW_EXCLUSIVE, &entry))
+	{
+		MapSuperblock	disk_super;
+		MapSBlockReadStatus status;
+
+		status = MapSuperLoadFromDisk(map_ctx, rnode, &disk_super);
+		if (status == MAP_SBLOCK_READ_MISSING)
+		{
+			if (InRecovery)
+				return false;
+			elog(ERROR, "%s", missing_errmsg);
+		}
+
+		entry = MapSuperEnsureEntryLocked(rnode);
+		if ((entry->flags & MAPSUPER_FLAG_VALID) == 0)
+		{
+			if (status == MAP_SBLOCK_READ_OK)
+			{
+				entry->super = disk_super;
+				entry->page_lsn = MapSuperblockGetLastUpdatedLSN(&disk_super);
+				entry->flags = MAPSUPER_FLAG_VALID;
+			}
+			else
+			{
+				MapSuperblockInit(&entry->super, 0);
+				entry->page_lsn = InvalidXLogRecPtr;
+				entry->flags = MAPSUPER_FLAG_VALID | MAPSUPER_FLAG_CORRUPT;
+			}
+		}
+	}
+
+	flags = entry->flags;
+
+	if ((flags & MAPSUPER_FLAG_CORRUPT) ||
+		!MapSuperblockHasValidIdentity(&entry->super) ||
+		((flags & MAPSUPER_FLAG_DIRTY) == 0 &&
+		 !MapSuperblockCheckCRC(&entry->super)))
+	{
+		if (!InRecovery || map_lsn == InvalidXLogRecPtr)
+			MapSBlockReportCorrupt(rnode, "invalid identity or CRC");
+
+		/*
+		 * Update paths rebuild superblock state from WAL-backed metadata.
+		 * Never continue from an untrusted superblock image.
+		 */
+		MapSuperblockInit(&entry->super, 0);
+		entry->flags = MAPSUPER_FLAG_VALID;
+	}
+
+	*entry_p = entry;
 	return true;
 }

+static void
+MapSBlockUpdateLogicalNblocks(UmbraFileContext *map_ctx, RelFileLocator rnode,
+							  ForkNumber forknum, BlockNumber nblocks,
+							  XLogRecPtr map_lsn, bool bump_only)
+{
+	MapSuperEntry *entry;
+	BlockNumber		current;
+
+	if (!MapForkHasMappedState(forknum))
+		return;
+
+	if (!MapSuperPrepareEntryForUpdate(map_ctx, rnode, map_lsn,
+									   "MAP fork is missing while updating superblock",
+									   &entry))
+		return;
+
+	current = MapSuperblockGetLogicalNblocks(&entry->super, forknum);
+	current = MapNormalizeForkBlockCount(forknum, current);
+	if (!bump_only || current < nblocks)
+		MapSuperblockSetLogicalNblocks(&entry->super, forknum, nblocks);
+
+	if (!bump_only || current < nblocks)
+	{
+		if (map_lsn == InvalidXLogRecPtr)
+		{
+			if (InRecovery)
+				map_lsn = GetXLogReplayRecPtr(NULL);
+			else
+				map_lsn = GetXLogWriteRecPtr();
+		}
+		MapSuperblockSetLastUpdatedLSN(&entry->super, map_lsn);
+		entry->page_lsn = map_lsn;
+		entry->flags |= MAPSUPER_FLAG_DIRTY;
+	}
+
+	LWLockRelease(&entry->lock);
+}
+
+static void
+MapSBlockSetPendingFlag(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						bool pending, XLogRecPtr map_lsn)
+{
+	MapSuperEntry *entry;
+	uint32		super_flags;
+
+	if (!MapSuperPrepareEntryForUpdate(map_ctx, rnode, map_lsn,
+									   "MAP fork is missing while updating superblock state",
+									   &entry))
+		return;
+
+	super_flags = MapSuperblockGetFlags(&entry->super);
+	if (pending)
+		super_flags |= MAP_SUPERBLOCK_FLAG_SKIP_WAL_PENDING;
+	else
+		super_flags &= ~MAP_SUPERBLOCK_FLAG_SKIP_WAL_PENDING;
+
+	if (super_flags != MapSuperblockGetFlags(&entry->super))
+	{
+		if (map_lsn == InvalidXLogRecPtr)
+		{
+			if (InRecovery)
+				map_lsn = GetXLogReplayRecPtr(NULL);
+			else
+				map_lsn = GetXLogWriteRecPtr();
+		}
+
+		MapSuperblockSetFlags(&entry->super, super_flags);
+		MapSuperblockSetLastUpdatedLSN(&entry->super, map_lsn);
+		entry->page_lsn = map_lsn;
+		entry->flags |= MAPSUPER_FLAG_DIRTY;
+	}
+
+	LWLockRelease(&entry->lock);
+}
+
 void
-MapSBlockWrite(SMgrRelation reln, const MapSuperblock *super, bool skipFsync)
+MapSBlockBumpPhysicalState(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						   ForkNumber forknum, BlockNumber nblocks,
+						   bool bump_next_free, bool bump_capacity,
+						   XLogRecPtr map_lsn)
 {
-	MapSuperblock write_super;
-	char		page[BLCKSZ];
+	MapSuperEntry *entry;
+	BlockNumber		current_next;
+	BlockNumber		current_capacity;
+	bool			changed = false;

-	Assert(reln != NULL);
-	Assert(super != NULL);
+	if (!MapForkHasMappedState(forknum))
+		return;
+
+	if (!MapSuperPrepareEntryForUpdate(map_ctx, rnode, map_lsn,
+									   "MAP fork is missing while updating superblock",
+									   &entry))
+		return;
+
+	current_next = MapSuperblockGetNextFreePhysBlock(&entry->super, forknum);
+	current_capacity = MapSuperblockGetPhysCapacity(&entry->super, forknum);
+	current_next = MapNormalizeForkBlockCount(forknum, current_next);
+	current_capacity = MapNormalizeForkBlockCount(forknum, current_capacity);
+
+	if (bump_next_free && current_next < nblocks)
+	{
+		MapSuperblockSetNextFreePhysBlock(&entry->super, forknum, nblocks);
+		if (InRecovery)
+			changed = true;
+	}
+	if (bump_capacity && current_capacity < nblocks)
+	{
+		MapSuperblockSetPhysCapacity(&entry->super, forknum, nblocks);
+		changed = true;
+	}
+
+	if (changed)
+	{
+		if (map_lsn == InvalidXLogRecPtr)
+		{
+			if (InRecovery)
+				map_lsn = GetXLogReplayRecPtr(NULL);
+			else
+				map_lsn = GetXLogWriteRecPtr();
+		}
+		MapSuperblockSetLastUpdatedLSN(&entry->super, map_lsn);
+		entry->page_lsn = map_lsn;
+		entry->flags |= MAPSUPER_FLAG_DIRTY;
+	}
+
+	LWLockRelease(&entry->lock);
+}
+
+bool
+MapSBlockEnsurePhysicalNblocks(UmbraFileContext *map_ctx, RelFileLocator rnode,
+							   ForkNumber forknum, BlockNumber nblocks,
+							   bool skipFsync)
+{
+	MapSuperEntry *entry;
+	uint32		extend_flag;
+	BlockNumber	current;
+	BlockNumber	desired;
+
+	if (!MapForkHasMappedState(forknum))
+		return false;
+
+	if (nblocks == 0)
+		return true;
+
+	if (!MapSBlockEnsureLoaded(map_ctx, rnode))
+		return false;
+
+	extend_flag = MapSuperExtendingFlag(forknum);
+	Assert(extend_flag != 0);
+
+retry:
+	if (!MapSuperPrepareEntryForUpdate(map_ctx, rnode, InvalidXLogRecPtr,
+									   "MAP fork is missing while materializing physical blocks",
+									   &entry))
+		return false;
+
+	current = MapSuperblockGetPhysCapacity(&entry->super, forknum);
+	current = MapNormalizeForkBlockCount(forknum, current);
+	if (current >= nblocks)
+	{
+		LWLockRelease(&entry->lock);
+		return true;
+	}
+
+	if ((entry->runtime_flags & extend_flag) != 0)
+	{
+		if (MapSuperGetExtendingTarget(entry, forknum) < nblocks)
+			MapSuperSetExtendingTarget(entry, forknum, nblocks);
+		LWLockRelease(&entry->lock);
+		pg_usleep(1000L);
+
+		CHECK_FOR_INTERRUPTS();
+		goto retry;
+	}
+
+	entry->runtime_flags |= extend_flag;
+	MapSuperSetExtendingTarget(entry, forknum, nblocks);
+	LWLockRelease(&entry->lock);
+
+	PG_TRY();
+	{
+		desired = nblocks;
+
+		for (;;)
+		{
+			BlockNumber blk;
+
+			for (blk = current; blk < desired; blk++)
+			{
+				if (!umfile_ctx_block_exists(map_ctx, forknum, blk))
+					umfile_zeroextend(map_ctx, forknum, blk, 1, skipFsync);
+			}
+
+			if (!MapSuperPrepareEntryForUpdate(map_ctx, rnode, InvalidXLogRecPtr,
+											   "MAP fork is missing while materializing physical blocks",
+											   &entry))
+				elog(ERROR,
+					 "MAP fork disappeared while materializing relation %u/%u/%u fork %d",
+					 rnode.spcOid, rnode.dbOid, rnode.relNumber, forknum);
+
+			current = MapSuperblockGetPhysCapacity(&entry->super, forknum);
+			current = MapNormalizeForkBlockCount(forknum, current);
+			if (current < desired)
+			{
+				XLogRecPtr	map_lsn;
+
+				if (InRecovery)
+					map_lsn = GetXLogReplayRecPtr(NULL);
+				else
+					map_lsn = GetXLogWriteRecPtr();
+
+				MapSuperblockSetPhysCapacity(&entry->super, forknum, desired);
+				MapSuperblockSetLastUpdatedLSN(&entry->super, map_lsn);
+				entry->page_lsn = map_lsn;
+				entry->flags |= MAPSUPER_FLAG_DIRTY;
+				current = desired;
+			}
+
+			desired = Max(desired, MapSuperGetExtendingTarget(entry, forknum));
+			if (current >= desired)
+			{
+				entry->runtime_flags &= ~extend_flag;
+				MapSuperSetExtendingTarget(entry, forknum, InvalidBlockNumber);
+				LWLockRelease(&entry->lock);
+				break;			/* never return from inside PG_TRY() */
+			}
+
+			MapSuperSetExtendingTarget(entry, forknum, desired);
+			LWLockRelease(&entry->lock);
+		}
+	}
+	PG_CATCH();
+	{
+		if (MapSuperFindEntryLocked(rnode, LW_EXCLUSIVE, &entry))
+		{
+			entry->runtime_flags &= ~extend_flag;
+			MapSuperSetExtendingTarget(entry, forknum, InvalidBlockNumber);
+			LWLockRelease(&entry->lock);
+		}
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	return true;
+}
+
+void
+MapSBlockInit(UmbraFileContext *map_ctx, RelFileLocator rnode, XLogRecPtr map_lsn)
+{
+	MapSuperEntry *entry;
+	MapSuperblock	super;
+	MapSuperblock	write_super;
+	char		sector[MAP_SUPERBLOCK_SIZE];
+	XLogRecPtr	write_lsn;
+
+	Assert(map_ctx != NULL);
+	if (!umfile_ctx_fork_exists(map_ctx, UMBRA_METADATA_FORKNUM,
+								UMFILE_EXISTS_DENSE))
+		elog(ERROR, "MAP fork is missing while initializing superblock");
+
+	entry = MapSuperEnsureEntryLocked(rnode);
+
+	MapSuperblockInit(&super, 0);
+
+	entry->super = super;
+	entry->page_lsn = (map_lsn != InvalidXLogRecPtr) ?
+		map_lsn : GetXLogWriteRecPtr();
+	MapSuperblockSetLastUpdatedLSN(&entry->super, entry->page_lsn);
+	entry->flags = MAPSUPER_FLAG_VALID | MAPSUPER_FLAG_DIRTY;
+
+	/*
+	 * Persist the superblock immediately so that later backends in
+	 * bootstrap/initdb can read block 0 before a checkpoint has had a chance
+	 * to flush it. Creation cost stays O(1): only one 512-byte sector is written.
+	 */
+	write_super = entry->super;
+	write_lsn = entry->page_lsn;
+	LWLockRelease(&entry->lock);
+
+	if (!InRecovery && write_lsn != InvalidXLogRecPtr)
+		XLogFlush(write_lsn);

-	write_super = *super;
 	MapSuperblockRefreshCRC(&write_super);
-	MapSuperblockPackPage(&write_super, page);
+	MapSuperblockPackSector(&write_super, sector);
+	umfile_ctx_write(map_ctx, UMBRA_METADATA_FORKNUM, MAP_BLOCK_SUPER,
+					 sector, MAP_SUPERBLOCK_SIZE, false);
+	umfile_ctx_register_dirty(map_ctx, UMBRA_METADATA_FORKNUM, MAP_BLOCK_SUPER,
+							  false, false);
+}
+
+bool
+MapSBlockEnsureLoaded(UmbraFileContext *map_ctx, RelFileLocator rnode)
+{
+	MapSuperEntry *entry;

-	if (!UmMetadataOpenOrCreate(reln, false, NULL))
-		elog(ERROR, "could not open Umbra metadata file for superblock write");
+	Assert(map_ctx != NULL);

-	if (UmMetadataNblocks(reln) == 0)
-		UmMetadataExtend(reln, MAP_BLOCK_SUPER, page, skipFsync);
-	else
-		UmMetadataWrite(reln, MAP_BLOCK_SUPER, page, skipFsync);
+	if (!umfile_ctx_fork_exists(map_ctx, UMBRA_METADATA_FORKNUM,
+								UMFILE_EXISTS_DENSE))
+		return false;
+
+	if (!MapSuperFindEntryLocked(rnode, LW_SHARED, &entry))
+	{
+		MapSuperblock	disk_super;
+		MapSBlockReadStatus status;
+
+		status = MapSuperLoadFromDisk(map_ctx, rnode, &disk_super);
+		if (status == MAP_SBLOCK_READ_MISSING)
+			return false;
+
+		entry = MapSuperEnsureEntryLocked(rnode);
+		if ((entry->flags & MAPSUPER_FLAG_VALID) == 0)
+		{
+			if (status == MAP_SBLOCK_READ_OK)
+			{
+				entry->super = disk_super;
+				entry->page_lsn = MapSuperblockGetLastUpdatedLSN(&disk_super);
+				entry->flags = MAPSUPER_FLAG_VALID;
+			}
+			else
+			{
+				MapSuperblockInit(&entry->super, 0);
+				entry->page_lsn = InvalidXLogRecPtr;
+				entry->flags = MAPSUPER_FLAG_VALID | MAPSUPER_FLAG_CORRUPT;
+			}
+		}
+	}
+
+	LWLockRelease(&entry->lock);
+	return true;
+}
+
+bool
+MapSBlockTryGetLogicalNblocks(UmbraFileContext *map_ctx, RelFileLocator rnode,
+							  ForkNumber forknum,
+							  BlockNumber *nblocks)
+{
+	MapSuperblock super;
+
+	Assert(nblocks != NULL);
+
+	if (!MapForkHasMappedState(forknum))
+		return false;
+
+	switch (MapSBlockRead(map_ctx, rnode, &super))
+	{
+		case MAP_SBLOCK_READ_OK:
+			break;
+		case MAP_SBLOCK_READ_MISSING:
+			return false;
+		case MAP_SBLOCK_READ_CORRUPT:
+			if (!InRecovery)
+				MapSBlockReportCorrupt(rnode, "invalid identity/CRC or short file");
+			return false;
+	}
+
+	if (!MapSuperblockHasValidIdentity(&super))
+		return false;
+
+	*nblocks = MapNormalizeForkBlockCount(forknum,
+										  MapSuperblockGetLogicalNblocks(&super, forknum));
+	return true;
+}
+
+bool
+MapSBlockForkExists(UmbraFileContext *map_ctx, RelFileLocator rnode,
+					ForkNumber forknum)
+{
+	MapSuperblock super;
+
+	if (!MapForkHasMappedState(forknum))
+		return false;
+
+	switch (MapSBlockRead(map_ctx, rnode, &super))
+	{
+		case MAP_SBLOCK_READ_OK:
+			break;
+		case MAP_SBLOCK_READ_MISSING:
+		case MAP_SBLOCK_READ_CORRUPT:
+			return false;
+	}
+
+	if (!MapSuperblockHasValidIdentity(&super))
+		return false;
+
+	return MapSuperForkExists(&super, forknum);
+}
+
+bool
+MapSBlockTryGetPhysicalNblocks(UmbraFileContext *map_ctx, RelFileLocator rnode,
+							   ForkNumber forknum, BlockNumber *nblocks)
+{
+	MapSuperblock super;
+
+	Assert(nblocks != NULL);
+
+	if (!MapForkHasMappedState(forknum))
+		return false;
+
+	switch (MapSBlockRead(map_ctx, rnode, &super))
+	{
+		case MAP_SBLOCK_READ_OK:
+			break;
+		case MAP_SBLOCK_READ_MISSING:
+			return false;
+		case MAP_SBLOCK_READ_CORRUPT:
+			if (!InRecovery)
+				MapSBlockReportCorrupt(rnode, "invalid identity/CRC or short file");
+			return false;
+	}
+
+	if (!MapSuperblockHasValidIdentity(&super))
+		return false;
+
+	*nblocks = MapNormalizeForkBlockCount(forknum,
+										  MapSuperblockGetPhysCapacity(&super, forknum));
+	return true;
+}
+
+bool
+MapSBlockTryGetNextFreePhysBlock(UmbraFileContext *map_ctx, RelFileLocator rnode,
+								 ForkNumber forknum, BlockNumber *next_free_pblk)
+{
+	MapSuperblock super;
+
+	Assert(next_free_pblk != NULL);
+
+	if (!MapForkHasMappedState(forknum))
+		return false;
+
+	switch (MapSBlockRead(map_ctx, rnode, &super))
+	{
+		case MAP_SBLOCK_READ_OK:
+			break;
+		case MAP_SBLOCK_READ_MISSING:
+			return false;
+		case MAP_SBLOCK_READ_CORRUPT:
+			if (!InRecovery)
+				MapSBlockReportCorrupt(rnode, "invalid identity/CRC or short file");
+			return false;
+	}
+
+	if (!MapSuperblockHasValidIdentity(&super))
+		return false;
+
+	*next_free_pblk = MapNormalizeForkBlockCount(forknum,
+												 MapSuperblockGetNextFreePhysBlock(&super, forknum));
+	return true;
+}
+
+
+
+void
+MapSBlockBumpLogicalNblocks(UmbraFileContext *map_ctx, RelFileLocator rnode,
+							ForkNumber forknum, BlockNumber nblocks,
+							XLogRecPtr map_lsn)
+{
+	MapSBlockUpdateLogicalNblocks(map_ctx, rnode, forknum, nblocks,
+								  map_lsn, true);
+}
+
+void
+MapSBlockBumpPhysicalNblocks(UmbraFileContext *map_ctx, RelFileLocator rnode,
+							 ForkNumber forknum, BlockNumber nblocks,
+							 XLogRecPtr map_lsn)
+{
+	MapSBlockBumpPhysicalState(map_ctx, rnode, forknum, nblocks,
+							   false, true, map_lsn);
 }

 void
-MapSBlockInitNew(SMgrRelation reln, uint32 flags, XLogRecPtr lsn, bool skipFsync)
+MapSBlockBumpNextFreePhysBlock(UmbraFileContext *map_ctx, RelFileLocator rnode,
+							   ForkNumber forknum, BlockNumber next_free_pblk,
+							   XLogRecPtr map_lsn)
+{
+	MapSBlockBumpPhysicalState(map_ctx, rnode, forknum, next_free_pblk,
+							   true, false, map_lsn);
+}
+
+void
+MapSBlockSetLogicalNblocks(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						   ForkNumber forknum, BlockNumber nblocks,
+						   XLogRecPtr map_lsn)
+{
+	MapSBlockUpdateLogicalNblocks(map_ctx, rnode, forknum, nblocks,
+								  map_lsn, false);
+}
+
+void
+MapSBlockSetSkipWalPending(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						   bool pending, XLogRecPtr map_lsn)
+{
+	MapSBlockSetPendingFlag(map_ctx, rnode, pending, map_lsn);
+}
+
+bool
+MapSBlockIsSkipWalPending(UmbraFileContext *map_ctx, RelFileLocator rnode)
 {
 	MapSuperblock super;

-	MapSuperblockInit(&super, flags);
-	MapSuperblockSetLastUpdatedLSN(&super, lsn);
-	MapSBlockWrite(reln, &super, skipFsync);
+	switch (MapSBlockRead(map_ctx, rnode, &super))
+	{
+		case MAP_SBLOCK_READ_OK:
+			break;
+		case MAP_SBLOCK_READ_MISSING:
+		case MAP_SBLOCK_READ_CORRUPT:
+			return false;
+	}
+
+	if (!MapSuperblockHasValidIdentity(&super))
+		return false;
+
+	return (MapSuperblockGetFlags(&super) &
+			MAP_SUPERBLOCK_FLAG_SKIP_WAL_PENDING) != 0;
 }

 static void
-MapSBlockReportCorrupt(SMgrRelation reln, const char *reason)
+MapSuperTableRefreshDerivedState(void)
 {
-	RelFileLocator rlocator = reln->smgr_rlocator.locator;
+	MapSuperCapacity = Max(map_superblocks, MAP_SUPERBLOCK_MIN_ENTRIES);
+	MapSuperIndexCapacityPerPartition =
+		MapSuperIndexCapacityForPartition(MapSuperCapacity);
+}

-	ereport(ERROR,
-			(errcode(ERRCODE_DATA_CORRUPTED),
-			 errmsg("Umbra metadata superblock is corrupted for relation %u/%u/%u: %s",
-					rlocator.spcOid, rlocator.dbOid, rlocator.relNumber, reason)));
+void
+MapSuperTableShmemRequest(void)
+{
+	int			total_index_slots;
+
+	MapSuperTableRefreshDerivedState();
+	total_index_slots =
+		MapSuperIndexCapacityPerPartition * MAP_SUPER_NPARTITIONS;
+
+	ShmemRequestStruct(.name = "Map Superblock Table Ctl",
+					   .size = sizeof(MapSuperCtl),
+					   .ptr = (void **) &MapSuperCtlData,
+		);
+
+	ShmemRequestStruct(.name = "Map Superblock Partition Locks",
+					   .size = MAP_SUPER_NPARTITIONS * sizeof(LWLockPadded),
+					   .ptr = (void **) &MapSuperPartitionLocks,
+		);
+
+	ShmemRequestStruct(.name = "Map Superblock Table Entries",
+					   .size = MapSuperCapacity * sizeof(MapSuperEntry),
+					   .ptr = (void **) &MapSuperEntries,
+		);
+
+	ShmemRequestStruct(.name = "Map Superblock Table Index",
+					   .size = total_index_slots * sizeof(MapSuperIndexSlot),
+					   .ptr = (void **) &MapSuperIndex,
+		);
+}
+
+void
+MapSuperTableShmemInit(void)
+{
+	int			total_index_slots;
+	int			i;
+
+	MapSuperTableRefreshDerivedState();
+	total_index_slots = MapSuperIndexCapacityPerPartition * MAP_SUPER_NPARTITIONS;
+
+	for (i = 0; i < MAP_SUPER_NPARTITIONS; i++)
+		LWLockInitialize(&MapSuperPartitionLocks[i].lock,
+						 LWTRANCHE_MAP_BUFFER_CONTENT);
+
+	MapSuperCtlData->free_head = 0;
+	SpinLockInit(&MapSuperCtlData->free_list_lock);
+	for (i = 0; i < MapSuperCapacity; i++)
+	{
+		MapSuperEntry *entry = &MapSuperEntries[i];
+
+		MemSet(entry, 0, sizeof(*entry));
+		entry->next_free =
+			(i == MapSuperCapacity - 1) ? MAPSUPER_FREENEXT_END : (i + 1);
+		entry->in_use = false;
+		entry->extending_target_main = InvalidBlockNumber;
+		entry->extending_target_fsm = InvalidBlockNumber;
+		entry->extending_target_vm = InvalidBlockNumber;
+		LWLockInitialize(&entry->lock, LWTRANCHE_MAP_BUFFER_CONTENT);
+	}
+
+	for (i = 0; i < total_index_slots; i++)
+		MapSuperIndex[i].slot_id = MAPSUPER_INDEX_EMPTY;
+}
+
+void
+MapSuperTableShmemAttach(void)
+{
+	MapSuperTableRefreshDerivedState();
 }
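
Reviewer note on the mapsuper.c hunk above: the shared superblock table is a partitioned open-addressing index. The low hash bits pick a partition, the next bits pick the starting bucket, and the lookup probes linearly, remembering the first tombstone so that an insert can reuse it. The following is a minimal standalone sketch of that probing scheme, not the patch's code. All names (toy_lookup, toy_insert, toy_delete, TOY_*) and the fixed sizes are hypothetical, keys are plain integers instead of RelFileLocator, and locking is left out entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sizes for illustration; the patch derives these at startup. */
#define TOY_NPART_BITS	2
#define TOY_NPARTS		(1 << TOY_NPART_BITS)
#define TOY_PER_PART	8		/* must be a power of two */
#define TOY_EMPTY		(-1)	/* bucket never used: probe chain ends here */
#define TOY_DELETED		(-2)	/* tombstone: keep probing past it */
#define TOY_NSLOTS		16

static int		toy_index[TOY_NPARTS * TOY_PER_PART];
static uint32_t toy_slot_key[TOY_NSLOTS];

static void
toy_index_init(void)
{
	for (int i = 0; i < TOY_NPARTS * TOY_PER_PART; i++)
		toy_index[i] = TOY_EMPTY;
}

/*
 * Return the slot id for key, or -1 if absent. *insert_bucket receives the
 * bucket holding the key, else the bucket an insert should use (the first
 * tombstone seen, else the terminating empty bucket), else -1 if full.
 */
static int
toy_lookup(uint32_t key, uint32_t hash, int *insert_bucket)
{
	int			mask = TOY_PER_PART - 1;
	int			base = (int) (hash & (TOY_NPARTS - 1)) * TOY_PER_PART;
	int			bucket = (int) ((hash >> TOY_NPART_BITS) & (uint32_t) mask);
	int			first_deleted = -1;

	assert((TOY_PER_PART & mask) == 0);		/* power-of-two check */

	for (int probes = 0; probes < TOY_PER_PART; probes++)
	{
		int			slot = toy_index[base + bucket];

		if (slot == TOY_EMPTY)
		{
			*insert_bucket = base + (first_deleted >= 0 ? first_deleted : bucket);
			return -1;
		}
		if (slot == TOY_DELETED)
		{
			if (first_deleted < 0)
				first_deleted = bucket;
		}
		else if (toy_slot_key[slot] == key)
		{
			*insert_bucket = base + bucket;
			return slot;
		}
		bucket = (bucket + 1) & mask;
	}

	*insert_bucket = (first_deleted >= 0) ? base + first_deleted : -1;
	return -1;
}

/* Bind key to slot; returns the bucket used, or -1 if the partition is full. */
static int
toy_insert(uint32_t key, uint32_t hash, int slot)
{
	int			bucket;

	if (toy_lookup(key, hash, &bucket) < 0 && bucket >= 0)
	{
		toy_slot_key[slot] = key;
		toy_index[bucket] = slot;
	}
	return bucket;
}

/* Replace the key's bucket with a tombstone so later probes continue past it. */
static void
toy_delete(uint32_t key, uint32_t hash)
{
	int			bucket;

	if (toy_lookup(key, hash, &bucket) >= 0)
		toy_index[bucket] = TOY_DELETED;
}
```

One consequence worth keeping in mind when reviewing: tombstones are only reclaimed by a later insert that happens to probe across them, so a partition with heavy create/drop churn can end up with no empty buckets, at which point every miss probes the whole partition.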
diff --git a/src/backend/storage/map/meson.build b/src/backend/storage/map/meson.build
index 0f780fe522..8747f0b714 100644
--- a/src/backend/storage/map/meson.build
+++ b/src/backend/storage/map/meson.build
@@ -2,5 +2,9 @@

 backend_sources += files(
   'map.c',
+  'mapinit.c',
+  'mapbuf.c',
+  'mapflush.c',
+  'mapclock.c',
   'mapsuper.c',
 )
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index c9a3ef6461..631d09d4b4 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -134,6 +134,13 @@ typedef struct f_smgr
 	void		(*smgr_sync_relation_metadata) (SMgrRelation reln);
 	void		(*smgr_unlink_relation_metadata) (RelFileLocatorBackend rlocator,
 												  bool isRedo);
+	bool		(*smgr_createdb_allows_wal_log) (void);
+	void		(*smgr_checkpoint_database_tablespaces) (Oid dbid,
+														 int ntablespaces,
+														 const Oid *tablespace_ids);
+	void		(*smgr_invalidate_database_tablespaces) (Oid dbid,
+														 int ntablespaces,
+														 const Oid *tablespace_ids);
 	int			(*smgr_fd) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
 } f_smgr;

@@ -172,6 +179,9 @@ static const f_smgr smgrsw[] = {
 		.smgr_copy_relation_metadata = NULL,
 		.smgr_sync_relation_metadata = NULL,
 		.smgr_unlink_relation_metadata = NULL,
+		.smgr_createdb_allows_wal_log = NULL,
+		.smgr_checkpoint_database_tablespaces = NULL,
+		.smgr_invalidate_database_tablespaces = NULL,
 		.smgr_fd = mdfd,
 	},
 #ifdef USE_UMBRA
@@ -201,6 +211,9 @@ static const f_smgr smgrsw[] = {
 		.smgr_copy_relation_metadata = umcopyrelationmetadata,
 		.smgr_sync_relation_metadata = umsyncrelationmetadata,
 		.smgr_unlink_relation_metadata = umunlinkrelationmetadata,
+		.smgr_createdb_allows_wal_log = umcreatedballowswallog,
+		.smgr_checkpoint_database_tablespaces = umcheckpointdatabasetablespaces,
+		.smgr_invalidate_database_tablespaces = uminvalidatedatabasetablespaces,
 		.smgr_fd = umfd,
 	},
 #endif
@@ -569,8 +582,43 @@ smgrsyncrelationmetadata(SMgrRelation reln)
 void
 smgrunlinkrelationmetadata(RelFileLocatorBackend rlocator, bool isRedo)
 {
-	if (smgrsw[0].smgr_unlink_relation_metadata)
-		smgrsw[0].smgr_unlink_relation_metadata(rlocator, isRedo);
+	if (smgrsw[SMGR_DEFAULT].smgr_unlink_relation_metadata)
+		smgrsw[SMGR_DEFAULT].smgr_unlink_relation_metadata(rlocator, isRedo);
+}
+
+bool
+smgrcreatedballowswallog(void)
+{
+	if (smgrsw[SMGR_DEFAULT].smgr_createdb_allows_wal_log)
+		return smgrsw[SMGR_DEFAULT].smgr_createdb_allows_wal_log();
+
+	return true;
+}
+
+void
+smgrcheckpointdatabasetablespaces(Oid dbid, int ntablespaces,
+								  const Oid *tablespace_ids)
+{
+	if (smgrsw[SMGR_DEFAULT].smgr_checkpoint_database_tablespaces)
+		smgrsw[SMGR_DEFAULT].smgr_checkpoint_database_tablespaces(dbid,
+																  ntablespaces,
+																  tablespace_ids);
+}
+
+void
+smgrinvalidatedatabasetablespaces(Oid dbid, int ntablespaces,
+								  const Oid *tablespace_ids)
+{
+	if (smgrsw[SMGR_DEFAULT].smgr_invalidate_database_tablespaces)
+		smgrsw[SMGR_DEFAULT].smgr_invalidate_database_tablespaces(dbid,
+																  ntablespaces,
+																  tablespace_ids);
+}
+
+void
+smgrinvalidatedatabase(Oid dbid)
+{
+	smgrinvalidatedatabasetablespaces(dbid, 0, NULL);
 }
 /*
  * smgrdosyncall() -- Immediately sync all forks of all given relations
diff --git a/src/backend/storage/smgr/umbra.c b/src/backend/storage/smgr/umbra.c
index fc6e480276..bbb870ab8e 100644
--- a/src/backend/storage/smgr/umbra.c
+++ b/src/backend/storage/smgr/umbra.c
@@ -3,10 +3,9 @@
  * umbra.c
  *	  Umbra storage manager skeleton.
  *
- * This file establishes Umbra as a separate smgr implementation from md.c.
- * maintains identity mapping state (logical block number == physical block
- * number) in the relation-local metadata file while using md.c for data-fork
- * I/O and umfile for metadata-file I/O.
+ * This file establishes Umbra as a separate smgr implementation from md.c. It
+ * maintains relation-local metadata and MAP checkpoint/cache state while using
+ * md.c for data-fork I/O and umfile for metadata-file I/O.
  *
  * src/backend/storage/smgr/umbra.c
  *
@@ -14,13 +13,17 @@
  */
 #include "postgres.h"

+#include "access/xlogutils.h"
 #include "catalog/pg_class.h"
+#include "common/relpath.h"
+#include "storage/bufmgr.h"
+#include "storage/map.h"
 #include "storage/md.h"
-#include "storage/mapsuper.h"
 #include "storage/smgr.h"
 #include "storage/umfile.h"
 #include "storage/umbra.h"
 #include "utils/memutils.h"
+#include "utils/wait_event.h"

 typedef struct UmbraSmgrRelationState
 {
@@ -29,9 +32,11 @@ typedef struct UmbraSmgrRelationState

 static bool um_tracks_identity_metadata(ForkNumber forknum);
 static UmbraFileContext *um_relation_filectx(SMgrRelation reln);
+static void um_ensure_redo_metadata(SMgrRelation reln, ForkNumber forknum);
 static void um_identity_update_metadata(SMgrRelation reln, ForkNumber forknum,
-										BlockNumber nblocks, bool fork_exists,
-										bool skipFsync);
+										BlockNumber nblocks, bool fork_exists);
+static void um_refresh_identity_metadata(SMgrRelation reln);
+static void um_filetag_path(const FileTag *ftag, char *path);

 bool
 UmMetadataExists(SMgrRelation reln)
@@ -72,11 +77,30 @@ void
 UmMetadataWrite(SMgrRelation reln, BlockNumber blkno, const void *buffer,
 				bool skipFsync)
 {
-	const void *buffers[1];
+	UmbraFileContext *ctx = um_relation_filectx(reln);

-	buffers[0] = buffer;
-	umfile_writev(um_relation_filectx(reln), UMBRA_METADATA_FORKNUM, blkno,
-				  buffers, 1, skipFsync);
+	umfile_ctx_write(ctx, UMBRA_METADATA_FORKNUM, blkno,
+					 buffer, BLCKSZ, skipFsync);
+	umfile_ctx_register_dirty(ctx, UMBRA_METADATA_FORKNUM, blkno,
+							  skipFsync,
+							  RelFileLocatorBackendIsTemp(reln->smgr_rlocator));
+}
+
+void
+UmMetadataWriteSuperblock(RelFileLocatorBackend rlocator, const void *sector,
+						  bool skipFsync)
+{
+	UmbraFileContext *ctx = umfile_ctx_acquire(rlocator);
+
+	/*
+	 * Superblock checkpoint flush can run while holding MapSuperEntry->lock,
+	 * so it must not recurse through smgr/umopen.
+	 */
+	umfile_ctx_write(ctx, UMBRA_METADATA_FORKNUM, MAP_BLOCK_SUPER,
+					 sector, MAP_SUPERBLOCK_SIZE, skipFsync);
+	umfile_ctx_register_dirty(ctx, UMBRA_METADATA_FORKNUM, MAP_BLOCK_SUPER,
+							  skipFsync,
+							  RelFileLocatorBackendIsTemp(rlocator));
 }

 void
@@ -90,6 +114,7 @@ UmMetadataExtend(SMgrRelation reln, BlockNumber blkno, const void *buffer,
 void
 UmMetadataImmediateSync(SMgrRelation reln)
 {
+	MapCheckpointRelation(reln->smgr_rlocator.locator);
 	umfile_immedsync(um_relation_filectx(reln), UMBRA_METADATA_FORKNUM);
 }

@@ -99,10 +124,32 @@ UmMetadataUnlink(RelFileLocatorBackend rlocator, bool isRedo)
 	umfile_unlink(rlocator, UMBRA_METADATA_FORKNUM, isRedo);
 }

+void
+UmInvalidateDatabase(Oid dbid)
+{
+	FileTag		tag;
+	RelFileLocator rlocator;
+
+	MapInvalidateDatabase(dbid);
+
+	rlocator.spcOid = 0;
+	rlocator.dbOid = dbid;
+	rlocator.relNumber = 0;
+
+	memset(&tag, 0, sizeof(tag));
+	tag.handler = SYNC_HANDLER_UMBRA;
+	tag.rlocator = rlocator;
+	tag.forknum = InvalidForkNumber;
+	tag.segno = InvalidBlockNumber;
+
+	RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, true);
+}
+
 void
 uminit(void)
 {
 	umfile_init();
+	MapBackendInit();
 }

 void
@@ -131,10 +178,9 @@ umdestroy(SMgrRelation reln)
 {
 	UmbraSmgrRelationState *state = reln->smgr_private;

-	umfile_ctx_release(reln->smgr_rlocator);
-
 	if (state != NULL)
 	{
+		umfile_ctx_forget(reln->smgr_rlocator);
 		pfree(state);
 		reln->smgr_private = NULL;
 	}
@@ -146,18 +192,56 @@ umisinternalfork(ForkNumber forknum)
 	return forknum == UMBRA_METADATA_FORKNUM;
 }

+bool
+umcreatedballowswallog(void)
+{
+	return false;
+}
+
+void
+umcheckpointdatabasetablespaces(Oid dbid, int ntablespaces,
+								const Oid *tablespace_ids)
+{
+	MapCheckpointDatabaseTablespaces(dbid, ntablespaces, tablespace_ids);
+}
+
+void
+uminvalidatedatabasetablespaces(Oid dbid, int ntablespaces,
+								const Oid *tablespace_ids)
+{
+	MapInvalidateDatabaseTablespaces(dbid, ntablespaces, tablespace_ids);
+}
+
 void
 umcreaterelationmetadata(SMgrRelation reln)
 {
+	UmbraFileContext *ctx = um_relation_filectx(reln);
 	bool		created = false;

-	if (!UmMetadataOpenOrCreate(reln, false, &created))
+	/*
+	 * smgrcreaterelationmetadata() is used both in normal create and redo
+	 * paths, so tolerate an already-existing metadata fork here.
+	 */
+	if (!UmMetadataOpenOrCreate(reln, true, &created))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not create Umbra metadata fork for relation %u/%u/%u",
 						reln->smgr_rlocator.locator.spcOid,
 						reln->smgr_rlocator.locator.dbOid,
 						reln->smgr_rlocator.locator.relNumber)));
+
+	elog(DEBUG1, "umbra metadata open/create %u/%u/%u created=%s",
+		 reln->smgr_rlocator.locator.spcOid,
+		 reln->smgr_rlocator.locator.dbOid,
+		 reln->smgr_rlocator.locator.relNumber,
+		 created ? "true" : "false");
+
+	if (created)
+		MapSBlockInit(ctx, reln->smgr_rlocator.locator, InvalidXLogRecPtr);
+	else
+		(void) MapSBlockEnsureLoaded(ctx, reln->smgr_rlocator.locator);
+
+	um_refresh_identity_metadata(reln);
 }

 void
@@ -166,7 +250,6 @@ umcopyrelationmetadata(SMgrRelation src, SMgrRelation dst, char relpersistence)
 	BlockNumber src_nblocks;
 	BlockNumber dst_nblocks;
 	PGIOAlignedBlock pagebuf;
-	bool		created = false;

 	if (relpersistence != RELPERSISTENCE_PERMANENT)
 		return;
@@ -174,13 +257,7 @@ umcopyrelationmetadata(SMgrRelation src, SMgrRelation dst, char relpersistence)
 	if (!UmMetadataExists(src))
 		return;

-	if (!UmMetadataOpenOrCreate(dst, false, &created))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create Umbra metadata fork for relation %u/%u/%u",
-						dst->smgr_rlocator.locator.spcOid,
-						dst->smgr_rlocator.locator.dbOid,
-						dst->smgr_rlocator.locator.relNumber)));
+	umcreaterelationmetadata(dst);

 	src_nblocks = UmMetadataNblocks(src);
 	dst_nblocks = UmMetadataNblocks(dst);
@@ -209,7 +286,7 @@ umsyncrelationmetadata(SMgrRelation reln)
 void
 umunlinkrelationmetadata(RelFileLocatorBackend rlocator, bool isRedo)
 {
-	umfile_ctx_forget(rlocator);
+	MapInvalidateRelation(rlocator.locator);
 	UmMetadataUnlink(rlocator, isRedo);
 }

@@ -218,8 +295,20 @@ umcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 {
 	mdcreate(reln, forknum, isRedo);

-	if (um_tracks_identity_metadata(forknum))
-		um_identity_update_metadata(reln, forknum, 0, true, true);
+	/*
+	 * Redo for permanent relation creation reaches smgrcreate() directly, so
+	 * make sure the metadata fork exists before later recovery steps touch the
+	 * relation again.
+	 */
+	if (isRedo &&
+		forknum == MAIN_FORKNUM &&
+		!UmMetadataExists(reln))
+		umcreaterelationmetadata(reln);
+
+	if (forknum != MAIN_FORKNUM &&
+		um_tracks_identity_metadata(forknum) &&
+		UmMetadataExists(reln))
+		um_identity_update_metadata(reln, forknum, 0, true);
 }

 bool
@@ -234,7 +323,12 @@ umexists(SMgrRelation reln, ForkNumber forknum)
 void
 umunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
 {
-	umfile_ctx_forget(rlocator);
+	if (forknum == UMBRA_METADATA_FORKNUM ||
+		forknum == MAIN_FORKNUM ||
+		forknum == InvalidForkNumber)
+	{
+		MapInvalidateRelation(rlocator.locator);
+	}

 	if (forknum == UMBRA_METADATA_FORKNUM)
 	{
@@ -252,11 +346,11 @@ void
 umextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 const void *buffer, bool skipFsync)
 {
+	um_ensure_redo_metadata(reln, forknum);
 	mdextend(reln, forknum, blocknum, buffer, skipFsync);

-	if (um_tracks_identity_metadata(forknum))
-		um_identity_update_metadata(reln, forknum, blocknum + 1, true,
-									skipFsync);
+	if (um_tracks_identity_metadata(forknum) && UmMetadataExists(reln))
+		um_identity_update_metadata(reln, forknum, blocknum + 1, true);
 }

 void
@@ -265,18 +359,19 @@ umzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 {
 	BlockNumber target_nblocks;

+	um_ensure_redo_metadata(reln, forknum);
 	mdzeroextend(reln, forknum, blocknum, nblocks, skipFsync);

-	if (um_tracks_identity_metadata(forknum))
-	{
-		target_nblocks = blocknum + (BlockNumber) nblocks;
-		if (target_nblocks < blocknum)
-			ereport(ERROR,
-					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-					 errmsg("Umbra identity mapping block count overflow")));
-		um_identity_update_metadata(reln, forknum, target_nblocks, true,
-									skipFsync);
-	}
+	if (!um_tracks_identity_metadata(forknum) || !UmMetadataExists(reln))
+		return;
+
+	target_nblocks = blocknum + (BlockNumber) nblocks;
+	if (target_nblocks < blocknum)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("Umbra identity mapping block count overflow")));
+
+	um_identity_update_metadata(reln, forknum, target_nblocks, true);
 }

 bool
@@ -296,6 +391,7 @@ void
 umreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		void **buffers, BlockNumber nblocks)
 {
+	um_ensure_redo_metadata(reln, forknum);
 	mdreadv(reln, forknum, blocknum, buffers, nblocks);
 }

@@ -303,6 +399,7 @@ void
 umstartreadv(PgAioHandle *ioh, SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber blocknum, void **buffers, BlockNumber nblocks)
 {
+	um_ensure_redo_metadata(reln, forknum);
 	mdstartreadv(ioh, reln, forknum, blocknum, buffers, nblocks);
 }

@@ -310,7 +407,14 @@ void
 umwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
+	um_ensure_redo_metadata(reln, forknum);
 	mdwritev(reln, forknum, blocknum, buffers, nblocks, skipFsync);
+
+	if (InRecovery &&
+		um_tracks_identity_metadata(forknum) &&
+		UmMetadataExists(reln))
+		um_identity_update_metadata(reln, forknum, mdnblocks(reln, forknum),
+									true);
 }

 void
@@ -324,9 +428,8 @@ BlockNumber
 umnblocks(SMgrRelation reln, ForkNumber forknum)
 {
 	/*
-	 * Keep md.c responsible for the physical fork size query. mdtruncate()
-	 * relies on a preceding mdnblocks() call to have opened all active
-	 * segments.
+	 * Keep md.c responsible for physical fork size queries. mdtruncate()
+	 * relies on a preceding mdnblocks() call to have opened active segments.
 	 */
 	return mdnblocks(reln, forknum);
 }
@@ -337,8 +440,8 @@ umtruncate(SMgrRelation reln, ForkNumber forknum,
 {
 	mdtruncate(reln, forknum, old_blocks, nblocks);

-	if (um_tracks_identity_metadata(forknum))
-		um_identity_update_metadata(reln, forknum, nblocks, true, false);
+	if (um_tracks_identity_metadata(forknum) && UmMetadataExists(reln))
+		um_identity_update_metadata(reln, forknum, nblocks, true);
 }

 void
@@ -362,6 +465,56 @@ umfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
 	return mdfd(reln, forknum, blocknum, off);
 }

+int
+umsyncfiletag(const FileTag *ftag, char *path)
+{
+	File		fd;
+	int			ret;
+	int			save_errno;
+
+	um_filetag_path(ftag, path);
+
+	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+	if (fd < 0)
+		return -1;
+
+	ret = FileSync(fd, WAIT_EVENT_DATA_FILE_SYNC);
+	save_errno = errno;
+
+	FileClose(fd);
+	errno = save_errno;
+	return ret;
+}
+
+int
+umunlinkfiletag(const FileTag *ftag, char *path)
+{
+	um_filetag_path(ftag, path);
+	return unlink(path);
+}
+
+bool
+umfiletagmatches(const FileTag *ftag, const FileTag *candidate)
+{
+	if (ftag->forknum == InvalidForkNumber &&
+		ftag->segno == InvalidBlockNumber &&
+		ftag->rlocator.spcOid == 0 &&
+		ftag->rlocator.relNumber == 0)
+		return ftag->rlocator.dbOid == candidate->rlocator.dbOid;
+
+	if (ftag->forknum == InvalidForkNumber &&
+		ftag->segno == InvalidBlockNumber)
+		return RelFileLocatorEquals(ftag->rlocator, candidate->rlocator);
+
+	if (ftag->segno == InvalidBlockNumber)
+		return RelFileLocatorEquals(ftag->rlocator, candidate->rlocator) &&
+			ftag->forknum == candidate->forknum;
+
+	return RelFileLocatorEquals(ftag->rlocator, candidate->rlocator) &&
+		ftag->forknum == candidate->forknum &&
+		ftag->segno == candidate->segno;
+}
+
 static UmbraFileContext *
 um_relation_filectx(SMgrRelation reln)
 {
@@ -384,32 +537,102 @@ um_tracks_identity_metadata(ForkNumber forknum)
 		forknum == VISIBILITYMAP_FORKNUM;
 }

+static void
+um_ensure_redo_metadata(SMgrRelation reln, ForkNumber forknum)
+{
+	Assert(reln != NULL);
+
+	if (!InRecovery ||
+		RelFileLocatorBackendIsTemp(reln->smgr_rlocator) ||
+		!um_tracks_identity_metadata(forknum) ||
+		UmMetadataExists(reln))
+		return;
+
+	/*
+	 * Redo can materialize a new data fork via mdwritev()/mdextend() without a
+	 * preceding smgrcreate() callback, for example during CREATE DATABASE
+	 * WAL-log replay. Ensure metadata exists before MAP state is consulted or
+	 * checkpointed for that relation.
+	 */
+	elog(DEBUG1, "umbra redo ensure metadata %u/%u/%u fork=%d",
+		 reln->smgr_rlocator.locator.spcOid,
+		 reln->smgr_rlocator.locator.dbOid,
+		 reln->smgr_rlocator.locator.relNumber,
+		 forknum);
+	umcreaterelationmetadata(reln);
+}
+
 static void
 um_identity_update_metadata(SMgrRelation reln, ForkNumber forknum,
-							BlockNumber nblocks, bool fork_exists,
-							bool skipFsync)
+							BlockNumber nblocks, bool fork_exists)
 {
-	MapSuperblock super;
+	UmbraFileContext *ctx = um_relation_filectx(reln);
+	BlockNumber logical_nblocks;

 	Assert(reln != NULL);
 	Assert(um_tracks_identity_metadata(forknum));
+	Assert(UmMetadataExists(reln));

-	if (!MapSBlockRead(reln, &super))
-		MapSuperblockInit(&super, 0);
+	if (!MapSBlockEnsureLoaded(ctx, reln->smgr_rlocator.locator))
+		elog(ERROR, "could not load MAP superblock for relation %u/%u/%u",
+			 reln->smgr_rlocator.locator.spcOid,
+			 reln->smgr_rlocator.locator.dbOid,
+			 reln->smgr_rlocator.locator.relNumber);

 	if (!fork_exists && forknum != MAIN_FORKNUM)
+		logical_nblocks = InvalidBlockNumber;
+	else
+		logical_nblocks = nblocks;
+
+	MapSBlockSetLogicalNblocks(ctx, reln->smgr_rlocator.locator,
+							   forknum, logical_nblocks,
+							   InvalidXLogRecPtr);
+
+	if (fork_exists || forknum == MAIN_FORKNUM)
 	{
-		MapSuperblockSetLogicalNblocks(&super, forknum, InvalidBlockNumber);
-		MapSuperblockSetNextFreePhysBlock(&super, forknum, InvalidBlockNumber);
-		MapSuperblockSetPhysCapacity(&super, forknum, InvalidBlockNumber);
+		MapSBlockBumpNextFreePhysBlock(ctx, reln->smgr_rlocator.locator,
+									   forknum, nblocks,
+									   InvalidXLogRecPtr);
+		MapSBlockBumpPhysicalNblocks(ctx, reln->smgr_rlocator.locator,
+									 forknum, nblocks,
+									 InvalidXLogRecPtr);
 	}
-	else
+}
+
+static void
+um_refresh_identity_metadata(SMgrRelation reln)
+{
+	ForkNumber	forknum;
+
+	Assert(UmMetadataExists(reln));
+
+	for (forknum = MAIN_FORKNUM; forknum <= VISIBILITYMAP_FORKNUM; forknum++)
 	{
-		MapSuperblockSetLogicalNblocks(&super, forknum, nblocks);
-		MapSuperblockSetNextFreePhysBlock(&super, forknum, nblocks);
-		MapSuperblockSetPhysCapacity(&super, forknum, nblocks);
+		bool		fork_exists;
+		BlockNumber nblocks;
+
+		if (!um_tracks_identity_metadata(forknum))
+			continue;
+
+		fork_exists = mdexists(reln, forknum);
+		nblocks = fork_exists ? mdnblocks(reln, forknum) : 0;
+		um_identity_update_metadata(reln, forknum, nblocks, fork_exists);
 	}
+}

-	MapSuperblockSetLastUpdatedLSN(&super, InvalidXLogRecPtr);
-	MapSBlockWrite(reln, &super, skipFsync);
+static void
+um_filetag_path(const FileTag *ftag, char *path)
+{
+	RelPathStr	base;
+
+	if (ftag->forknum == UMBRA_METADATA_FORKNUM)
+		base = UmMetadataRelPathPerm(ftag->rlocator);
+	else
+		base = relpathperm(ftag->rlocator, ftag->forknum);
+
+	if (ftag->segno == 0)
+		strlcpy(path, base.str, MAXPGPATH);
+	else
+		snprintf(path, MAXPGPATH, "%s.%llu",
+				 base.str, (unsigned long long) ftag->segno);
 }
diff --git a/src/backend/storage/smgr/umfile.c b/src/backend/storage/smgr/umfile.c
index f8d1140840..17145405cf 100644
--- a/src/backend/storage/smgr/umfile.c
+++ b/src/backend/storage/smgr/umfile.c
@@ -1,32 +1,40 @@
 /*-------------------------------------------------------------------------
  *
  * umfile.c
- *	  Umbra backend-local file/segment helpers.
+ *	  Umbra file/segment manager.
  *
  * This layer owns backend-local file contexts keyed by RelFileLocatorBackend
- * and provides physical fork/segment management beneath Umbra metadata and
- * mapping code.
- *
- * src/backend/storage/smgr/umfile.c
+ * and provides low-level physical file/segment handling for Umbra forks.
  *
  *-------------------------------------------------------------------------
  */
+
 #include "postgres.h"

-#include <fcntl.h>
 #include <unistd.h>
+#include <fcntl.h>
+#include <sys/uio.h>

 #include "access/xlogutils.h"
-#include "commands/tablespace.h"
+#include "catalog/pg_tablespace_d.h"
 #include "common/relpath.h"
+#include "commands/tablespace.h"
+#include "common/file_utils.h"
 #include "miscadmin.h"
+#include "pg_trace.h"
+#include "pgstat.h"
+#include "storage/aio.h"
 #include "storage/fd.h"
-#include "storage/um_defs.h"
+#include "storage/sync.h"
 #include "storage/umfile.h"
 #include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/wait_event.h"

+/*
+ * Like md.c, we split relation storage into segments of RELSEG_SIZE blocks.
+ */
+
 /* Behavior flags for segment open helpers. */
 #define UM_EXTENSION_FAIL				(1 << 0)
 #define UM_EXTENSION_RETURN_NULL		(1 << 1)
@@ -34,6 +42,10 @@
 #define UM_EXTENSION_CREATE_RECOVERY	(1 << 3)
 #define UM_EXTENSION_DONT_OPEN			(1 << 5)

+/* local state */
+static MemoryContext UmCxt = NULL;
+static HTAB *UmCtxRegistry = NULL;
+
 typedef struct UmCtxRegistryEntry
 {
 	RelFileLocatorBackend rlocator;
@@ -49,71 +61,48 @@ typedef struct UmfdVec
 struct UmbraFileContext
 {
 	RelFileLocatorBackend rlocator;
+
 	int			num_open_segs[UMBRA_FORK_SLOTS];
-	UmfdVec    *seg_fds[UMBRA_FORK_SLOTS];
-	uint32		refcount;
+	UmfdVec	   *seg_fds[UMBRA_FORK_SLOTS];	/* array [0..num_open_segs) */
 };

-static MemoryContext UmFileCxt = NULL;
-static HTAB *UmFileContextHash = NULL;
-
-static void umfile_ctx_registry_init(void);
-static UmbraFileContext *umfile_ctx_create(RelFileLocatorBackend rlocator);
-static void umfile_ctx_destroy(UmbraFileContext *ctx);
-static void umfile_close_open_segments(UmbraFileContext *ctx,
-									   ForkNumber forknum);
-static bool umfile_create(UmbraFileContext *ctx, ForkNumber forknum,
-						  bool isRedo);
-static int	umfile_open_flags(void);
-static void umfile_fdvec_resize(UmbraFileContext *ctx, ForkNumber forknum,
-								int nseg);
-static inline UmfdVec *umfile_v_get(UmbraFileContext *ctx,
-									ForkNumber forknum, int segindex);
-static BlockNumber umfile_nblocks_in_seg(File vfd);
-static RelPathStr umfile_segpath(RelFileLocatorBackend rlocator,
-								 ForkNumber forknum, BlockNumber segno);
-static UmfdVec *umfile_openseg(UmbraFileContext *ctx,
-							   RelFileLocatorBackend rlocator,
-							   ForkNumber forknum,
-							   BlockNumber segno, int oflags);
-static UmfdVec *umfile_openfork(UmbraFileContext *ctx,
-								RelFileLocatorBackend rlocator,
+/* Forward declarations for internal ctx+rlocator core helpers. */
+static UmfdVec *umfile_openfork(UmbraFileContext *ctx, RelFileLocatorBackend rlocator,
 								ForkNumber forknum, int behavior);
-static UmfdVec *umfile_getseg(UmbraFileContext *ctx,
-							  RelFileLocatorBackend rlocator,
+static UmfdVec *umfile_openseg(UmbraFileContext *ctx, RelFileLocatorBackend rlocator,
+							   ForkNumber forknum, BlockNumber segno, int oflags);
+static UmfdVec *umfile_getseg(UmbraFileContext *ctx, RelFileLocatorBackend rlocator,
 							  ForkNumber forknum, BlockNumber blkno,
-							  bool skipFsync, int behavior);
-static bool umfile_fork_has_open_segment(UmbraFileContext *ctx,
+							  bool skipFsync, int behavior,
+							  bool isTempRelation);
+static void umfile_register_dirty_seg(RelFileLocatorBackend rlocator,
+									  bool isTempRelation,
+									  ForkNumber forknum, UmfdVec *seg);
+static bool umfile_fork_allows_sparse_segments(ForkNumber forknum);
+static BlockNumber umfile_nblocks_sparse(UmbraFileContext *ctx,
+										 RelFileLocatorBackend rlocator,
 										 ForkNumber forknum);
+static BlockNumber umfile_nblocks_dense(UmbraFileContext *ctx,
+										RelFileLocatorBackend rlocator,
+										ForkNumber forknum);
+static BlockNumber umfile_nblocks_in_seg(File vfd);
+static bool umfile_collect_existing_segnos_by_path(const char *seg0path,
+												   BlockNumber **segnos_out,
+												   int *nsegnos_out);
+static bool umfile_any_segment_exists_by_path(const char *seg0path);
+static inline UmfdVec *umfile_v_get(UmbraFileContext *ctx, ForkNumber forknum,
+									int segindex);
+static bool umfile_fork_has_open_segment(UmbraFileContext *ctx, ForkNumber forknum);
 static bool umfile_fork_has_open_segment_on_disk(UmbraFileContext *ctx,
 												 RelFileLocatorBackend rlocator,
 												 ForkNumber forknum);
 static inline bool umfile_seg_entry_is_open(const UmfdVec *seg);
 static inline void umfile_seg_entry_reset(UmfdVec *seg);
-
-void
-umfile_init(void)
-{
-	HASHCTL		ctl;
-
-	if (UmFileContextHash != NULL)
-		return;
-
-	UmFileCxt = AllocSetContextCreate(TopMemoryContext,
-									  "UmFile",
-									  ALLOCSET_DEFAULT_SIZES);
-	MemoryContextAllowInCriticalSection(UmFileCxt, true);
-
-	memset(&ctl, 0, sizeof(ctl));
-	ctl.keysize = sizeof(RelFileLocatorBackend);
-	ctl.entrysize = sizeof(UmCtxRegistryEntry);
-	ctl.hcxt = UmFileCxt;
-
-	UmFileContextHash = hash_create("Umbra file context registry",
-									256,
-									&ctl,
-									HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-}
+static void umfile_build_segpath(UmbraFileContext *ctx, ForkNumber forknum,
+								 BlockNumber segno, char *path, size_t pathlen);
+static void umfile_ctx_registry_init(void);
+static UmbraFileContext *umfile_ctx_create(RelFileLocatorBackend rlocator);
+static void umfile_ctx_destroy_internal(UmbraFileContext *ctx);

 UmbraFileContext *
 umfile_ctx_lookup(RelFileLocatorBackend rlocator)
@@ -121,7 +110,7 @@ umfile_ctx_lookup(RelFileLocatorBackend rlocator)
 	UmCtxRegistryEntry *entry;

 	umfile_ctx_registry_init();
-	entry = hash_search(UmFileContextHash, &rlocator, HASH_FIND, NULL);
+	entry = hash_search(UmCtxRegistry, &rlocator, HASH_FIND, NULL);
 	if (entry == NULL)
 		return NULL;

@@ -132,13 +121,12 @@ UmbraFileContext *
 umfile_ctx_acquire(RelFileLocatorBackend rlocator)
 {
 	UmCtxRegistryEntry *entry;
-	bool		found;
+	bool found;

 	umfile_ctx_registry_init();
-	entry = hash_search(UmFileContextHash, &rlocator, HASH_ENTER, &found);
+	entry = hash_search(UmCtxRegistry, &rlocator, HASH_ENTER, &found);
 	if (!found)
 		entry->ctx = umfile_ctx_create(rlocator);
-	entry->ctx->refcount++;

 	return entry->ctx;
 }
@@ -150,79 +138,43 @@ umfile_ctx_create_temporary(RelFileLocatorBackend rlocator)
 	return umfile_ctx_create(rlocator);
 }

-void
-umfile_ctx_destroy_temporary(UmbraFileContext *ctx)
-{
-	if (ctx == NULL)
-		return;
-
-	umfile_ctx_destroy(ctx);
-}
-
-void
-umfile_ctx_release(RelFileLocatorBackend rlocator)
-{
-	UmCtxRegistryEntry *entry;
-	UmbraFileContext *ctx;
-
-	if (UmFileContextHash == NULL)
-		return;
-
-	entry = hash_search(UmFileContextHash, &rlocator, HASH_FIND, NULL);
-	if (entry == NULL)
-		return;
-
-	ctx = entry->ctx;
-	Assert(ctx->refcount > 0);
-	ctx->refcount--;
-
-	if (ctx->refcount == 0)
-	{
-		umfile_ctx_destroy(ctx);
-		(void) hash_search(UmFileContextHash, &rlocator, HASH_REMOVE, NULL);
-	}
-}
-
 void
 umfile_ctx_forget(RelFileLocatorBackend rlocator)
 {
 	UmCtxRegistryEntry *entry;
-	UmbraFileContext *ctx;

-	if (UmFileContextHash == NULL)
+	if (UmCtxRegistry == NULL)
 		return;

-	entry = hash_search(UmFileContextHash, &rlocator, HASH_FIND, NULL);
+	entry = hash_search(UmCtxRegistry, &rlocator, HASH_FIND, NULL);
 	if (entry == NULL)
 		return;

-	ctx = entry->ctx;
-	for (ForkNumber forknum = 0; forknum <= UMBRA_METADATA_FORKNUM; forknum++)
-		umfile_close_open_segments(ctx, forknum);
-
-	if (ctx->refcount == 0)
-	{
-		umfile_ctx_destroy(ctx);
-		(void) hash_search(UmFileContextHash, &rlocator, HASH_REMOVE, NULL);
-	}
+	umfile_ctx_destroy_internal(entry->ctx);
+	(void) hash_search(UmCtxRegistry, &rlocator, HASH_REMOVE, NULL);
 }

 void
-umfile_ctx_close_fork(UmbraFileContext *ctx, ForkNumber forknum)
+umfile_ctx_destroy_temporary(UmbraFileContext *ctx)
 {
-	if (ctx == NULL)
-		return;
-
-	umfile_close_open_segments(ctx, forknum);
+	umfile_ctx_destroy_internal(ctx);
 }

+/*
+ * MAP-layer context helpers
+ *
+ * These operate directly on the ctx+rlocator core.
+ *
+ * Important: these helpers intentionally do not register fsync requests for
+ * writes/extends. The MAP layer calls umfile_ctx_register_dirty() explicitly.
+ */
+
 bool
 umfile_ctx_fork_exists(UmbraFileContext *ctx, ForkNumber forknum,
 					   UmFileExistsMode mode)
 {
 	if (ctx == NULL)
 		return false;
-
 	return umfile_exists(ctx, forknum, mode);
 }

@@ -234,83 +186,117 @@ umfile_ctx_get_nblocks(UmbraFileContext *ctx, ForkNumber forknum,
 	return umfile_nblocks(ctx, forknum, mode);
 }

+static void
+umfile_ctx_ensure_fork(UmbraFileContext *ctx, ForkNumber forknum)
+{
+	Assert(ctx != NULL);
+	if (!umfile_exists(ctx, forknum,
+					   umfile_fork_allows_sparse_segments(forknum) ?
+					   UMFILE_EXISTS_SPARSE :
+					   UMFILE_EXISTS_DENSE))
+		umfile_create(ctx, forknum, false /* isRedo */ );
+}
+
+static void
+umfile_ctx_ensure_block_exists(UmbraFileContext *ctx, ForkNumber forknum,
+							   BlockNumber blkno)
+{
+	Assert(ctx != NULL);
+
+	if (umfile_ctx_block_exists(ctx, forknum, blkno))
+		return;
+
+	/*
+	 * Materialize just the requested block. For sparse mapped forks we do not
+	 * need an authoritative current EOF here; FileZero() can create the target
+	 * segment and make blkno BLCKSZ-addressable directly.
+	 */
+	Assert(blkno < MaxBlockNumber);
+	umfile_zeroextend(ctx, forknum, blkno, 1, true /* skipFsync */ );
+}
+
 void
 umfile_ctx_read(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blkno,
 				char *buffer, int nbytes)
 {
-	UmfdVec    *seg;
-	off_t		offset;
-	ssize_t		got;
+	UmfdVec	   *v;
+	off_t		seekpos;
+	int			got;

 	Assert(ctx != NULL);
 	Assert(buffer != NULL);
 	Assert(nbytes > 0 && nbytes <= BLCKSZ);

-	seg = umfile_getseg(ctx, ctx->rlocator, forknum, blkno,
-						false,
-						UM_EXTENSION_FAIL | UM_EXTENSION_CREATE_RECOVERY);
-	offset = (off_t) BLCKSZ * (blkno % ((BlockNumber) RELSEG_SIZE));
-	got = FileRead(seg->umfd_vfd, buffer, nbytes, offset,
+	v = umfile_getseg(ctx, ctx->rlocator, forknum, blkno,
+					  false /* skipFsync */,
+					  UM_EXTENSION_FAIL,
+					  false /* isTempRelation */);
+	seekpos = (off_t) BLCKSZ * (blkno % ((BlockNumber) RELSEG_SIZE));
+
+	got = FileRead(v->umfd_vfd, buffer, nbytes, seekpos,
 				   WAIT_EVENT_DATA_FILE_READ);
-	if (got < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not read block %u in file \"%s\": %m",
-						blkno, FilePathName(seg->umfd_vfd))));
 	if (got != nbytes)
+	{
+		if (got < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m",
+							FilePathName(v->umfd_vfd))));
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
-				 errmsg("could not read block %u in file \"%s\"",
-						blkno, FilePathName(seg->umfd_vfd)),
-				 errdetail("Read only %zd of %d bytes.", got, nbytes)));
+				 errmsg("could not read file \"%s\": read only %d of %d bytes at block %u",
+						FilePathName(v->umfd_vfd), got, nbytes, blkno)));
+	}
 }

 void
 umfile_ctx_write(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blkno,
 				 const char *buffer, int nbytes, bool skipFsync)
 {
-	UmfdVec    *seg;
-	BlockNumber	nblocks;
-	off_t		offset;
-	ssize_t		wrote;
+	UmfdVec	   *v;
+	off_t		seekpos;
+	int			wrote;

 	Assert(ctx != NULL);
 	Assert(buffer != NULL);
 	Assert(nbytes > 0 && nbytes <= BLCKSZ);

-	nblocks = umfile_nblocks(ctx, forknum, UMFILE_NBLOCKS_DENSE);
-	if (blkno >= nblocks)
-		ereport(ERROR,
-				(errcode(ERRCODE_DATA_CORRUPTED),
-				 errmsg("cannot overwrite block %u in relation %u/%u/%u fork %d",
-						blkno,
-						ctx->rlocator.locator.spcOid,
-						ctx->rlocator.locator.dbOid,
-						ctx->rlocator.locator.relNumber,
-						forknum),
-				 errdetail("Current fork size is %u blocks.", nblocks)));
-
-	seg = umfile_getseg(ctx, ctx->rlocator, forknum, blkno,
-						skipFsync,
-						UM_EXTENSION_FAIL | UM_EXTENSION_CREATE_RECOVERY);
-	offset = (off_t) BLCKSZ * (blkno % ((BlockNumber) RELSEG_SIZE));
-	wrote = FileWrite(seg->umfd_vfd, buffer, nbytes, offset,
+	/*
+	 * Ensure the target block exists at BLCKSZ granularity even if we're about
+	 * to write only a sector-sized header.
+	 */
+	umfile_ctx_ensure_block_exists(ctx, forknum, blkno);
+
+	v = umfile_getseg(ctx, ctx->rlocator, forknum, blkno,
+					  true /* skipFsync */,
+					  UM_EXTENSION_FAIL | UM_EXTENSION_CREATE,
+					  false /* isTempRelation */);
+	seekpos = (off_t) BLCKSZ * (blkno % ((BlockNumber) RELSEG_SIZE));
+
+	wrote = FileWrite(v->umfd_vfd, buffer, nbytes, seekpos,
 					  WAIT_EVENT_DATA_FILE_WRITE);
-	if (wrote < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write block %u in file \"%s\": %m",
-						blkno, FilePathName(seg->umfd_vfd))));
 	if (wrote != nbytes)
+	{
+		if (wrote < 0 && errno == ENOSPC)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write file \"%s\": %m",
+							FilePathName(v->umfd_vfd)),
+					 errhint("Check free disk space.")));
+		if (wrote < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write file \"%s\": %m",
+							FilePathName(v->umfd_vfd))));
 		ereport(ERROR,
-				(errcode(ERRCODE_DISK_FULL),
-				 errmsg("could not write block %u in file \"%s\"",
-						blkno, FilePathName(seg->umfd_vfd)),
-				 errdetail("Wrote only %zd of %d bytes.", wrote, nbytes)));
+				(errcode(ERRCODE_DISK_FULL),
+				 errmsg("could not write file \"%s\": wrote only %d of %d bytes at block %u",
+						FilePathName(v->umfd_vfd), wrote, nbytes, blkno)));
+	}

 	/*
-	 * Sync policy is explicit at this layer: callers use
-	 * umfile_registersync()/umfile_immedsync() for durable requests.
+	 * Intentionally do not register dirty here. The MAP layer does that via
+	 * umfile_ctx_register_dirty() so it can control skipFsync consistently.
 	 */
 	(void) skipFsync;
 }
@@ -319,828 +305,2074 @@ void
 umfile_ctx_extend(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blkno,
 				  const char *buffer)
 {
-	BlockNumber	nblocks;
-
 	Assert(ctx != NULL);
 	Assert(buffer != NULL);

-	(void) umfile_open_or_create(ctx, forknum, false, NULL);
-	nblocks = umfile_nblocks(ctx, forknum, UMFILE_NBLOCKS_DENSE);
-	if (blkno != nblocks)
-		ereport(ERROR,
-				(errcode(ERRCODE_DATA_CORRUPTED),
-				 errmsg("cannot extend relation %u/%u/%u fork %d at block %u",
-						ctx->rlocator.locator.spcOid,
-						ctx->rlocator.locator.dbOid,
-						ctx->rlocator.locator.relNumber,
-						forknum, blkno),
-				 errdetail("Expected next block %u.", nblocks)));
+	umfile_ctx_ensure_fork(ctx, forknum);

-	umfile_extend(ctx, forknum, blkno, buffer, true);
+	/* Use the existing extension path but suppress fsync registration. */
+	umfile_extend(ctx, forknum, blkno, buffer, true /* skipFsync */ );
 }

 void
-umfile_ctx_unlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum,
-					  bool isRedo)
+umfile_ctx_prefetch(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blkno)
 {
-	umfile_unlink(rlocator, forknum, isRedo);
+	if (ctx == NULL)
+		return;
+	(void) umfile_prefetch(ctx, forknum, blkno, 1);
 }

 bool
-umfile_metadata_exists(UmbraFileContext *ctx)
+umfile_ctx_block_exists(UmbraFileContext *ctx, ForkNumber forknum,
+						BlockNumber blkno)
 {
-	return umfile_exists(ctx, UMBRA_METADATA_FORKNUM, UMFILE_EXISTS_DENSE);
+	UmfdVec	   *v;
+	BlockNumber	segno;
+	BlockNumber	segblocks;
+
+	if (ctx == NULL)
+		return false;
+
+	segno = blkno / ((BlockNumber) RELSEG_SIZE);
+
+	if (segno < (BlockNumber) ctx->num_open_segs[forknum])
+	{
+		v = umfile_v_get(ctx, forknum, (int) segno);
+		if (!umfile_seg_entry_is_open(v))
+			return false;
+	}
+	else
+	{
+		/*
+		 * Dense forks can only materialize the next segment in order. Sparse
+		 * forks may legitimately skip lower segments.
+		 */
+		if (!umfile_fork_allows_sparse_segments(forknum) &&
+			segno > (BlockNumber) ctx->num_open_segs[forknum])
+			return false;
+
+		v = umfile_openseg(ctx, ctx->rlocator, forknum, segno,
+						   UM_EXTENSION_RETURN_NULL);
+		if (v == NULL)
+			return false;
+	}
+
+	segblocks = umfile_nblocks_in_seg(v->umfd_vfd);
+	return (blkno % ((BlockNumber) RELSEG_SIZE)) < segblocks;
 }

 bool
-umfile_metadata_open_or_create(UmbraFileContext *ctx, bool isRedo, bool *created)
+umfile_ctx_segment_exists(UmbraFileContext *ctx, ForkNumber forknum,
+						  BlockNumber segno)
 {
-	return umfile_open_or_create(ctx, UMBRA_METADATA_FORKNUM, isRedo, created);
-}
+	char		path[MAXPGPATH];

-BlockNumber
-umfile_metadata_nblocks(UmbraFileContext *ctx)
-{
-	return umfile_nblocks(ctx, UMBRA_METADATA_FORKNUM, UMFILE_NBLOCKS_DENSE);
+	if (ctx == NULL)
+		return false;
+
+	umfile_build_segpath(ctx, forknum, segno, path, sizeof(path));
+	return access(path, F_OK) == 0;
 }

 void
-umfile_metadata_read(UmbraFileContext *ctx, BlockNumber blkno, void *buffer)
+umfile_ctx_register_dirty(UmbraFileContext *ctx, ForkNumber forknum,
+						  BlockNumber blkno, bool skipFsync,
+						  bool isTempRelation)
 {
-	void	   *buffers[1];
+	UmfdVec	   *v;

-	buffers[0] = buffer;
-	umfile_readv(ctx, UMBRA_METADATA_FORKNUM, blkno, buffers, 1);
-}
+	if (skipFsync || isTempRelation)
+		return;

-void
-umfile_metadata_write(UmbraFileContext *ctx, BlockNumber blkno, const void *buffer)
-{
-	const void *buffers[1];
+	Assert(ctx != NULL);

-	buffers[0] = buffer;
-	umfile_writev(ctx, UMBRA_METADATA_FORKNUM, blkno, buffers, 1, false);
+	/*
+	 * Ensure we can fall back to immediate fsync if the sync request queue is
+	 * full, mirroring md.c behavior.
+	 */
+	v = umfile_getseg(ctx, ctx->rlocator, forknum, blkno,
+					  false /* skipFsync */,
+					  UM_EXTENSION_FAIL,
+					  isTempRelation);
+	umfile_register_dirty_seg(ctx->rlocator, isTempRelation, forknum, v);
 }

 void
-umfile_metadata_extend(UmbraFileContext *ctx, BlockNumber blkno, const void *buffer)
+umfile_ctx_unlinkfork(RelFileLocatorBackend rlocator, ForkNumber forkNum,
+					  bool isRedo)
 {
-	umfile_extend(ctx, UMBRA_METADATA_FORKNUM, blkno, buffer, false);
+	umfile_unlink(rlocator, forkNum, isRedo);
 }

-void
-umfile_metadata_immedsync(UmbraFileContext *ctx)
+/*
+ * Build a FileTag for Umbra relation segment files.  The MAP fork uses
+ * Umbra-only naming and cannot safely reuse md's unlink callback.
+ */
+#define INIT_UM_FILETAG(tag, rlocator_, forknum_, segno_)	\
+	do {												\
+		memset(&(tag), 0, sizeof(FileTag));				\
+		(tag).handler = SYNC_HANDLER_UMBRA;			\
+		(tag).rlocator = (rlocator_);					\
+		(tag).forknum = (forknum_);						\
+		(tag).segno = (segno_);							\
+	} while (0)
+
+static inline int
+_umfd_open_flags(void)
 {
-	umfile_immedsync(ctx, UMBRA_METADATA_FORKNUM);
-}
+	int			flags = O_RDWR | PG_BINARY;

-void
-umfile_metadata_unlink(RelFileLocatorBackend rlocator, bool isRedo)
-{
-	umfile_unlink(rlocator, UMBRA_METADATA_FORKNUM, isRedo);
+	if (io_direct_flags & IO_DIRECT_DATA)
+		flags |= PG_O_DIRECT;
+
+	return flags;
 }

-bool
-umfile_exists(UmbraFileContext *ctx, ForkNumber forknum, UmFileExistsMode mode)
+static void
+umfile_fdvec_resize(UmbraFileContext *ctx, ForkNumber forknum, int nseg)
 {
-	Assert(ctx != NULL);
-	(void) mode;
+	Assert(nseg >= 0);

-	if (umfile_fork_has_open_segment(ctx, forknum))
+	if (nseg == 0)
 	{
-		if (umfile_fork_has_open_segment_on_disk(ctx, ctx->rlocator, forknum))
-			return true;
+		if (ctx->num_open_segs[forknum] > 0)
+		{
+			pfree(ctx->seg_fds[forknum]);
+			ctx->seg_fds[forknum] = NULL;
+		}
+		ctx->seg_fds[forknum] = NULL;
+		ctx->num_open_segs[forknum] = 0;
+		return;
+	}

-		umfile_close_open_segments(ctx, forknum);
+	if (ctx->num_open_segs[forknum] == 0)
+	{
+		ctx->seg_fds[forknum] =
+			MemoryContextAlloc(UmCxt, sizeof(UmfdVec) * nseg);
+	}
+	else if (nseg > ctx->num_open_segs[forknum])
+	{
+		ctx->seg_fds[forknum] =
+			repalloc(ctx->seg_fds[forknum],
+					 sizeof(UmfdVec) * nseg);
+	}
+	else
+	{
+		/*
+		 * Don't reallocate a smaller array: keep truncate usable in critical
+		 * sections (mirrors md.c behavior).
+		 */
 	}

-	return (umfile_openfork(ctx, ctx->rlocator, forknum,
-							UM_EXTENSION_RETURN_NULL) != NULL);
+	ctx->num_open_segs[forknum] = nseg;
 }

-bool
-umfile_open_or_create(UmbraFileContext *ctx, ForkNumber forknum,
-					  bool isRedo, bool *created)
+static inline UmfdVec *
+umfile_v_get(UmbraFileContext *ctx, ForkNumber forknum, int segindex)
+{
+	Assert(segindex >= 0);
+	Assert(segindex < ctx->num_open_segs[forknum]);
+	return &ctx->seg_fds[forknum][segindex];
+}
+
+static BlockNumber
+umfile_nblocks_in_seg(File vfd)
 {
-	UmfdVec    *seg;
-	bool		was_created;
+	off_t		len;

-	Assert(ctx != NULL);
+	len = FileSize(vfd);
+	if (len < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek to end of file \"%s\": %m",
+						FilePathName(vfd))));

-	if (created != NULL)
-		*created = false;
+	return (BlockNumber) (len / BLCKSZ);
+}

-	seg = umfile_openfork(ctx, ctx->rlocator, forknum,
-						  UM_EXTENSION_RETURN_NULL);
-	if (seg != NULL)
-		return true;
+static RelPathStr
+umfile_segpath(RelFileLocatorBackend rlocator, ForkNumber forknum, BlockNumber segno)
+{
+	RelPathStr	base;
+	RelPathStr	fullpath;

-	was_created = umfile_create(ctx, forknum, isRedo);
-	if (created != NULL)
-		*created = was_created;
+	if (forknum == UMBRA_METADATA_FORKNUM)
+		base = UmMetadataRelPathBackend(rlocator);
+	else
+		base = relpath(rlocator, forknum);

-	return true;
+	if (segno == 0)
+		return base;
+
+	snprintf(fullpath.str, sizeof(fullpath.str), "%s.%u", base.str, segno);
+	return fullpath;
 }

-BlockNumber
-umfile_nblocks(UmbraFileContext *ctx, ForkNumber forknum, UmFileNblocksMode mode)
+static UmfdVec *
+umfile_openseg(UmbraFileContext *ctx, RelFileLocatorBackend rlocator,
+			   ForkNumber forknum, BlockNumber segno, int oflags)
 {
-	UmfdVec    *seg;
-	BlockNumber	segno;
-	BlockNumber	nblocks;
+	UmfdVec    *v;
+	RelPathStr	fullpath;
+	File		fd;
+	int			old_nseg;
+	int			i;

-	Assert(ctx != NULL);
-	(void) mode;
+	fullpath = umfile_segpath(rlocator, forknum, segno);

-	if (umfile_openfork(ctx, ctx->rlocator, forknum,
-						UM_EXTENSION_RETURN_NULL) == NULL)
-		return 0;
+	fd = PathNameOpenFile(fullpath.str, _umfd_open_flags() | oflags);

-	Assert(ctx->num_open_segs[forknum] > 0);
-	segno = ctx->num_open_segs[forknum] - 1;
-	seg = umfile_v_get(ctx, forknum, segno);
+	if (fd < 0)
+		return NULL;

-	for (;;)
+	old_nseg = ctx->num_open_segs[forknum];
+	if (umfile_fork_allows_sparse_segments(forknum))
 	{
-		nblocks = umfile_nblocks_in_seg(seg->umfd_vfd);
-		if (nblocks > (BlockNumber) RELSEG_SIZE)
-			elog(FATAL, "Umbra segment too big");
-		if (nblocks < (BlockNumber) RELSEG_SIZE)
-			return (segno * ((BlockNumber) RELSEG_SIZE)) + nblocks;
-
-		segno++;
-		seg = umfile_openseg(ctx, ctx->rlocator, forknum, segno, 0);
-		if (seg == NULL)
-			return segno * ((BlockNumber) RELSEG_SIZE);
+		if (segno >= (BlockNumber) old_nseg)
+		{
+			umfile_fdvec_resize(ctx, forknum, segno + 1);
+			for (i = old_nseg; i < ctx->num_open_segs[forknum]; i++)
+				umfile_seg_entry_reset(umfile_v_get(ctx, forknum, i));
+		}
+		v = umfile_v_get(ctx, forknum, (int) segno);
+		Assert(!umfile_seg_entry_is_open(v));
+	}
+	else
+	{
+		/*
+		 * Segments are opened in increasing order, so we must be adding a new
+		 * one at the end.
+		 */
+		Assert(segno == (BlockNumber) old_nseg);
+		umfile_fdvec_resize(ctx, forknum, segno + 1);
+		v = umfile_v_get(ctx, forknum, (int) segno);
 	}
-}

-void
-umfile_readv(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
-			 void **buffers, BlockNumber nblocks)
-{
-	for (BlockNumber i = 0; i < nblocks; i++)
-		umfile_ctx_read(ctx, forknum, blocknum + i, buffers[i], BLCKSZ);
+	v->umfd_vfd = fd;
+	v->umfd_segno = segno;
+	Assert(umfile_nblocks_in_seg(v->umfd_vfd) <= (BlockNumber) RELSEG_SIZE);
+	return v;
 }

-void
-umfile_writev(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
-			  const void **buffers, BlockNumber nblocks, bool skipFsync)
+static UmfdVec *
+umfile_openfork(UmbraFileContext *ctx, RelFileLocatorBackend rlocator,
+				ForkNumber forknum, int behavior)
 {
-	for (BlockNumber i = 0; i < nblocks; i++)
-		umfile_ctx_write(ctx, forknum, blocknum + i, buffers[i], BLCKSZ,
-						 skipFsync);
-}
+	RelPathStr	path;
+	File		fd;
+	UmfdVec	   *v;

-void
-umfile_extend(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
-			  const void *buffer, bool skipFsync)
-{
-	UmfdVec    *seg;
-	off_t		offset;
-	ssize_t		wrote;
+	/* No work if already open */
+	if (ctx->num_open_segs[forknum] > 0)
+		return umfile_v_get(ctx, forknum, 0);

-	Assert(ctx != NULL);
-	Assert(buffer != NULL);
+	if (forknum == UMBRA_METADATA_FORKNUM)
+		path = UmMetadataRelPathBackend(rlocator);
+	else
+		path = relpath(rlocator, forknum);
+	fd = PathNameOpenFile(path.str, _umfd_open_flags());

-	seg = umfile_getseg(ctx, ctx->rlocator, forknum, blocknum,
-						skipFsync,
-						UM_EXTENSION_FAIL | UM_EXTENSION_CREATE);
-	offset = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
-	wrote = FileWrite(seg->umfd_vfd, buffer, BLCKSZ, offset,
-					  WAIT_EVENT_DATA_FILE_EXTEND);
-	if (wrote < 0)
+	if (fd < 0)
+	{
+		if ((behavior & UM_EXTENSION_RETURN_NULL) &&
+			FILE_POSSIBLY_DELETED(errno))
+			return NULL;
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not extend file \"%s\": %m",
-						FilePathName(seg->umfd_vfd))));
-	if (wrote != BLCKSZ)
-		ereport(ERROR,
-				(errcode(ERRCODE_DISK_FULL),
-				 errmsg("could not extend file \"%s\" at block %u",
-						FilePathName(seg->umfd_vfd), blocknum),
-				 errdetail("Wrote only %zd of %d bytes.", wrote, BLCKSZ)));
+				 errmsg("could not open file \"%s\": %m", path.str)));
+	}

-	(void) skipFsync;
+	umfile_fdvec_resize(ctx, forknum, 1);
+	v = umfile_v_get(ctx, forknum, 0);
+	v->umfd_vfd = fd;
+	v->umfd_segno = 0;
+
+	Assert(umfile_nblocks_in_seg(v->umfd_vfd) <= (BlockNumber) RELSEG_SIZE);
+
+	return v;
 }

-void
-umfile_zeroextend(UmbraFileContext *ctx, ForkNumber forknum,
-				  BlockNumber blocknum, int nblocks, bool skipFsync)
+static bool
+umfile_fork_allows_sparse_segments(ForkNumber forknum)
 {
-	Assert(ctx != NULL);
-	Assert(nblocks >= 0);
-
-	while (nblocks > 0)
+	switch (forknum)
 	{
-		UmfdVec    *seg;
-		BlockNumber nblocks_this_segment;
-		off_t		offset;
-		int			ret;
-
-		seg = umfile_getseg(ctx, ctx->rlocator, forknum, blocknum,
-							skipFsync,
-							UM_EXTENSION_FAIL | UM_EXTENSION_CREATE);
-		offset = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
-		nblocks_this_segment =
-			Min((BlockNumber) nblocks,
-				((BlockNumber) RELSEG_SIZE) -
-				(blocknum % ((BlockNumber) RELSEG_SIZE)));
-
-		ret = FileZero(seg->umfd_vfd,
-					   offset,
-					   (off_t) BLCKSZ * nblocks_this_segment,
-					   WAIT_EVENT_DATA_FILE_EXTEND);
-		if (ret < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not zero-extend file \"%s\": %m",
-							FilePathName(seg->umfd_vfd))));
-
-		nblocks -= nblocks_this_segment;
-		blocknum += nblocks_this_segment;
+		case MAIN_FORKNUM:
+		case FSM_FORKNUM:
+		case VISIBILITYMAP_FORKNUM:
+			return true;
+		default:
+			return false;
 	}
 }

-void
-umfile_truncate(UmbraFileContext *ctx, ForkNumber forknum,
-				BlockNumber old_blocks, BlockNumber nblocks)
+static bool umfile_collect_existing_segnos_by_path(const char *seg0path,
+												   BlockNumber **segnos_out,
+												   int *nsegnos_out);
+
+static bool
+umfile_sparse_fork_scan_segments(UmbraFileContext *ctx,
+								 ForkNumber forknum,
+								 BlockNumber *minsegno,
+								 BlockNumber *maxsegno)
 {
-	int			curopensegs;
+	char		seg0path[MAXPGPATH];
+	BlockNumber *segnos = NULL;
+	int			nsegnos = 0;

-	Assert(ctx != NULL);
+	Assert(umfile_fork_allows_sparse_segments(forknum));

-	if (nblocks > old_blocks)
-	{
-		if (InRecovery)
-			return;
+	umfile_build_segpath(ctx, forknum, 0, seg0path, sizeof(seg0path));
+	if (!umfile_collect_existing_segnos_by_path(seg0path, &segnos, &nsegnos))
+		return false;
+	if (nsegnos == 0)
+		return false;

-		ereport(ERROR,
-				(errcode(ERRCODE_DATA_CORRUPTED),
-				 errmsg("cannot truncate relation %u/%u/%u fork %d to %u blocks: current size is only %u blocks",
-						ctx->rlocator.locator.spcOid,
-						ctx->rlocator.locator.dbOid,
-						ctx->rlocator.locator.relNumber,
-						forknum,
-						nblocks,
-						old_blocks)));
-	}
+	if (minsegno != NULL)
+		*minsegno = segnos[0];
+	if (maxsegno != NULL)
+		*maxsegno = segnos[nsegnos - 1];
+	pfree(segnos);
+	return true;
+}

-	if (nblocks == old_blocks)
-		return;
+static bool
+umfile_any_segment_exists_by_path(const char *seg0path)
+{
+	char		dirpath[MAXPGPATH];
+	char	   *slash;
+	const char *basename;
+	size_t		baselen;
+	DIR		   *dir;
+	struct dirent *de;
+
+	Assert(seg0path != NULL);
+
+	strlcpy(dirpath, seg0path, sizeof(dirpath));
+	slash = strrchr(dirpath, '/');
+	if (slash == NULL)
+		return false;

-	/*
-	 * Bring all dense segments into the local array first, then trim from the
-	 * tail.  This keeps the truncate contract local to the file manager.
-	 */
-	(void) umfile_nblocks(ctx, forknum, UMFILE_NBLOCKS_DENSE);
-	curopensegs = ctx->num_open_segs[forknum];
+	*slash = '\0';
+	basename = slash + 1;
+	baselen = strlen(basename);

-	while (curopensegs > 0)
+	dir = AllocateDir(dirpath);
+	if (dir == NULL)
 	{
-		UmfdVec    *seg;
-		BlockNumber	priorblocks;
+		if (errno == ENOENT)
+			return false;
+		return false;
+	}

-		priorblocks = (curopensegs - 1) * ((BlockNumber) RELSEG_SIZE);
-		seg = umfile_v_get(ctx, forknum, curopensegs - 1);
+	while ((de = ReadDir(dir, dirpath)) != NULL)
+	{
+		const char *name = de->d_name;

-		if (priorblocks >= nblocks)
+		if (strcmp(name, basename) == 0)
 		{
-			if (FileTruncate(seg->umfd_vfd, 0, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not truncate file \"%s\": %m",
-							FilePathName(seg->umfd_vfd))));
+			FreeDir(dir);
+			return true;
+		}

-			if (seg != umfile_v_get(ctx, forknum, 0))
+		if (strncmp(name, basename, baselen) == 0 &&
+			name[baselen] == '.')
+		{
+			char	   *endptr = NULL;
+			unsigned long parsed;
+
+			errno = 0;
+			parsed = strtoul(name + baselen + 1, &endptr, 10);
+			if (errno == 0 &&
+				endptr != name + baselen + 1 &&
+				*endptr == '\0' &&
+				parsed <= MaxBlockNumber)
 			{
-				FileClose(seg->umfd_vfd);
-				umfile_fdvec_resize(ctx, forknum, curopensegs - 1);
+				FreeDir(dir);
+				return true;
 			}
 		}
-		else if (priorblocks + ((BlockNumber) RELSEG_SIZE) > nblocks)
-		{
-			BlockNumber	lastsegblocks;
+	}

-			lastsegblocks = nblocks - priorblocks;
-			if (FileTruncate(seg->umfd_vfd,
-							 (off_t) lastsegblocks * BLCKSZ,
-							 WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not truncate file \"%s\" to %u blocks: %m",
-							FilePathName(seg->umfd_vfd),
-							nblocks)));
-		}
-		else
-			break;
+	FreeDir(dir);
+	return false;
+}

-		curopensegs--;
-	}
+static inline bool
+umfile_seg_entry_is_open(const UmfdVec *seg)
+{
+	return (seg != NULL && seg->umfd_vfd >= 0);
 }

-void
-umfile_immedsync(UmbraFileContext *ctx, ForkNumber forknum)
+static bool
+umfile_fork_has_open_segment(UmbraFileContext *ctx, ForkNumber forknum)
 {
-	int			segno;
-	int			min_inactive_seg;
+	int			i;

-	Assert(ctx != NULL);
+	for (i = 0; i < ctx->num_open_segs[forknum]; i++)
+	{
+		if (umfile_seg_entry_is_open(umfile_v_get(ctx, forknum, i)))
+			return true;
+	}

-	(void) umfile_nblocks(ctx, forknum, UMFILE_NBLOCKS_DENSE);
-	min_inactive_seg = segno = ctx->num_open_segs[forknum];
+	return false;
+}

-	while (umfile_openseg(ctx, ctx->rlocator, forknum, segno, 0) != NULL)
-		segno++;
+static bool
+umfile_fork_has_open_segment_on_disk(UmbraFileContext *ctx,
+									 RelFileLocatorBackend rlocator,
+									 ForkNumber forknum)
+{
+	int			i;
+	bool		have_live = false;

-	while (segno > 0)
+	for (i = 0; i < ctx->num_open_segs[forknum]; i++)
 	{
-		UmfdVec    *seg = umfile_v_get(ctx, forknum, segno - 1);
+		UmfdVec    *seg = umfile_v_get(ctx, forknum, i);
+		RelPathStr	path;

-		if (FileSync(seg->umfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not fsync file \"%s\": %m",
-							FilePathName(seg->umfd_vfd))));
+		if (!umfile_seg_entry_is_open(seg))
+			continue;

-		if (segno > min_inactive_seg)
+		path = umfile_segpath(rlocator, forknum, seg->umfd_segno);
+		if (access(path.str, F_OK) == 0)
 		{
-			FileClose(seg->umfd_vfd);
-			umfile_fdvec_resize(ctx, forknum, segno - 1);
+			have_live = true;
+			continue;
 		}

-		segno--;
+		FileClose(seg->umfd_vfd);
+		umfile_seg_entry_reset(seg);
 	}
+
+	return have_live;
 }

-void
-umfile_registersync(UmbraFileContext *ctx, ForkNumber forknum)
+static inline void
+umfile_seg_entry_reset(UmfdVec *seg)
 {
-	/*
-	 * Registering durability at this boundary is implemented as an immediate
-	 * fsync.
-	 */
-	umfile_immedsync(ctx, forknum);
+	seg->umfd_vfd = -1;
+	seg->umfd_segno = InvalidBlockNumber;
 }

-void
-umfile_unlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
+static int
+umfile_compare_blocknumbers(const void *a, const void *b)
 {
-	if (forknum == InvalidForkNumber)
+	BlockNumber	va = *(const BlockNumber *) a;
+	BlockNumber	vb = *(const BlockNumber *) b;
+
+	if (va < vb)
+		return -1;
+	if (va > vb)
+		return 1;
+	return 0;
+}
+
+static bool
+umfile_collect_existing_segnos_by_path(const char *seg0path,
+									   BlockNumber **segnos_out,
+									   int *nsegnos_out)
+{
+	char		dirpath[MAXPGPATH];
+	char	   *slash;
+	const char *basename;
+	size_t		baselen;
+	DIR		   *dir;
+	struct dirent *de;
+	BlockNumber *segnos = NULL;
+	int			nsegnos = 0;
+	int			capacity = 0;
+	int			i;
+	int			uniq;
+
+	Assert(seg0path != NULL);
+	Assert(segnos_out != NULL);
+	Assert(nsegnos_out != NULL);
+
+	*segnos_out = NULL;
+	*nsegnos_out = 0;
+
+	strlcpy(dirpath, seg0path, sizeof(dirpath));
+	slash = strrchr(dirpath, '/');
+	if (slash == NULL)
+		return false;
+
+	*slash = '\0';
+	basename = slash + 1;
+	baselen = strlen(basename);
+
+	dir = AllocateDir(dirpath);
+	if (dir == NULL)
 	{
-		for (forknum = 0; forknum <= UMBRA_METADATA_FORKNUM; forknum++)
-			umfile_unlink(rlocator, forknum, isRedo);
-		return;
+		if (errno == ENOENT)
+			return true;
+		return false;
 	}

-	for (BlockNumber segno = 0;; segno++)
+	while ((de = ReadDir(dir, dirpath)) != NULL)
 	{
-		RelPathStr	path;
+		const char *name = de->d_name;
+		BlockNumber	segno;

-		path = umfile_segpath(rlocator, forknum, segno);
-		if (unlink(path.str) < 0)
+		if (strcmp(name, basename) == 0)
+			segno = 0;
+		else if (strncmp(name, basename, baselen) == 0 &&
+				 name[baselen] == '.')
 		{
-			if (FILE_POSSIBLY_DELETED(errno))
-			{
-				if (segno == 0 && isRedo)
-					return;
-				break;
-			}
+			char	   *endptr = NULL;
+			unsigned long parsed;
+
+			errno = 0;
+			parsed = strtoul(name + baselen + 1, &endptr, 10);
+			if (errno != 0 ||
+				endptr == name + baselen + 1 ||
+				*endptr != '\0' ||
+				parsed > MaxBlockNumber)
+				continue;
+			segno = (BlockNumber) parsed;
+		}
+		else
+			continue;

-			ereport(WARNING,
-					(errcode_for_file_access(),
-					 errmsg("could not remove file \"%s\": %m", path.str)));
-			break;
+		if (nsegnos == capacity)
+		{
+			int new_capacity = (capacity == 0) ? 16 : capacity * 2;
+
+			if (segnos == NULL)
+				segnos = (BlockNumber *) MemoryContextAlloc(UmCxt,
+															sizeof(BlockNumber) * new_capacity);
+			else
+				segnos = (BlockNumber *) repalloc(segnos,
+												  sizeof(BlockNumber) * new_capacity);
+			capacity = new_capacity;
 		}
+		segnos[nsegnos++] = segno;
 	}
-}

-static void
-umfile_ctx_registry_init(void)
-{
-	if (UmFileContextHash == NULL)
-		umfile_init();
-
-	Assert(UmFileContextHash != NULL);
-}
-
-static UmbraFileContext *
-umfile_ctx_create(RelFileLocatorBackend rlocator)
-{
-	UmbraFileContext *ctx;
+	FreeDir(dir);

-	Assert(UmFileCxt != NULL);
+	if (nsegnos == 0)
+	{
+		if (segnos != NULL)
+			pfree(segnos);
+		return true;
+	}

-	ctx = MemoryContextAllocZero(UmFileCxt, sizeof(UmbraFileContext));
-	ctx->rlocator = rlocator;
+	qsort(segnos, nsegnos, sizeof(BlockNumber), umfile_compare_blocknumbers);

-	for (ForkNumber forknum = 0; forknum <= UMBRA_METADATA_FORKNUM; forknum++)
+	uniq = 1;
+	for (i = 1; i < nsegnos; i++)
 	{
-		ctx->num_open_segs[forknum] = 0;
-		ctx->seg_fds[forknum] = NULL;
+		if (segnos[i] != segnos[uniq - 1])
+			segnos[uniq++] = segnos[i];
 	}

-	return ctx;
+	*segnos_out = segnos;
+	*nsegnos_out = uniq;
+	return true;
 }

+/*
+ * umfile_build_segpath() -- Build segment path in caller-provided buffer.
+ *
+ * This is a no-allocation path builder so callers can use it safely in
+ * critical sections.
+ */
 static void
-umfile_ctx_destroy(UmbraFileContext *ctx)
+umfile_build_segpath(UmbraFileContext *ctx, ForkNumber forknum,
+					 BlockNumber segno, char *path, size_t pathlen)
 {
-	if (ctx == NULL)
-		return;
-
-	for (ForkNumber forknum = 0; forknum <= UMBRA_METADATA_FORKNUM; forknum++)
-		umfile_close_open_segments(ctx, forknum);
-
-	pfree(ctx);
-}
+	int			n;
+	RelFileLocatorBackend rlocator;

-static void
-umfile_close_open_segments(UmbraFileContext *ctx, ForkNumber forknum)
-{
-	int			nopensegs;
+	Assert(forknum >= 0 && forknum <= UMBRA_METADATA_FORKNUM);

-	Assert(ctx != NULL);
+	/* Build RelFileLocatorBackend for use with relpath */
+	rlocator.locator = ctx->rlocator.locator;
+	rlocator.backend = ctx->rlocator.backend;

-	nopensegs = ctx->num_open_segs[forknum];
-	while (nopensegs > 0)
+	/* Build the base path using public forks or Umbra private metadata. */
 	{
-		UmfdVec    *seg = umfile_v_get(ctx, forknum, nopensegs - 1);
+		RelPathStr relpath_str;

-		if (umfile_seg_entry_is_open(seg))
-			FileClose(seg->umfd_vfd);
-		umfile_fdvec_resize(ctx, forknum, nopensegs - 1);
-		nopensegs--;
+		if (forknum == UMBRA_METADATA_FORKNUM)
+			relpath_str = UmMetadataRelPathBackend(rlocator);
+		else
+			relpath_str = relpath(rlocator, forknum);
+		n = strlcpy(path, relpath_str.str, pathlen);
 	}
+
+	if (segno == 0)
+		return;
+
+	Assert(segno <= MaxBlockNumber / ((BlockNumber) RELSEG_SIZE));
+	snprintf(path + n, pathlen - n, ".%u", segno);
 }

-static bool
-umfile_create(UmbraFileContext *ctx, ForkNumber forknum, bool isRedo)
+static UmfdVec *
+umfile_getseg(UmbraFileContext *ctx, RelFileLocatorBackend rlocator,
+			  ForkNumber forknum, BlockNumber blkno,
+			  bool skipFsync, int behavior, bool isTempRelation)
 {
-	RelPathStr	path;
-	File		fd;
-	UmfdVec    *seg;
-	bool		created = false;
+	UmfdVec    *v;
+	BlockNumber targetseg;
+	BlockNumber nextsegno;

-	Assert(ctx != NULL);
+	Assert(behavior &
+		   (UM_EXTENSION_FAIL | UM_EXTENSION_CREATE | UM_EXTENSION_RETURN_NULL |
+			UM_EXTENSION_DONT_OPEN));

-	if (isRedo && ctx->num_open_segs[forknum] > 0)
-		return false;
+	targetseg = blkno / ((BlockNumber) RELSEG_SIZE);

-	if (ctx->num_open_segs[forknum] > 0)
-		umfile_close_open_segments(ctx, forknum);
+	/* if an existing and opened segment, we're done */
+	if (targetseg < (BlockNumber) ctx->num_open_segs[forknum])
+	{
+		v = umfile_v_get(ctx, forknum, (int) targetseg);
+		if (!umfile_fork_allows_sparse_segments(forknum) ||
+			umfile_seg_entry_is_open(v))
+			return v;
+	}

-	TablespaceCreateDbspace(ctx->rlocator.locator.spcOid,
-							ctx->rlocator.locator.dbOid,
-							isRedo);
+	/* The caller only wants the segment if we already had it open. */
+	if (behavior & UM_EXTENSION_DONT_OPEN)
+		return NULL;

-	path = umfile_segpath(ctx->rlocator, forknum, 0);
-	fd = PathNameOpenFile(path.str, umfile_open_flags() | O_CREAT | O_EXCL);
-	if (fd < 0)
+	/*
+	 * Mapped data forks can use sparse physical segment numbering. Open/create
+	 * the target segment directly without checking continuity of previous
+	 * segments.
+	 */
+	if (umfile_fork_allows_sparse_segments(forknum))
 	{
-		int			save_errno = errno;
+		int flags = 0;

-		if (isRedo)
-			fd = PathNameOpenFile(path.str, umfile_open_flags());
-		if (fd < 0)
+		if ((behavior & UM_EXTENSION_CREATE) ||
+			(InRecovery && (behavior & UM_EXTENSION_CREATE_RECOVERY)))
+			flags = O_CREAT;
+
+		v = umfile_openseg(ctx, rlocator, forknum, targetseg, flags);
+		if (v == NULL)
 		{
-			errno = save_errno;
+			if ((behavior & UM_EXTENSION_RETURN_NULL) &&
+				FILE_POSSIBLY_DELETED(errno))
+				return NULL;
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not create file \"%s\": %m", path.str)));
+					 errmsg("could not open file \"%s\" (target block %u): %m",
+							umfile_segpath(rlocator, forknum, targetseg).str,
+							blkno)));
 		}
+		return v;
 	}
+
+	/*
+	 * The target segment is not yet open. Iterate over all the segments between
+	 * the last opened and the target segment.
+	 */
+	if (ctx->num_open_segs[forknum] > 0)
+		v = umfile_v_get(ctx, forknum, ctx->num_open_segs[forknum] - 1);
 	else
-		created = true;
+	{
+		v = umfile_openfork(ctx, rlocator, forknum, behavior);
+		if (!v)
+			return NULL;
+	}

-	umfile_fdvec_resize(ctx, forknum, 1);
-	seg = umfile_v_get(ctx, forknum, 0);
-	seg->umfd_vfd = fd;
-	seg->umfd_segno = 0;
+	for (nextsegno = ctx->num_open_segs[forknum];
+		 nextsegno <= targetseg;
+		 nextsegno++)
+	{
+		BlockNumber	nblocks = umfile_nblocks_in_seg(v->umfd_vfd);
+		int			flags = 0;

-	return created;
-}
+		Assert(nextsegno == v->umfd_segno + 1);

-static int
-umfile_open_flags(void)
-{
-	int			flags = O_RDWR | PG_BINARY;
+		if (nblocks > ((BlockNumber) RELSEG_SIZE))
+			elog(FATAL, "segment too big");

-	if (io_direct_flags & IO_DIRECT_DATA)
-		flags |= PG_O_DIRECT;
+		if ((behavior & UM_EXTENSION_CREATE) ||
+			(InRecovery && (behavior & UM_EXTENSION_CREATE_RECOVERY)))
+		{
+			/*
+			 * Maintain the invariant that segments before the last active
+			 * segment are exactly RELSEG_SIZE blocks. Pad with zeros if needed.
+			 * This can happen e.g. in recovery or for discontiguous extension.
+			 */
+			if (nblocks < ((BlockNumber) RELSEG_SIZE))
+			{
+				char	   *zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE,
+													 MCXT_ALLOC_ZERO);

-	return flags;
+				umfile_extend(ctx, forknum,
+							  nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
+							  zerobuf, skipFsync);
+				pfree(zerobuf);
+			}
+			flags = O_CREAT;
+		}
+		else if (nblocks < ((BlockNumber) RELSEG_SIZE))
+		{
+			if (behavior & UM_EXTENSION_RETURN_NULL)
+			{
+				errno = ENOENT;
+				return NULL;
+			}
+
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\" (target block %u): previous segment is only %u blocks",
+							umfile_segpath(rlocator, forknum, nextsegno).str,
+							blkno, nblocks)));
+		}
+
+		v = umfile_openseg(ctx, rlocator, forknum, nextsegno, flags);
+		if (v == NULL)
+		{
+			if ((behavior & UM_EXTENSION_RETURN_NULL) &&
+				FILE_POSSIBLY_DELETED(errno))
+				return NULL;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\" (target block %u): %m",
+							umfile_segpath(rlocator, forknum, nextsegno).str,
+							blkno)));
+		}
+	}
+
+	return v;
 }

 static void
-umfile_fdvec_resize(UmbraFileContext *ctx, ForkNumber forknum, int nseg)
+umfile_register_dirty_seg(RelFileLocatorBackend rlocator, bool isTempRelation,
+						  ForkNumber forknum, UmfdVec *seg)
 {
-	Assert(ctx != NULL);
-	Assert(nseg >= 0);
+	FileTag		tag;

-	if (nseg == 0)
+	if (!RelFileNumberIsValid(rlocator.locator.relNumber) ||
+		!OidIsValid(rlocator.locator.spcOid))
+		elog(PANIC,
+			 "invalid Umbra relation locator in fsync registration %u/%u/%u fork=%d seg=%u",
+			 rlocator.locator.spcOid,
+			 rlocator.locator.dbOid,
+			 rlocator.locator.relNumber,
+			 forknum,
+			 seg->umfd_segno);
+
+	INIT_UM_FILETAG(tag, rlocator.locator, forknum, seg->umfd_segno);
+
+	/* Temp relations should never be fsync'd */
+	Assert(!isTempRelation);
+
+	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
+	{
+		instr_time	io_start;
+
+		ereport(DEBUG1,
+				(errmsg_internal("could not forward fsync request because request queue is full")));
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+
+		if (FileSync(seg->umfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+			ereport(data_sync_elevel(ERROR),
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(seg->umfd_vfd))));
+
+		pgstat_count_io_op_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+								IOOP_FSYNC, io_start, 1, 0);
+	}
+}
+
+static void
+umfile_register_unlink_seg(RelFileLocatorBackend rlocator, ForkNumber forknum,
+						   BlockNumber segno)
+{
+	FileTag		tag;
+
+	INIT_UM_FILETAG(tag, rlocator.locator, forknum, segno);
+	Assert(!RelFileLocatorBackendIsTemp(rlocator));
+
+	RegisterSyncRequest(&tag, SYNC_UNLINK_REQUEST, true /* retryOnError */ );
+}
+
+static void
+umfile_register_forget_seg(RelFileLocatorBackend rlocator, ForkNumber forknum,
+						   BlockNumber segno)
+{
+	FileTag		tag;
+
+	INIT_UM_FILETAG(tag, rlocator.locator, forknum, segno);
+	RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
+}
+
+static void
+umfile_register_dense_existing_segs_for_unlink(RelFileLocatorBackend rlocator,
+											   ForkNumber forknum,
+											   const char *seg0path)
+{
+	char		segpath[MAXPGPATH];
+	BlockNumber segno;
+
+	Assert(seg0path != NULL);
+
+	for (segno = 0;; segno++)
+	{
+		if (segno == 0)
+			strlcpy(segpath, seg0path, sizeof(segpath));
+		else
+			snprintf(segpath, sizeof(segpath), "%s.%u", seg0path, segno);
+
+		if (!pg_file_exists(segpath))
+			break;
+
+		umfile_register_unlink_seg(rlocator, forknum, segno);
+	}
+}
+
+void
+umfile_init(void)
+{
+	HASHCTL info;
+
+	if (UmCxt != NULL)
+		return;
+	UmCxt = AllocSetContextCreate(TopMemoryContext,
+								  "UmFile",
+								  ALLOCSET_DEFAULT_SIZES);
+	/*
+	 * smgr callbacks (including truncate during WAL replay) can run inside a
+	 * critical section. Umbra's per-relation file context is used by openfork
+	 * and fdvec management, so it must be permitted to allocate there.
+	 *
+	 * This matches the expectation in core smgr implementations that their
+	 * internal contexts can allocate while in a critical section.
+	 */
+	MemoryContextAllowInCriticalSection(UmCxt, true);
+
+	MemSet(&info, 0, sizeof(info));
+	info.keysize = sizeof(RelFileLocatorBackend);
+	info.entrysize = sizeof(UmCtxRegistryEntry);
+	info.hcxt = UmCxt;
+	UmCtxRegistry = hash_create("Umbra file context registry",
+								256,
+								&info,
+								HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+}
+
+static void
+umfile_ctx_registry_init(void)
+{
+	if (UmCxt == NULL)
+		umfile_init();
+
+	Assert(UmCtxRegistry != NULL);
+}
+
+static UmbraFileContext *
+umfile_ctx_create(RelFileLocatorBackend rlocator)
+{
+	UmbraFileContext *ctx;
+
+	ctx = MemoryContextAllocZero(UmCxt, sizeof(UmbraFileContext));
+	ctx->rlocator = rlocator;
+
+	for (int forknum = 0; forknum <= UMBRA_METADATA_FORKNUM; forknum++)
 	{
-		if (ctx->num_open_segs[forknum] > 0)
-			pfree(ctx->seg_fds[forknum]);
-		ctx->seg_fds[forknum] = NULL;
 		ctx->num_open_segs[forknum] = 0;
+		ctx->seg_fds[forknum] = NULL;
+	}
+
+	return ctx;
+}
+
+static void
+umfile_ctx_destroy_internal(UmbraFileContext *ctx)
+{
+	if (ctx == NULL)
+		return;
+
+	for (int forknum = 0; forknum <= UMBRA_METADATA_FORKNUM; forknum++)
+	{
+		while (ctx->num_open_segs[forknum] > 0)
+		{
+			UmfdVec *seg = umfile_v_get(ctx, forknum,
+										ctx->num_open_segs[forknum] - 1);
+
+			if (umfile_seg_entry_is_open(seg))
+				FileClose(seg->umfd_vfd);
+			umfile_fdvec_resize(ctx, forknum, ctx->num_open_segs[forknum] - 1);
+		}
+	}
+
+	pfree(ctx);
+}
+
+void
+umfile_create(UmbraFileContext *ctx, ForkNumber forknum, bool isRedo)
+{
+	RelFileLocatorBackend rlocator;
+	bool		isTempRelation;
+	RelPathStr	path;
+	File		fd;
+	UmfdVec	   *v;
+
+	Assert(ctx != NULL);
+	rlocator = ctx->rlocator;
+	isTempRelation = RelFileLocatorBackendIsTemp(rlocator);
+
+	if (isRedo && ctx->num_open_segs[forknum] > 0)
 		return;
+
+	/* Close any segments still open for this fork before creating it anew. */
+		while (ctx->num_open_segs[forknum] > 0)
+		{
+			UmfdVec *seg = umfile_v_get(ctx, forknum,
+										ctx->num_open_segs[forknum] - 1);
+
+			if (umfile_seg_entry_is_open(seg))
+				FileClose(seg->umfd_vfd);
+			umfile_fdvec_resize(ctx, forknum, ctx->num_open_segs[forknum] - 1);
+		}
+
+	Assert(ctx->num_open_segs[forknum] == 0);
+
+	TablespaceCreateDbspace(rlocator.locator.spcOid,
+							rlocator.locator.dbOid,
+							isRedo);
+
+	if (forknum == UMBRA_METADATA_FORKNUM)
+		path = UmMetadataRelPathBackend(rlocator);
+	else
+		path = relpath(rlocator, forknum);
+
+	fd = PathNameOpenFile(path.str, _umfd_open_flags() | O_CREAT | O_EXCL);
+	if (fd < 0)
+	{
+		int			save_errno = errno;
+
+		if (isRedo)
+			fd = PathNameOpenFile(path.str, _umfd_open_flags());
+		if (fd < 0)
+		{
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create file \"%s\": %m", path.str)));
+		}
 	}

-	if (ctx->num_open_segs[forknum] == 0)
+	umfile_fdvec_resize(ctx, forknum, 1);
+	v = umfile_v_get(ctx, forknum, 0);
+	v->umfd_vfd = fd;
+	v->umfd_segno = 0;
+
+	if (!isTempRelation)
+		umfile_register_dirty_seg(rlocator, false, forknum, v);
+}
+
+void
+umfile_ctx_close_fork(UmbraFileContext *ctx, ForkNumber forknum)
+{
+	int			nopensegs;
+
+	if (ctx == NULL)
+		return;
+
+	nopensegs = ctx->num_open_segs[forknum];
+	if (nopensegs == 0)
+		return;
+
+	while (nopensegs > 0)
 	{
-		ctx->seg_fds[forknum] =
-			MemoryContextAlloc(UmFileCxt, sizeof(UmfdVec) * nseg);
+		UmfdVec    *v = umfile_v_get(ctx, forknum, nopensegs - 1);
+
+		if (umfile_seg_entry_is_open(v))
+			FileClose(v->umfd_vfd);
+		umfile_fdvec_resize(ctx, forknum, nopensegs - 1);
+		nopensegs--;
 	}
-	else if (nseg > ctx->num_open_segs[forknum])
+}
+
+bool
+umfile_exists(UmbraFileContext *ctx, ForkNumber forknum, UmFileExistsMode mode)
+{
+	RelFileLocatorBackend rlocator;
+
+	Assert(ctx != NULL);
+	rlocator = ctx->rlocator;
+
+	if (!InRecovery &&
+		umfile_fork_has_open_segment(ctx, forknum))
 	{
-		ctx->seg_fds[forknum] =
-			repalloc(ctx->seg_fds[forknum], sizeof(UmfdVec) * nseg);
+		/*
+		 * Any still-open segment whose path still exists is enough evidence
+		 * that the fork exists. If all open fds are stale after an unlink or
+		 * rewrite, drop them before falling back to slower on-disk probes.
+		 */
+		if (umfile_fork_has_open_segment_on_disk(ctx, rlocator, forknum))
+			return true;
+
+		while (ctx->num_open_segs[forknum] > 0)
+		{
+			UmfdVec *v = umfile_v_get(ctx, forknum, ctx->num_open_segs[forknum] - 1);
+
+			if (umfile_seg_entry_is_open(v))
+				FileClose(v->umfd_vfd);
+			umfile_fdvec_resize(ctx, forknum, ctx->num_open_segs[forknum] - 1);
+		}
 	}

-	ctx->num_open_segs[forknum] = nseg;
+	if (mode == UMFILE_EXISTS_SPARSE)
+	{
+		/*
+		 * Most sparse forks still keep seg0 around. Probe that first so the
+		 * common case stays as cheap as the legacy exists() path. Only fall
+		 * back to a directory scan when seg0 is absent and the fork may still
+		 * exist solely via higher sparse segments.
+		 */
+		if (umfile_openfork(ctx, rlocator, forknum, UM_EXTENSION_RETURN_NULL) != NULL)
+			return true;
+
+		return umfile_any_segment_exists_by_path(
+			umfile_segpath(rlocator, forknum, 0).str);
+	}
+
+	return (umfile_openfork(ctx, rlocator, forknum, UM_EXTENSION_RETURN_NULL) != NULL);
 }

-static inline UmfdVec *
-umfile_v_get(UmbraFileContext *ctx, ForkNumber forknum, int segindex)
+/*
+ * umfile_open_or_create() -- open existing fork or create new one.
+ *
+ * During redo, try to reuse an existing file. On the normal path, always
+ * create with O_EXCL to avoid binding to stale on-disk contents. Returns true
+ * on success and, if created is non-NULL, sets *created when a file was made.
+ */
+bool
+umfile_open_or_create(UmbraFileContext *ctx, ForkNumber forknum,
+					  bool isRedo, bool *created)
+{
+	UmfdVec    *v;
+
+	if (created)
+		*created = false;
+
+	/*
+	 * Redo can legitimately see pre-existing files and should reuse them.
+	 */
+	if (isRedo)
+	{
+		v = umfile_openfork(ctx, ctx->rlocator, forknum,
+							UM_EXTENSION_RETURN_NULL);
+		if (v != NULL)
+			return true;
+	}
+
+	/* Create new file (with O_EXCL for normal path) */
+	umfile_create(ctx, forknum, isRedo);
+
+	/* Verify creation succeeded */
+	v = umfile_openfork(ctx, ctx->rlocator, forknum,
+						UM_EXTENSION_RETURN_NULL);
+	if (v != NULL)
+	{
+		if (created)
+			*created = true;
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Unlink logic mirrors mdunlink(), but uses Umbra segment tracking.
+ */
+void
+umfile_unlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
+{
+	RelPathStr	path;
+	int			ret;
+	int			save_errno;
+
+	if (forknum == InvalidForkNumber)
+	{
+		for (forknum = 0; forknum <= UMBRA_METADATA_FORKNUM; forknum++)
+			umfile_unlink(rlocator, forknum, isRedo);
+		return;
+	}
+
+	if (forknum == UMBRA_METADATA_FORKNUM)
+		path = UmMetadataRelPathBackend(rlocator);
+	else
+		path = relpath(rlocator, forknum);
+
+	/*
+	 * Keep all MAP segments physically intact until checkpoint-time unlink, so
+	 * remap-related lookup state is preserved throughout the checkpoint window.
+	 */
+	if (!isRedo &&
+		!RelFileLocatorBackendIsTemp(rlocator) &&
+		forknum == UMBRA_METADATA_FORKNUM)
+	{
+		umfile_register_dense_existing_segs_for_unlink(rlocator, forknum,
+														   path.str);
+		return;
+	}
+
+	if (isRedo || IsBinaryUpgrade || forknum != MAIN_FORKNUM ||
+		RelFileLocatorBackendIsTemp(rlocator))
+	{
+		if (!RelFileLocatorBackendIsTemp(rlocator))
+		{
+			ret = pg_truncate(path.str, 0);
+			if (ret < 0 && errno != ENOENT)
+			{
+				save_errno = errno;
+				ereport(WARNING,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m", path.str)));
+				errno = save_errno;
+			}
+
+			save_errno = errno;
+			umfile_register_forget_seg(rlocator, forknum, 0 /* first seg */ );
+			errno = save_errno;
+		}
+		else
+			ret = 0;
+
+		if (ret >= 0 || errno != ENOENT)
+		{
+			ret = unlink(path.str);
+			if (ret < 0 && errno != ENOENT)
+			{
+				save_errno = errno;
+				ereport(WARNING,
+						(errcode_for_file_access(),
+						 errmsg("could not remove file \"%s\": %m", path.str)));
+				errno = save_errno;
+			}
+		}
+	}
+	else
+	{
+		ret = pg_truncate(path.str, 0);
+		if (ret < 0 && errno != ENOENT)
+		{
+			save_errno = errno;
+			ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path.str)));
+			errno = save_errno;
+		}
+
+		save_errno = errno;
+		umfile_register_unlink_seg(rlocator, forknum, 0 /* first seg */ );
+		errno = save_errno;
+	}
+
+	/* Remove additional segments. */
+	if (ret >= 0 || errno != ENOENT)
+	{
+		char		segpath[MAXPGPATH];
+		BlockNumber segno;
+
+		for (segno = 1;; segno++)
+		{
+			snprintf(segpath, sizeof(segpath), "%s.%u", path.str, segno);
+
+			if (!RelFileLocatorBackendIsTemp(rlocator))
+			{
+				ret = pg_truncate(segpath, 0);
+				save_errno = errno;
+				umfile_register_forget_seg(rlocator, forknum, segno);
+				errno = save_errno;
+			}
+			else
+				ret = 0;
+
+			if (ret < 0 && errno != ENOENT)
+				break;
+
+			ret = unlink(segpath);
+			if (ret < 0)
+			{
+				if (errno != ENOENT)
+				{
+					save_errno = errno;
+					ereport(WARNING,
+							(errcode_for_file_access(),
+							 errmsg("could not remove file \"%s\": %m", segpath)));
+					errno = save_errno;
+				}
+				break;
+			}
+		}
+	}
+}
+
+void
+umfile_extend(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
+			  const void *buffer, bool skipFsync)
 {
+	RelFileLocatorBackend rlocator;
+	bool		isTempRelation;
+	UmfdVec    *v;
+	off_t		seekpos;
+	int			nbytes;
+
 	Assert(ctx != NULL);
-	Assert(segindex >= 0);
-	Assert(segindex < ctx->num_open_segs[forknum]);
-	return &ctx->seg_fds[forknum][segindex];
+	rlocator = ctx->rlocator;
+	isTempRelation = RelFileLocatorBackendIsTemp(rlocator);
+
+	v = umfile_getseg(ctx, rlocator, forknum, blocknum, skipFsync,
+					  UM_EXTENSION_FAIL | UM_EXTENSION_CREATE,
+					  isTempRelation);
+
+	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+	TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum,
+										 rlocator.locator.spcOid,
+										 rlocator.locator.dbOid,
+										 rlocator.locator.relNumber,
+										 rlocator.backend);
+
+	nbytes = FileWrite(v->umfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND);
+
+	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
+										rlocator.locator.spcOid,
+										rlocator.locator.dbOid,
+										rlocator.locator.relNumber,
+										rlocator.backend,
+										nbytes,
+										BLCKSZ);
+
+	if (nbytes != BLCKSZ)
+	{
+		if (nbytes < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not extend file \"%s\": %m",
+							FilePathName(v->umfd_vfd)),
+					 errhint("Check free disk space.")));
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_DISK_FULL),
+					 errmsg("could not extend file \"%s\": wrote only %d of %d bytes at block %u",
+							FilePathName(v->umfd_vfd), nbytes, BLCKSZ, blocknum)));
+	}
+
+	if (!skipFsync && !isTempRelation)
+		umfile_register_dirty_seg(rlocator, false, forknum, v);
+}
+
+void
+umfile_zeroextend(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
+				  int nblocks, bool skipFsync)
+{
+	RelFileLocatorBackend rlocator;
+	bool		isTempRelation;
+
+	Assert(ctx != NULL);
+	rlocator = ctx->rlocator;
+	isTempRelation = RelFileLocatorBackendIsTemp(rlocator);
+
+	while (nblocks > 0)
+	{
+		int			numblocks;
+		off_t		seekpos;
+		UmfdVec    *v;
+		int			ret;
+		int			remblocks;
+		BlockNumber curblocknum;
+
+		curblocknum = blocknum;
+		remblocks = nblocks;
+
+		numblocks = Min(remblocks, RELSEG_SIZE - (curblocknum % RELSEG_SIZE));
+		numblocks = Min(numblocks, PG_IOV_MAX);
+
+		v = umfile_getseg(ctx, rlocator, forknum, curblocknum, skipFsync,
+						  UM_EXTENSION_FAIL | UM_EXTENSION_CREATE,
+						  isTempRelation);
+
+		seekpos = (off_t) BLCKSZ * (curblocknum % ((BlockNumber) RELSEG_SIZE));
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+		ret = FileZero(v->umfd_vfd,
+					   seekpos,
+					   (off_t) BLCKSZ * numblocks,
+					   WAIT_EVENT_DATA_FILE_EXTEND);
+		if (ret < 0)
+		{
+			bool		enospc = (errno == ENOSPC);
+
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not extend file \"%s\": %m",
+							FilePathName(v->umfd_vfd)),
+					 enospc ? errhint("Check free disk space.") : 0));
+		}
+
+		if (!skipFsync && !isTempRelation)
+			umfile_register_dirty_seg(rlocator, false, forknum, v);
+
+		nblocks -= numblocks;
+		blocknum += numblocks;
+	}
+
+}
+
+bool
+umfile_prefetch(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
+				int nblocks)
+{
+	RelFileLocatorBackend rlocator;
+	bool		isTempRelation;
+
+	Assert(ctx != NULL);
+	rlocator = ctx->rlocator;
+	isTempRelation = RelFileLocatorBackendIsTemp(rlocator);
+
+#ifdef USE_PREFETCH
+	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
+
+	if ((uint64) blocknum + nblocks > (uint64) MaxBlockNumber + 1)
+		return false;
+
+	while (nblocks > 0)
+	{
+		off_t		seekpos;
+		UmfdVec    *v;
+		int			nblocks_this_segment;
+
+		v = umfile_getseg(ctx, rlocator, forknum, blocknum,
+						  false /* skipFsync */ ,
+						  InRecovery ? UM_EXTENSION_RETURN_NULL : UM_EXTENSION_FAIL,
+						  isTempRelation);
+		if (v == NULL)
+			return false;
+
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+		nblocks_this_segment =
+			Min(nblocks, RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+		(void) FilePrefetch(v->umfd_vfd, seekpos, BLCKSZ * nblocks_this_segment,
+							WAIT_EVENT_DATA_FILE_PREFETCH);
+
+		blocknum += nblocks_this_segment;
+		nblocks -= nblocks_this_segment;
+	}
+#endif
+	return true;
+}
+
+uint32
+umfile_maxcombine(ForkNumber forknum, BlockNumber blocknum)
+{
+	uint32		maxblocks;
+
+	(void) forknum;
+	maxblocks = RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE));
+	return maxblocks;
+}
+
+static int
+umfile_buffers_to_iovec(struct iovec *iov, void **buffers, int nblocks)
+{
+	struct iovec *iovp;
+	int			iovcnt;
+
+	Assert(nblocks >= 1);
+
+	/* If this build supports direct I/O, buffers must be I/O aligned. */
+	for (int i = 0; i < nblocks; ++i)
+	{
+		if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
+			Assert((uintptr_t) buffers[i] ==
+				   TYPEALIGN(PG_IO_ALIGN_SIZE, buffers[i]));
+	}
+
+	/* Start the first iovec off with the first buffer. */
+	iovp = &iov[0];
+	iovp->iov_base = buffers[0];
+	iovp->iov_len = BLCKSZ;
+	iovcnt = 1;
+
+	/* Try to merge the rest. */
+	for (int i = 1; i < nblocks; ++i)
+	{
+		void	   *buffer = buffers[i];
+
+		if (((char *) iovp->iov_base + iovp->iov_len) == buffer)
+		{
+			iovp->iov_len += BLCKSZ;
+		}
+		else
+		{
+			iovp++;
+			iovp->iov_base = buffer;
+			iovp->iov_len = BLCKSZ;
+			iovcnt++;
+		}
+	}
+
+	return iovcnt;
+}
+
+void
+umfile_readv(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
+			 void **buffers, BlockNumber nblocks)
+{
+	while (nblocks > 0)
+	{
+		struct iovec iov[PG_IOV_MAX];
+		int			iovcnt;
+		off_t		seekpos;
+		int			nbytes;
+		UmfdVec    *v;
+		BlockNumber nblocks_this_segment;
+		size_t		transferred_this_segment;
+		size_t		size_this_segment;
+
+		v = umfile_getseg(ctx, ctx->rlocator,
+						  forknum, blocknum, false /* skipFsync */,
+						  UM_EXTENSION_FAIL | UM_EXTENSION_CREATE_RECOVERY,
+						  RelFileLocatorBackendIsTemp(ctx->rlocator));
+
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+		nblocks_this_segment =
+			Min(nblocks,
+				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+		nblocks_this_segment = Min(nblocks_this_segment, (BlockNumber) lengthof(iov));
+
+		if (nblocks_this_segment != nblocks)
+			elog(ERROR, "read crosses segment boundary");
+
+		iovcnt = umfile_buffers_to_iovec(iov, buffers, (int) nblocks_this_segment);
+		size_this_segment = nblocks_this_segment * BLCKSZ;
+		transferred_this_segment = 0;
+
+		/*
+		 * Inner loop to continue after a short read.  We'll keep going until
+		 * we hit EOF rather than assuming that a short read means we hit the
+		 * end.
+		 */
+		for (;;)
+		{
+			TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
+												ctx->rlocator.locator.spcOid,
+												ctx->rlocator.locator.dbOid,
+												ctx->rlocator.locator.relNumber,
+												ctx->rlocator.backend);
+
+			nbytes = FileReadV(v->umfd_vfd, iov, iovcnt, seekpos,
+							   WAIT_EVENT_DATA_FILE_READ);
+
+			TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
+											   ctx->rlocator.locator.spcOid,
+											   ctx->rlocator.locator.dbOid,
+											   ctx->rlocator.locator.relNumber,
+											   ctx->rlocator.backend,
+											   nbytes,
+											   size_this_segment - transferred_this_segment);
+
+			if (nbytes < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not read blocks %u..%u in file \"%s\": %m",
+								blocknum,
+								blocknum + nblocks_this_segment - 1,
+								FilePathName(v->umfd_vfd))));
+
+			if (nbytes == 0)
+			{
+				/*
+				 * Mirror mdreadv() behavior: in production builds we can
+				 * zero-fill if zero_damaged_pages or in recovery, but this
+				 * codepath is expected to be unreachable for normal reads.
+				 */
+				if (zero_damaged_pages || InRecovery)
+				{
+					Assert(false);	/* see md.c commentary */
+
+					for (BlockNumber i = transferred_this_segment / BLCKSZ;
+						 i < nblocks_this_segment;
+						 ++i)
+						memset(buffers[i], 0, BLCKSZ);
+					break;
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("could not read blocks %u..%u in file \"%s\": read only %zu of %zu bytes",
+									blocknum,
+									blocknum + nblocks_this_segment - 1,
+									FilePathName(v->umfd_vfd),
+									transferred_this_segment,
+									size_this_segment)));
+			}
+
+			/* One loop should usually be enough. */
+			transferred_this_segment += nbytes;
+			Assert(transferred_this_segment <= size_this_segment);
+			if (transferred_this_segment == size_this_segment)
+				break;
+
+			/* Adjust position and vectors after a short read. */
+			seekpos += nbytes;
+			iovcnt = compute_remaining_iovec(iov, iov, iovcnt, nbytes);
+		}
+
+		nblocks -= nblocks_this_segment;
+		buffers += nblocks_this_segment;
+		blocknum += nblocks_this_segment;
+	}
+
+}
+
+void
+umfile_startreadv(PgAioHandle *ioh, UmbraFileContext *ctx, ForkNumber forknum,
+				  BlockNumber blocknum, void **buffers, BlockNumber nblocks)
+{
+	off_t		seekpos;
+	UmfdVec    *v;
+	BlockNumber nblocks_this_segment;
+	struct iovec *iov;
+	int			iovcnt;
+	int			ret;
+
+	v = umfile_getseg(ctx, ctx->rlocator,
+					  forknum, blocknum, false /* skipFsync */,
+					  UM_EXTENSION_FAIL | UM_EXTENSION_CREATE_RECOVERY,
+					  RelFileLocatorBackendIsTemp(ctx->rlocator));
+
+	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+	nblocks_this_segment =
+		Min(nblocks, RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+	if (nblocks_this_segment != nblocks)
+		elog(ERROR, "read crossing segment boundary");
+
+	iovcnt = pgaio_io_get_iovec(ioh, &iov);
+	Assert(nblocks <= (BlockNumber) iovcnt);
+
+	iovcnt = umfile_buffers_to_iovec(iov, buffers, (int) nblocks_this_segment);
+
+	if (!(io_direct_flags & IO_DIRECT_DATA))
+		pgaio_io_set_flag(ioh, PGAIO_HF_BUFFERED);
+
+	ret = FileStartReadV(ioh, v->umfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_READ);
+	if (ret != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not start reading blocks %u..%u in file \"%s\": %m",
+						blocknum,
+						blocknum + nblocks_this_segment - 1,
+						FilePathName(v->umfd_vfd))));
 }

-static BlockNumber
-umfile_nblocks_in_seg(File vfd)
-{
-	pgoff_t		size;
+void
+umfile_startreadv_physical(PgAioHandle *ioh, UmbraFileContext *ctx,
+						   ForkNumber forknum,
+						   BlockNumber logical_blocknum,
+						   BlockNumber physical_blocknum,
+						   void **buffers, BlockNumber nblocks)
+{
+	off_t		seekpos;
+	UmfdVec    *v;
+	BlockNumber nblocks_this_segment;
+	struct iovec *iov;
+	int			iovcnt;
+	int			ret;
+
+	/*
+	 * Caller is responsible for not crossing physical segment boundaries.
+	 * Umbra MAP translation enforces single-block I/O via ummaxcombine().
+	 */
+	Assert(nblocks >= 1);
+	{
+		v = umfile_getseg(ctx, ctx->rlocator,
+						  forknum, physical_blocknum, false /* skipFsync */,
+						  UM_EXTENSION_FAIL,
+						  RelFileLocatorBackendIsTemp(ctx->rlocator));
+	}
+
+	seekpos = (off_t) BLCKSZ * (physical_blocknum % ((BlockNumber) RELSEG_SIZE));
+	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+	nblocks_this_segment =
+		Min(nblocks,
+			RELSEG_SIZE - (physical_blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+	if (nblocks_this_segment != nblocks)
+		elog(ERROR, "read crossing segment boundary");
+
+	iovcnt = pgaio_io_get_iovec(ioh, &iov);
+	Assert(nblocks <= (BlockNumber) iovcnt);
+
+	iovcnt = umfile_buffers_to_iovec(iov, buffers, (int) nblocks_this_segment);
+
+	if (!(io_direct_flags & IO_DIRECT_DATA))
+		pgaio_io_set_flag(ioh, PGAIO_HF_BUFFERED);
+
+	/*
+	 * The I/O is started with physical addressing (file/seekpos), while the
+	 * error report below keeps using the caller's logical block numbers.
+	 */
+	{
+		ret = FileStartReadV(ioh, v->umfd_vfd, iovcnt, seekpos,
+							 WAIT_EVENT_DATA_FILE_READ);
+	}
+	if (ret != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not start reading blocks %u..%u in file \"%s\": %m",
+						logical_blocknum,
+						logical_blocknum + nblocks_this_segment - 1,
+						FilePathName(v->umfd_vfd))));
+}
+
+void
+umfile_writev(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
+			  const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+	while (nblocks > 0)
+	{
+		struct iovec iov[PG_IOV_MAX];
+		int			iovcnt;
+		off_t		seekpos;
+		int			nbytes;
+		UmfdVec    *v;
+		BlockNumber nblocks_this_segment;
+		size_t		transferred_this_segment;
+		size_t		size_this_segment;
+
+		v = umfile_getseg(ctx, ctx->rlocator,
+						  forknum, blocknum, skipFsync,
+						  UM_EXTENSION_FAIL | UM_EXTENSION_CREATE_RECOVERY,
+						  RelFileLocatorBackendIsTemp(ctx->rlocator));
+
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+		nblocks_this_segment =
+			Min(nblocks, RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+		nblocks_this_segment = Min(nblocks_this_segment, (BlockNumber) lengthof(iov));
+
+		if (nblocks_this_segment != nblocks)
+			elog(ERROR, "write crosses segment boundary");
+
+		iovcnt = umfile_buffers_to_iovec(iov, (void **) buffers,
+										 (int) nblocks_this_segment);
+
+		size_this_segment = nblocks_this_segment * BLCKSZ;
+		transferred_this_segment = 0;
+
+		for (;;)
+		{
+			TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum,
+												 ctx->rlocator.locator.spcOid,
+												 ctx->rlocator.locator.dbOid,
+												 ctx->rlocator.locator.relNumber,
+												 ctx->rlocator.backend);
+
+			nbytes = FileWriteV(v->umfd_vfd, iov, iovcnt, seekpos,
+								WAIT_EVENT_DATA_FILE_WRITE);
+
+			TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
+												ctx->rlocator.locator.spcOid,
+												ctx->rlocator.locator.dbOid,
+												ctx->rlocator.locator.relNumber,
+												ctx->rlocator.backend,
+												nbytes,
+												size_this_segment - transferred_this_segment);
+
+			if (nbytes < 0)
+			{
+				bool		enospc = errno == ENOSPC;
+
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write blocks %u..%u in file \"%s\": %m",
+								blocknum,
+								blocknum + nblocks_this_segment - 1,
+								FilePathName(v->umfd_vfd)),
+						 enospc ? errhint("Check free disk space.") : 0));
+			}
+
+			transferred_this_segment += nbytes;
+			Assert(transferred_this_segment <= size_this_segment);
+			if (transferred_this_segment == size_this_segment)
+				break;

-	size = FileSize(vfd);
-	if (size < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not determine size of file \"%s\": %m",
-						FilePathName(vfd))));
-	if ((size % BLCKSZ) != 0)
-		ereport(ERROR,
-				(errcode(ERRCODE_DATA_CORRUPTED),
-				 errmsg("file \"%s\" has partial block contents",
-						FilePathName(vfd)),
-				 errdetail("File size %lld is not a multiple of %d bytes.",
-						   (long long) size, BLCKSZ)));
+			/* Adjust position and vectors after a short write. */
+			seekpos += nbytes;
+			iovcnt = compute_remaining_iovec(iov, iov, iovcnt, nbytes);
+		}
+
+		if (!skipFsync && !RelFileLocatorBackendIsTemp(ctx->rlocator))
+			umfile_register_dirty_seg(ctx->rlocator,
+									  false /* isTempRelation */ ,
+									  forknum, v);
+
+		nblocks -= nblocks_this_segment;
+		buffers += nblocks_this_segment;
+		blocknum += nblocks_this_segment;
+	}

-	return (BlockNumber) (size / BLCKSZ);
 }

-static RelPathStr
-umfile_segpath(RelFileLocatorBackend rlocator, ForkNumber forknum,
-			   BlockNumber segno)
+void
+umfile_writeback(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
+				 BlockNumber nblocks)
 {
-	RelPathStr	base;
-	RelPathStr	fullpath;
+	UmfdVec    *v;
+	off_t		seekpos;

-	if (forknum == UMBRA_METADATA_FORKNUM)
-		base = UmMetadataRelPathBackend(rlocator);
-	else
-		base = relpath(rlocator, forknum);
+	while (nblocks > 0)
+	{
+		BlockNumber nflush;

-	if (segno == 0)
-		return base;
+		v = umfile_getseg(ctx, ctx->rlocator,
+						  forknum, blocknum, false /* skipFsync */,
+						  UM_EXTENSION_FAIL | UM_EXTENSION_CREATE_RECOVERY,
+						  RelFileLocatorBackendIsTemp(ctx->rlocator));
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);

-	snprintf(fullpath.str, sizeof(fullpath.str), "%s.%u", base.str, segno);
-	return fullpath;
+		nflush = Min(nblocks, (BlockNumber) RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+		Assert(nflush >= 1);
+		Assert(nflush <= nblocks);
+
+		FileWriteback(v->umfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH);
+
+		nblocks -= nflush;
+		blocknum += nflush;
+	}
 }

-static UmfdVec *
-umfile_openseg(UmbraFileContext *ctx, RelFileLocatorBackend rlocator,
-			   ForkNumber forknum, BlockNumber segno, int oflags)
+static BlockNumber
+umfile_nblocks_sparse(UmbraFileContext *ctx, RelFileLocatorBackend rlocator,
+					  ForkNumber forknum)
 {
-	UmfdVec    *seg;
-	RelPathStr	path;
-	File		fd;
-	int			old_nseg;
-
-	Assert(ctx != NULL);
+	UmfdVec    *v;
+	BlockNumber nblocks;
+	BlockNumber	minsegno;
+	BlockNumber	maxsegno;

-	old_nseg = ctx->num_open_segs[forknum];
-	if (segno < (BlockNumber) old_nseg)
-	{
-		seg = umfile_v_get(ctx, forknum, (int) segno);
-		if (umfile_seg_entry_is_open(seg))
-			return seg;
-	}
+	Assert(umfile_fork_allows_sparse_segments(forknum));

-	path = umfile_segpath(rlocator, forknum, segno);
-	fd = PathNameOpenFile(path.str, umfile_open_flags() | oflags);
-	if (fd < 0)
-		return NULL;
+	if (!umfile_sparse_fork_scan_segments(ctx, forknum, &minsegno, &maxsegno))
+		return 0;

-	if (segno >= (BlockNumber) old_nseg)
+	if (maxsegno >= (BlockNumber) ctx->num_open_segs[forknum] ||
+		!umfile_seg_entry_is_open(umfile_v_get(ctx, forknum, (int) maxsegno)))
 	{
-		umfile_fdvec_resize(ctx, forknum, segno + 1);
-		for (int i = old_nseg; i < ctx->num_open_segs[forknum]; i++)
-			umfile_seg_entry_reset(umfile_v_get(ctx, forknum, i));
+		v = umfile_openseg(ctx, rlocator, forknum, maxsegno, 0);
+		if (v == NULL)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							umfile_segpath(rlocator, forknum, maxsegno).str)));
 	}
+	else
+		v = umfile_v_get(ctx, forknum, (int) maxsegno);

-	seg = umfile_v_get(ctx, forknum, (int) segno);
-	seg->umfd_vfd = fd;
-	seg->umfd_segno = segno;
-
-	Assert(umfile_nblocks_in_seg(seg->umfd_vfd) <= (BlockNumber) RELSEG_SIZE);
-	return seg;
+	nblocks = umfile_nblocks_in_seg(v->umfd_vfd);
+	if (nblocks > (BlockNumber) RELSEG_SIZE)
+		elog(FATAL, "segment too big");
+	return (maxsegno * ((BlockNumber) RELSEG_SIZE)) + nblocks;
 }

-static UmfdVec *
-umfile_openfork(UmbraFileContext *ctx, RelFileLocatorBackend rlocator,
-				ForkNumber forknum, int behavior)
+static BlockNumber
+umfile_nblocks_dense(UmbraFileContext *ctx, RelFileLocatorBackend rlocator,
+					 ForkNumber forknum)
 {
-	RelPathStr	path;
-	File		fd;
-	UmfdVec    *seg;
+	UmfdVec    *v;
+	BlockNumber nblocks;
+	BlockNumber segno;

-	Assert(ctx != NULL);
+	/*
+	 * Match md.c semantics: missing forks read as size 0.
+	 *
+	 * This is relied on by size-reporting code paths (pg_table_size, psql \d+),
+	 * and by callers that probe optional forks without doing smgrexists() first.
+	 */
+	if (umfile_openfork(ctx, rlocator, forknum, UM_EXTENSION_RETURN_NULL) == NULL)
+		return 0;
+	Assert(ctx->num_open_segs[forknum] > 0);

-	if (ctx->num_open_segs[forknum] > 0)
-	{
-		seg = umfile_v_get(ctx, forknum, 0);
-		if (umfile_seg_entry_is_open(seg))
-			return seg;
-	}
+	segno = ctx->num_open_segs[forknum] - 1;
+	v = umfile_v_get(ctx, forknum, segno);

-	path = umfile_segpath(rlocator, forknum, 0);
-	fd = PathNameOpenFile(path.str, umfile_open_flags());
-	if (fd < 0)
+	for (;;)
 	{
-		if ((behavior & UM_EXTENSION_RETURN_NULL) &&
-			FILE_POSSIBLY_DELETED(errno))
-			return NULL;
+		nblocks = umfile_nblocks_in_seg(v->umfd_vfd);
+		if (nblocks > (BlockNumber) RELSEG_SIZE)
+			elog(FATAL, "segment too big");
+		if (nblocks < (BlockNumber) RELSEG_SIZE)
+			return (segno * ((BlockNumber) RELSEG_SIZE)) + nblocks;

-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m", path.str)));
+		segno++;
+		v = umfile_openseg(ctx, rlocator, forknum, segno, 0);
+		if (v == NULL)
+			return segno * ((BlockNumber) RELSEG_SIZE);
 	}
-
-	if (ctx->num_open_segs[forknum] == 0)
-		umfile_fdvec_resize(ctx, forknum, 1);
-	seg = umfile_v_get(ctx, forknum, 0);
-	seg->umfd_vfd = fd;
-	seg->umfd_segno = 0;
-
-	Assert(umfile_nblocks_in_seg(seg->umfd_vfd) <= (BlockNumber) RELSEG_SIZE);
-	return seg;
 }

-static UmfdVec *
-umfile_getseg(UmbraFileContext *ctx, RelFileLocatorBackend rlocator,
-			  ForkNumber forknum, BlockNumber blkno,
-			  bool skipFsync, int behavior)
+BlockNumber
+umfile_nblocks(UmbraFileContext *ctx, ForkNumber forknum, UmFileNblocksMode mode)
 {
-	UmfdVec    *seg;
-	BlockNumber	targetseg;
-	BlockNumber	nextsegno;
+	RelFileLocatorBackend rlocator;

 	Assert(ctx != NULL);
-	Assert(behavior &
-		   (UM_EXTENSION_FAIL | UM_EXTENSION_CREATE |
-			UM_EXTENSION_RETURN_NULL | UM_EXTENSION_DONT_OPEN));
+	rlocator = ctx->rlocator;

-	targetseg = blkno / ((BlockNumber) RELSEG_SIZE);
+	if (mode == UMFILE_NBLOCKS_SPARSE)
+		return umfile_nblocks_sparse(ctx, rlocator, forknum);

-	if (targetseg < (BlockNumber) ctx->num_open_segs[forknum])
-	{
-		seg = umfile_v_get(ctx, forknum, (int) targetseg);
-		if (umfile_seg_entry_is_open(seg))
-			return seg;
-	}
+	return umfile_nblocks_dense(ctx, rlocator, forknum);
+}

-	if (behavior & UM_EXTENSION_DONT_OPEN)
-		return NULL;
+void
+umfile_truncate(UmbraFileContext *ctx, ForkNumber forknum,
+				BlockNumber curnblk, BlockNumber nblocks)
+{
+	BlockNumber priorblocks;
+	int			curopensegs;

-	if (ctx->num_open_segs[forknum] > 0)
-		seg = umfile_v_get(ctx, forknum, ctx->num_open_segs[forknum] - 1);
-	else
+	if (nblocks > curnblk)
 	{
-		seg = umfile_openfork(ctx, rlocator, forknum, behavior);
-		if (seg == NULL)
-			return NULL;
+		if (InRecovery)
+			return;
+		ereport(ERROR,
+				(errmsg("could not truncate file \"%s\" to %u blocks: it's only %u blocks now",
+						relpath(ctx->rlocator, forknum).str,
+						nblocks, curnblk)));
 	}
+	if (nblocks == curnblk)
+		return;

-	for (nextsegno = ctx->num_open_segs[forknum];
-		 nextsegno <= targetseg;
-		 nextsegno++)
+	curopensegs = ctx->num_open_segs[forknum];
+	while (curopensegs > 0)
 	{
-		BlockNumber	nblocks;
-		int			flags = 0;
+		UmfdVec    *v;

-		Assert(nextsegno == seg->umfd_segno + 1);
+		priorblocks = (curopensegs - 1) * RELSEG_SIZE;
+		v = umfile_v_get(ctx, forknum, curopensegs - 1);

-		nblocks = umfile_nblocks_in_seg(seg->umfd_vfd);
-		if (nblocks > (BlockNumber) RELSEG_SIZE)
-			elog(FATAL, "Umbra segment too big");
+		if (priorblocks > nblocks)
+		{
+			if (FileTruncate(v->umfd_vfd, 0, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(v->umfd_vfd))));

-		if ((behavior & UM_EXTENSION_CREATE) ||
-			(InRecovery && (behavior & UM_EXTENSION_CREATE_RECOVERY)))
+			if (!RelFileLocatorBackendIsTemp(ctx->rlocator))
+				umfile_register_dirty_seg(ctx->rlocator,
+										  RelFileLocatorBackendIsTemp(ctx->rlocator),
+										  forknum, v);
+
+			Assert(v != umfile_v_get(ctx, forknum, 0));
+
+			FileClose(v->umfd_vfd);
+			umfile_fdvec_resize(ctx, forknum, curopensegs - 1);
+		}
+		else if (priorblocks + ((BlockNumber) RELSEG_SIZE) > nblocks)
 		{
-			if (nblocks < (BlockNumber) RELSEG_SIZE)
-			{
-				char	   *zerobuf;
+			BlockNumber lastsegblocks = nblocks - priorblocks;

-				zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE,
-										 MCXT_ALLOC_ZERO);
-				umfile_extend(ctx, forknum,
-							  nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
-							  zerobuf, skipFsync);
-				pfree(zerobuf);
-			}
-			flags = O_CREAT;
+			if (FileTruncate(v->umfd_vfd, (off_t) lastsegblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\" to %u blocks: %m",
+								FilePathName(v->umfd_vfd),
+								nblocks)));
+
+			if (!RelFileLocatorBackendIsTemp(ctx->rlocator))
+				umfile_register_dirty_seg(ctx->rlocator,
+										  RelFileLocatorBackendIsTemp(ctx->rlocator),
+										  forknum, v);
 		}
-		else if (nblocks < (BlockNumber) RELSEG_SIZE)
+		else
 		{
-			if (behavior & UM_EXTENSION_RETURN_NULL)
+			break;
+		}
+		curopensegs--;
+	}
+}
+
+void
+umfile_registersync(UmbraFileContext *ctx, ForkNumber forknum)
+{
+	int			segno;
+	int			min_inactive_seg;
+
+	if (umfile_fork_allows_sparse_segments(forknum))
+	{
+		RelPathStr	path = umfile_segpath(ctx->rlocator, forknum, 0);
+		BlockNumber *segnos = NULL;
+		int			nsegnos = 0;
+		int			i;
+
+		if (!umfile_collect_existing_segnos_by_path(path.str, &segnos, &nsegnos))
+			return;
+
+		for (i = 0; i < nsegnos; i++)
+		{
+			BlockNumber	curseg = segnos[i];
+			UmfdVec	   *v;
+
+			if (curseg < (BlockNumber) ctx->num_open_segs[forknum] &&
+				umfile_seg_entry_is_open(umfile_v_get(ctx, forknum, (int) curseg)))
+				v = umfile_v_get(ctx, forknum, (int) curseg);
+			else
 			{
-				errno = ENOENT;
-				return NULL;
+				v = umfile_openseg(ctx, ctx->rlocator, forknum, curseg, 0);
+				if (v == NULL)
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not open file \"%s\": %m",
+									umfile_segpath(ctx->rlocator, forknum, curseg).str)));
 			}

-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not open file \"%s\" (target block %u): previous segment is only %u blocks",
-							umfile_segpath(rlocator, forknum, nextsegno).str,
-							blkno, nblocks)));
+			umfile_register_dirty_seg(ctx->rlocator,
+									  RelFileLocatorBackendIsTemp(ctx->rlocator),
+									  forknum, v);
 		}

-		seg = umfile_openseg(ctx, rlocator, forknum, nextsegno, flags);
-		if (seg == NULL)
-		{
-			if ((behavior & UM_EXTENSION_RETURN_NULL) &&
-				FILE_POSSIBLY_DELETED(errno))
-				return NULL;
+		if (segnos != NULL)
+			pfree(segnos);
+		return;
+	}

-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not open file \"%s\" (target block %u): %m",
-							umfile_segpath(rlocator, forknum, nextsegno).str,
-							blkno)));
+	(void) umfile_nblocks(ctx, forknum, UMFILE_NBLOCKS_DENSE);
+
+	min_inactive_seg = segno = ctx->num_open_segs[forknum];
+
+	while (umfile_openseg(ctx, ctx->rlocator, forknum, segno, 0) != NULL)
+		segno++;
+
+	while (segno > 0)
+	{
+		UmfdVec    *v = umfile_v_get(ctx, forknum, segno - 1);
+
+		umfile_register_dirty_seg(ctx->rlocator,
+								  RelFileLocatorBackendIsTemp(ctx->rlocator),
+								  forknum, v);
+
+		if (segno > min_inactive_seg)
+		{
+			FileClose(v->umfd_vfd);
+			umfile_fdvec_resize(ctx, forknum, segno - 1);
 		}
-	}

-	return seg;
+		segno--;
+	}
 }

-static bool
-umfile_fork_has_open_segment(UmbraFileContext *ctx, ForkNumber forknum)
+void
+umfile_immedsync(UmbraFileContext *ctx, ForkNumber forknum)
 {
-	Assert(ctx != NULL);
+	int			segno;
+	int			min_inactive_seg;

-	for (int i = 0; i < ctx->num_open_segs[forknum]; i++)
+	if (umfile_fork_allows_sparse_segments(forknum))
 	{
-		if (umfile_seg_entry_is_open(umfile_v_get(ctx, forknum, i)))
-			return true;
+		RelPathStr	path = umfile_segpath(ctx->rlocator, forknum, 0);
+		BlockNumber *segnos = NULL;
+		int			nsegnos = 0;
+		int			i;
+
+		if (!umfile_collect_existing_segnos_by_path(path.str, &segnos, &nsegnos))
+			return;
+
+		for (i = 0; i < nsegnos; i++)
+		{
+			BlockNumber	curseg = segnos[i];
+			UmfdVec	   *v;
+
+			if (curseg < (BlockNumber) ctx->num_open_segs[forknum] &&
+				umfile_seg_entry_is_open(umfile_v_get(ctx, forknum, (int) curseg)))
+				v = umfile_v_get(ctx, forknum, (int) curseg);
+			else
+			{
+				v = umfile_openseg(ctx, ctx->rlocator, forknum, curseg, 0);
+				if (v == NULL)
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not open file \"%s\": %m",
+									umfile_segpath(ctx->rlocator, forknum, curseg).str)));
+			}
+
+			if (FileSync(v->umfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
+				ereport(data_sync_elevel(ERROR),
+						(errcode_for_file_access(),
+						 errmsg("could not fsync file \"%s\": %m",
+								FilePathName(v->umfd_vfd))));
+		}
+
+		if (segnos != NULL)
+			pfree(segnos);
+		return;
 	}

-	return false;
-}
+	(void) umfile_nblocks(ctx, forknum, UMFILE_NBLOCKS_DENSE);

-static bool
-umfile_fork_has_open_segment_on_disk(UmbraFileContext *ctx,
-									 RelFileLocatorBackend rlocator,
-									 ForkNumber forknum)
-{
-	bool		have_live = false;
+	min_inactive_seg = segno = ctx->num_open_segs[forknum];

-	Assert(ctx != NULL);
+	while (umfile_openseg(ctx, ctx->rlocator, forknum, segno, 0) != NULL)
+		segno++;

-	for (int i = 0; i < ctx->num_open_segs[forknum]; i++)
+	while (segno > 0)
 	{
-		UmfdVec    *seg = umfile_v_get(ctx, forknum, i);
-		RelPathStr	path;
+		UmfdVec    *v = umfile_v_get(ctx, forknum, segno - 1);

-		if (!umfile_seg_entry_is_open(seg))
-			continue;
+		if (FileSync(v->umfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
+			ereport(data_sync_elevel(ERROR),
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(v->umfd_vfd))));

-		path = umfile_segpath(rlocator, forknum, seg->umfd_segno);
-		if (access(path.str, F_OK) == 0)
+		if (segno > min_inactive_seg)
 		{
-			have_live = true;
-			continue;
+			FileClose(v->umfd_vfd);
+			umfile_fdvec_resize(ctx, forknum, segno - 1);
 		}

-		FileClose(seg->umfd_vfd);
-		umfile_seg_entry_reset(seg);
+		segno--;
 	}

-	return have_live;
 }

-static inline bool
-umfile_seg_entry_is_open(const UmfdVec *seg)
+int
+umfile_fd(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
 {
-	return (seg != NULL && seg->umfd_vfd >= 0);
-}
+	UmfdVec    *v;

-static inline void
-umfile_seg_entry_reset(UmfdVec *seg)
-{
-	seg->umfd_vfd = -1;
-	seg->umfd_segno = InvalidBlockNumber;
+	/*
+	 * Sparse mapped forks can address a target segment even when segment 0 is
+	 * absent. Reopen the specific target segment directly instead of insisting
+	 * that segment 0 exists.
+	 */
+	if (!umfile_fork_allows_sparse_segments(forknum))
+		(void) umfile_openfork(ctx, ctx->rlocator, forknum, UM_EXTENSION_FAIL);
+
+	v = umfile_getseg(ctx, ctx->rlocator,
+					  forknum, blocknum, false /* skipFsync */,
+					  UM_EXTENSION_FAIL,
+					  RelFileLocatorBackendIsTemp(ctx->rlocator));
+
+	*off = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	Assert(*off < (off_t) BLCKSZ * RELSEG_SIZE);
+
+	return FileGetRawDesc(v->umfd_vfd);
 }
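
The per-segment decision that umfile_truncate makes in the hunk above can be read as a three-way classification on the segment number. The following is a minimal standalone sketch of that arithmetic only, not the patch's code: the sketch_* names and the SKETCH_RELSEG_SIZE value (131072 blocks, i.e. 1 GB segments of 8 kB blocks) are assumptions for illustration, and the sums use 64-bit arithmetic so they cannot wrap.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Assumption: default build with 8 kB blocks and 1 GB segments. */
#define SKETCH_RELSEG_SIZE ((BlockNumber) 131072)

typedef enum
{
	SKETCH_SEG_KEEP,			/* segment lies wholly below the new length */
	SKETCH_SEG_SHRINK,			/* segment contains the new end of the fork */
	SKETCH_SEG_DROP				/* segment lies wholly past the new length */
} SketchSegAction;

/* Classify segment `segno` for a truncate of the fork to `nblocks` blocks. */
static SketchSegAction
sketch_truncate_action(BlockNumber segno, BlockNumber nblocks)
{
	uint64_t	priorblocks = (uint64_t) segno * SKETCH_RELSEG_SIZE;

	if (priorblocks > nblocks)
		return SKETCH_SEG_DROP;
	if (priorblocks + SKETCH_RELSEG_SIZE > nblocks)
		return SKETCH_SEG_SHRINK;
	return SKETCH_SEG_KEEP;
}

/* Number of blocks left in segment `segno` after the truncate. */
static BlockNumber
sketch_blocks_kept(BlockNumber segno, BlockNumber nblocks)
{
	uint64_t	priorblocks = (uint64_t) segno * SKETCH_RELSEG_SIZE;

	switch (sketch_truncate_action(segno, nblocks))
	{
		case SKETCH_SEG_DROP:
			return 0;			/* truncated to zero length */
		case SKETCH_SEG_SHRINK:
			return (BlockNumber) (nblocks - priorblocks);
		default:
			return SKETCH_RELSEG_SIZE;	/* untouched full segment */
	}
}
```

With these assumed sizes, truncating to 200000 blocks keeps segment 0, shrinks segment 1 to 68928 blocks, and empties segment 2. In the patch, a segment in the "drop" case is truncated to zero length and its descriptor closed rather than unlinked, and a segment whose first block equals the new length falls in the "shrink" case with zero blocks kept.
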
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 2c964b6f3d..51ed171c33 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -29,6 +29,9 @@
 #include "storage/fd.h"
 #include "storage/latch.h"
 #include "storage/md.h"
+#ifdef USE_UMBRA
+#include "storage/umbra.h"
+#endif
 #include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/wait_event.h"
@@ -115,7 +118,14 @@ static const SyncOps syncsw[] = {
 	/* pg_multixact/members */
 	[SYNC_HANDLER_MULTIXACT_MEMBER] = {
 		.sync_syncfiletag = multixactmemberssyncfiletag
-	}
+	},
+#ifdef USE_UMBRA
+	[SYNC_HANDLER_UMBRA] = {
+		.sync_syncfiletag = umsyncfiletag,
+		.sync_unlinkfiletag = umunlinkfiletag,
+		.sync_filetagmatches = umfiletagmatches
+	},
+#endif
 };

 /*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7bda529855..a1de5a08d4 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -387,6 +387,7 @@ ReplicationOriginState	"Waiting to read or update the progress of one replicatio
 ReplicationSlotIO	"Waiting for I/O on a replication slot."
 LockFastPath	"Waiting to read or update a process' fast-path lock information."
 BufferMapping	"Waiting to associate a data block with a buffer in the buffer pool."
+MapBufferContent	"Waiting to read or update an Umbra metadata map cache page."
 LockManager	"Waiting to read or update information about <quote>heavyweight</quote> locks."
 PredicateLockManager	"Waiting to access predicate lock information used by serializable transactions."
 ParallelHashJoin	"Waiting to synchronize workers during Parallel Hash Join plan execution."
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 6f074013aa..62476de48e 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -823,10 +823,10 @@ InitPostgres(const char *in_dbname, Oid dboid,
 		before_shmem_exit(ShutdownXLOG, 0);
 	}

-	/*
-	 * Initialize the relation cache and the system catalog caches.  Note that
-	 * no catalog access happens here; we only set up the hashtable structure.
-	 * We must do this before starting a transaction because transaction abort
+		/*
+		 * Initialize the relation cache and the system catalog caches.  Note that
+		 * no catalog access happens here; we only set up the hashtable structure.
+		 * We must do this before starting a transaction because transaction abort
 	 * would try to touch these hashtables.
 	 */
 	RelationCacheInitialize();
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index af8553bcb6..a8a476c19d 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -110,6 +110,7 @@ PG_LWLOCKTRANCHE(REPLICATION_ORIGIN_STATE, ReplicationOriginState)
 PG_LWLOCKTRANCHE(REPLICATION_SLOT_IO, ReplicationSlotIO)
 PG_LWLOCKTRANCHE(LOCK_FASTPATH, LockFastPath)
 PG_LWLOCKTRANCHE(BUFFER_MAPPING, BufferMapping)
+PG_LWLOCKTRANCHE(MAP_BUFFER_CONTENT, MapBufferContent)
 PG_LWLOCKTRANCHE(LOCK_MANAGER, LockManager)
 PG_LWLOCKTRANCHE(PREDICATE_LOCK_MANAGER, PredicateLockManager)
 PG_LWLOCKTRANCHE(PARALLEL_HASH_JOIN, ParallelHashJoin)
diff --git a/src/include/storage/map.h b/src/include/storage/map.h
index b0887794c3..b4f6063f35 100644
--- a/src/include/storage/map.h
+++ b/src/include/storage/map.h
@@ -1,53 +1,254 @@
 /*-------------------------------------------------------------------------
  *
  * map.h
- *	  Umbra metadata-fork disk layout helpers.
+ *	  physical map layer: logical block number to physical block number mapping
  *
- * This header defines the stable on-disk page layout and address translation
- * helpers for Umbra's relation-local metadata file.
- *
- * src/include/storage/map.h
+ * This header defines MAP metadata page layout, shared cache APIs, and address
+ * translation helpers for Umbra relation-local metadata.
  *
  *-------------------------------------------------------------------------
  */
 #ifndef MAP_H
 #define MAP_H

+#include "access/xlogdefs.h"
+#include "lib/ilist.h"
+#include "port/atomics.h"
 #include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/mapsuper.h"
 #include "storage/relfilelocator.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "storage/spin.h"
+#include "storage/umfile.h"
+
+/* Forward declarations */
+typedef struct MapSharedData MapSharedData;
+typedef struct MapBufferDesc MapBufferDesc;

+/* Map buffer configuration */
 #define MAP_ENTRIES_PER_PAGE (BLCKSZ / sizeof(uint32))
+#define MAP_SUPERBLOCK_MIN_ENTRIES 50000

 /*
- * Umbra metadata file page layout:
- * - block 0: superblock payload
+ * Umbra metadata fork page layout:
+ * - block 0: superblock (512-byte payload)
  * - blocks 1..: repeated proportional groups
  *
- * Each group reserves one FSM map page, one VM map page, and 8192 MAIN map
- * pages. That keeps the mapping formula stable while leaving room for the
- * auxiliary forks to grow alongside MAIN.
+ * Each proportional group contains:
+ * - 1 FSM map page
+ * - 1 VM map page
+ * - 8192 MAIN map pages
+ *
+ * This keeps MAIN close to the front of the file while preserving a stable
+ * formula-based layout and reserving room for auxiliary forks as MAIN grows.
  */
-#define MAP_BLOCK_SUPER		0
-#define MAP_BLOCK_FIRST_GROUP	1
-#define MAP_GROUP_FSM_PAGES	1
-#define MAP_GROUP_VM_PAGES	1
-#define MAP_GROUP_MAIN_PAGES	8192
+#define MAP_BLOCK_SUPER        0
+#define MAP_BLOCK_FIRST_GROUP  1
+#define MAP_GROUP_FSM_PAGES    1
+#define MAP_GROUP_VM_PAGES     1
+#define MAP_GROUP_MAIN_PAGES   8192
 #define MAP_GROUP_TOTAL_PAGES \
 	(MAP_GROUP_FSM_PAGES + MAP_GROUP_VM_PAGES + MAP_GROUP_MAIN_PAGES)

+/* Map buffer state bits */
+#define MAPBUF_VALID_MASK  0x000001FF	/* refcount (max 511) */
+#define MAPBUF_USAGE_COUNT_SHIFT  9
+#define MAPBUF_USAGE_COUNT_MASK  0x00003E00
+#define MAPBUF_DIRTY       0x00004000
+#define MAPBUF_IO_IN_PROGRESS 0x00008000
+#define MAPBUF_IO_ERROR    0x00010000
+#define MAPBUF_JUST_DIRTIED 0x00020000
+#define MAPBUF_NOT_MATERIALIZED 0x00040000
+#define MAPBUF_CHECKPOINT_NEEDED 0x00080000
+
+#define MAPBUF_GET_REFCOUNT(state) \
+	((state) & MAPBUF_VALID_MASK)
+#define MAPBUF_GET_USAGECOUNT(state) \
+	(((state) & MAPBUF_USAGE_COUNT_MASK) >> MAPBUF_USAGE_COUNT_SHIFT)
+#define MAPBUF_USAGECOUNT_ONE  0x00000200
+
+/* Map page: a pure array of pblkno values */
 typedef struct MapPage
 {
 	uint32		pblknos[MAP_ENTRIES_PER_PAGE];
 } MapPage;

-extern void MapPageInit(MapPage *page);
-extern BlockNumber MapPageGetEntry(const MapPage *page, int entry_idx);
-extern void MapPageSetEntry(MapPage *page, int entry_idx, BlockNumber pblkno);

-extern BlockNumber MapForkPageIndexToMapBlkno(ForkNumber forknum,
-											  BlockNumber fork_page_idx);
-extern BlockNumber MapLblknoToMapBlkno(ForkNumber forknum, BlockNumber lblkno);
-extern bool MapDecodeMapBlkno(BlockNumber map_blkno, ForkNumber *forknum,
-							  BlockNumber *fork_page_idx);
+/* Shared memory control structure */
+typedef struct MapSharedData
+{
+	/* clock sweep algorithm */
+	pg_atomic_uint32 next_victim_buffer;
+	slock_t		clock_lock;
+	int			first_free_buffer;	/* head of free list, -1 if empty */
+
+	/* statistics */
+	pg_atomic_uint32 num_allocs;
+	uint32		complete_passes;
+
+	/* configuration */
+	int			num_slots;
+} MapSharedData;
+
+/* Values for freeNext field */
+#define FREENEXT_END_OF_LIST  (-1)
+#define FREENEXT_NOT_IN_LIST  (-2)
+
+/*
+ * MapBufferDesc -- shared descriptor/state data for a single map buffer.
+ */
+typedef struct MapBufferDesc
+{
+	RelFileLocator rnode;	/* relation identifier */
+	ForkNumber	forknum;	/* fork number */
+	int			page_number; /* map page number in this slot, -1 if empty */
+	XLogRecPtr	page_lsn;	/* LSN of last modification */
+	int			id;			/* slot ID */
+	pg_atomic_uint32 state; /* state flags */
+	int			freeNext;	/* next buffer in free list */
+	int			wait_backend_pid;	/* backend PID of pin-count waiter */
+	LWLock		buffer_lock;		/* lock for buffer content access */
+	LWLock		io_in_progress_lock; /* lock for buffer I/O state */
+} MapBufferDesc;
+
+
+extern void MapBackendInit(void);
+extern const ShmemCallbacks MapShmemCallbacks;
+
+/* Lookup/modification */
+extern bool MapTryLookup(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						 ForkNumber forknum, BlockNumber lblkno,
+						 BlockNumber *pblkno);
+extern BlockNumber MapTryLookupPblkRun(UmbraFileContext *map_ctx,
+									   RelFileLocator rnode,
+									   ForkNumber forknum,
+									   BlockNumber lblkno,
+									   BlockNumber maxblocks,
+									   BlockNumber *start_pblkno);/* Buffer management */
+extern int	MapReadBuffer(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						  ForkNumber forknum, BlockNumber map_blkno);
+
+/* MAP superblock helpers */
+extern void MapSBlockInit(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						  XLogRecPtr map_lsn);
+extern bool MapSBlockEnsureLoaded(UmbraFileContext *map_ctx, RelFileLocator rnode);
+extern bool MapSBlockTryGetLogicalNblocks(UmbraFileContext *map_ctx,
+										  RelFileLocator rnode,
+										  ForkNumber forknum,
+										  BlockNumber *nblocks);
+extern bool MapSBlockForkExists(UmbraFileContext *map_ctx,
+								RelFileLocator rnode,
+								ForkNumber forknum);
+extern bool MapSBlockTryGetNextFreePhysBlock(UmbraFileContext *map_ctx,
+											 RelFileLocator rnode,
+											 ForkNumber forknum,
+											 BlockNumber *next_free_pblk);
+extern bool MapSBlockTryGetPhysicalNblocks(UmbraFileContext *map_ctx,
+										   RelFileLocator rnode,
+										   ForkNumber forknum,
+										   BlockNumber *nblocks);
+extern void MapSBlockBumpLogicalNblocks(UmbraFileContext *map_ctx,
+										RelFileLocator rnode,
+										ForkNumber forknum,
+										BlockNumber nblocks,
+										XLogRecPtr map_lsn);
+extern void MapSBlockBumpPhysicalNblocks(UmbraFileContext *map_ctx,
+										 RelFileLocator rnode,
+										 ForkNumber forknum,
+										 BlockNumber nblocks,
+										 XLogRecPtr map_lsn);
+extern bool MapSBlockEnsurePhysicalNblocks(UmbraFileContext *map_ctx,
+										   RelFileLocator rnode,
+										   ForkNumber forknum,
+										   BlockNumber nblocks,
+										   bool skipFsync);
+extern void MapSBlockBumpNextFreePhysBlock(UmbraFileContext *map_ctx,
+										   RelFileLocator rnode,
+										   ForkNumber forknum,
+										   BlockNumber next_free_pblk,
+										   XLogRecPtr map_lsn);
+extern void MapSBlockSetLogicalNblocks(UmbraFileContext *map_ctx,
+									   RelFileLocator rnode,
+									   ForkNumber forknum,
+									   BlockNumber nblocks,
+									   XLogRecPtr map_lsn);
+extern void MapSBlockSetSkipWalPending(UmbraFileContext *map_ctx,
+									   RelFileLocator rnode,
+									   bool pending,
+									   XLogRecPtr map_lsn);
+extern bool MapSBlockIsSkipWalPending(UmbraFileContext *map_ctx,
+									  RelFileLocator rnode);
+
+/* Checkpoint interface */
+extern void MapPreCheckpoint(void);
+extern void MapCheckpoint(void);
+extern void MapCheckpointRelation(RelFileLocator rnode);
+extern void MapCheckpointDatabaseTablespaces(Oid dbid, int ntablespaces,
+											 const Oid *tablespace_ids);
+extern void MapPostCheckpoint(void);
+extern int	MapBgWriterFlush(int max_pages);
+extern void MapAbortBufferIO(void);
+extern void MapBackendExitCleanup(void);
+
+/* Relation lifecycle */
+extern void MapDrop(RelFileLocator rnode);
+extern void MapTruncate(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						ForkNumber forknum, BlockNumber n_lblknos,
+						XLogRecPtr map_lsn);
+extern void MapPreloadTruncatePages(UmbraFileContext *map_ctx,
+									RelFileLocator rnode,
+									ForkNumber forknum,
+									BlockNumber n_lblknos);
+extern void MapReleasePreloadedTruncatePages(RelFileLocator rnode,
+											 ForkNumber forknum);
+extern void MapInvalidateRelation(RelFileLocator rnode);
+extern void MapInvalidateDatabaseTablespaces(Oid dbid, int ntablespaces,
+											 const Oid *tablespace_ids);
+extern void MapInvalidateDatabase(Oid dbid);
+
+/* Scan helpers */
+extern BlockNumber MapGetLogicalBlockCount(UmbraFileContext *map_ctx,
+										   RelFileLocator rnode,
+										   ForkNumber forknum);
+extern BlockNumber MapGetPhysicalBlockCount(UmbraFileContext *map_ctx,
+											RelFileLocator rnode,
+											ForkNumber forknum,
+											BlockNumber n_lblknos);
+
+/* Clock algorithm */
+extern int	MapClockGetBuffer(void);
+extern void MapClockFreeBuffer(int slot_id);
+extern int	MapSyncStart(uint32 *complete_passes, uint32 *num_allocs);
+
+/* Map cache hash table (in mapclock.c) */
+extern int	MapCacheLookup(RelFileLocator rnode, ForkNumber forknum,
+						   BlockNumber map_blkno);
+extern int	MapCacheInsert(RelFileLocator rnode, ForkNumber forknum,
+						   BlockNumber map_blkno, int slot_id);
+extern void MapCacheDelete(RelFileLocator rnode, ForkNumber forknum,
+						   BlockNumber map_blkno, int slot_id);
+
+/* Buffer pin/unpin */
+extern void MapPinBuffer(int slot_id, bool adjust_usage);
+extern void MapUnpinBuffer(int slot_id);
+extern void MapInvalidateBuffer(int slot_id, RelFileLocator expected_rnode,
+								ForkNumber expected_forknum,
+								BlockNumber expected_map_blkno);
+
+/* GUCs */
+extern int	map_buffers;
+extern int	map_superblocks;
+
+/* Global data (defined in map.c) */
+extern MapSharedData *MapShared;
+extern MapBufferDesc *MapBuffers;
+extern char *MapPageData;		/* actual page data (contiguous block) */
+
+#define MapGetPage(slot_id) \
+	((MapPage *) (MapPageData + ((slot_id) * BLCKSZ)))

 #endif							/* MAP_H */
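
The MAPBUF_* state bits in map.h pack a 9-bit refcount, a 5-bit usage count, and the flag bits into one 32-bit word. The sketch below copies the mask values from the header and shows how the two counters are extracted and bumped. The sketch_* helpers are hypothetical, and saturating the usage count at the field maximum is an assumption made here; the patch may cap it at a lower value.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from map.h in this patch. */
#define MAPBUF_VALID_MASK			0x000001FFu	/* refcount, max 511 */
#define MAPBUF_USAGE_COUNT_SHIFT	9
#define MAPBUF_USAGE_COUNT_MASK		0x00003E00u	/* 5 bits, max 31 */
#define MAPBUF_DIRTY				0x00004000u
#define MAPBUF_USAGECOUNT_ONE		0x00000200u

/* Hypothetical helper: extract the pin count from a state word. */
static uint32_t
sketch_refcount(uint32_t state)
{
	return state & MAPBUF_VALID_MASK;
}

/* Hypothetical helper: extract the clock-sweep usage count. */
static uint32_t
sketch_usagecount(uint32_t state)
{
	return (state & MAPBUF_USAGE_COUNT_MASK) >> MAPBUF_USAGE_COUNT_SHIFT;
}

/*
 * Hypothetical helper: take one pin and bump the usage count.
 * Assumption: the usage count saturates at the field maximum.
 */
static uint32_t
sketch_pin(uint32_t state)
{
	assert(sketch_refcount(state) < MAPBUF_VALID_MASK);

	state += 1;					/* refcount lives in the low bits */
	if ((state & MAPBUF_USAGE_COUNT_MASK) != MAPBUF_USAGE_COUNT_MASK)
		state += MAPBUF_USAGECOUNT_ONE;
	return state;
}
```

Because the counters occupy disjoint bit ranges, a pin leaves flags such as MAPBUF_DIRTY untouched; in shared memory the same update would go through the atomic operations on the pg_atomic_uint32 state field.
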
diff --git a/src/include/storage/map_internal.h b/src/include/storage/map_internal.h
new file mode 100644
index 0000000000..8a2ee89deb
--- /dev/null
+++ b/src/include/storage/map_internal.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * map_internal.h
+ *	  Internal interfaces shared by split MAP implementation files.
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MAP_INTERNAL_H
+#define MAP_INTERNAL_H
+
+#include "storage/map.h"
+
+extern void MapEnsurePrivateRefCount(void);
+extern void MapCacheTableShmemRequest(void);
+extern void MapCacheTableShmemInit(void);
+extern void MapBufferUpdateStateBits(MapBufferDesc *buf, uint32 set_bits,
+									 uint32 clear_bits);
+extern void MapMarkBufferDirty(UmbraFileContext *map_ctx, MapBufferDesc *buf,
+							   XLogRecPtr page_lsn);
+extern bool MapStartBufferIO(MapBufferDesc *buf, uint32 required_bits);
+extern void MapTerminateBufferIO(MapBufferDesc *buf, bool clear_dirty,
+								 uint32 set_flag_bits);
+extern void MapFlushBuffer(int slot_id);
+extern void MapResetAllTruncatePreloads(void);
+extern BlockNumber MapForkPageIndexToMapBlkno(ForkNumber forknum,
+											  BlockNumber fork_page_idx);
+extern BlockNumber MapLblknoToMapBlkno(ForkNumber forknum, BlockNumber lblkno);
+#endif							/* MAP_INTERNAL_H */
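
The bodies of MapForkPageIndexToMapBlkno and MapLblknoToMapBlkno are not part of this hunk, so the following is only a sketch of one formula consistent with the layout comment in map.h (superblock at block 0, then groups of 1 FSM page, 1 VM page, and 8192 MAIN pages). The in-group ordering FSM, VM, MAIN, the rule that FSM/VM page k lives in group k, the SketchFork enum, and the 8 kB block size are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

#define SKETCH_BLCKSZ			8192	/* assumed default block size */
#define MAP_ENTRIES_PER_PAGE	(SKETCH_BLCKSZ / sizeof(uint32_t))
#define MAP_BLOCK_FIRST_GROUP	1
#define MAP_GROUP_FSM_PAGES		1
#define MAP_GROUP_VM_PAGES		1
#define MAP_GROUP_MAIN_PAGES	8192
#define MAP_GROUP_TOTAL_PAGES \
	(MAP_GROUP_FSM_PAGES + MAP_GROUP_VM_PAGES + MAP_GROUP_MAIN_PAGES)

/* Hypothetical stand-in for ForkNumber, limited to the mapped forks. */
typedef enum
{
	SKETCH_MAIN,
	SKETCH_FSM,
	SKETCH_VM
} SketchFork;

/* Map a per-fork map page index to a block in the metadata fork. */
static BlockNumber
sketch_fork_page_to_map_blkno(SketchFork fork, BlockNumber fork_page_idx)
{
	BlockNumber group;
	BlockNumber slot;

	switch (fork)
	{
		case SKETCH_FSM:
			group = fork_page_idx;	/* one FSM map page per group */
			slot = 0;
			break;
		case SKETCH_VM:
			group = fork_page_idx;	/* one VM map page per group */
			slot = MAP_GROUP_FSM_PAGES;
			break;
		default:
			group = fork_page_idx / MAP_GROUP_MAIN_PAGES;
			slot = MAP_GROUP_FSM_PAGES + MAP_GROUP_VM_PAGES +
				fork_page_idx % MAP_GROUP_MAIN_PAGES;
			break;
	}
	return MAP_BLOCK_FIRST_GROUP + group * MAP_GROUP_TOTAL_PAGES + slot;
}

/* Map a logical block number to the map page that holds its entry. */
static BlockNumber
sketch_lblkno_to_map_blkno(SketchFork fork, BlockNumber lblkno)
{
	return sketch_fork_page_to_map_blkno(fork,
				(BlockNumber) (lblkno / MAP_ENTRIES_PER_PAGE));
}
```

Under these assumptions each map page covers 2048 logical blocks, the first MAIN map page is metadata block 3, and the second group starts at block 8195.
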
diff --git a/src/include/storage/mapsuper.h b/src/include/storage/mapsuper.h
index 1f6a5dca5a..3421c0ba58 100644
--- a/src/include/storage/mapsuper.h
+++ b/src/include/storage/mapsuper.h
@@ -1,12 +1,11 @@
 /*-------------------------------------------------------------------------
  *
  * mapsuper.h
- *	  Umbra metadata superblock helpers.
+ *	  MAP superblock metadata helpers.
  *
- * The superblock is stored in metadata block 0. Its first 512 bytes contain a
- * versioned payload plus CRC, and the remainder of the block is reserved.
- *
- * src/include/storage/mapsuper.h
+ * The on-disk layout is a 512-byte sector:
+ * - first 64 bytes: MapSuperblockData payload
+ * - remaining 448 bytes: zero padding
  *
  *-------------------------------------------------------------------------
  */
@@ -14,9 +13,9 @@
 #define MAPSUPER_H

 #include "access/xlogdefs.h"
+#include "common/relpath.h"
 #include "port/pg_crc32c.h"
 #include "storage/block.h"
-#include "storage/smgr.h"

 #define MAP_SUPERBLOCK_MAGIC 0x554D4252U /* "UMBR" */
 #define MAP_SUPERBLOCK_VERSION 1U
@@ -27,11 +26,13 @@

 typedef struct pg_attribute_packed() MapSuperblockData
 {
+	/* identity/version */
 	uint32		magic;
 	uint32		version;
 	uint32		blcksz;
 	uint32		flags;

+	/* physical allocator state */
 	BlockNumber next_free_phys_block_main;
 	BlockNumber phys_capacity_main;
 	BlockNumber next_free_phys_block_fsm;
@@ -39,10 +40,12 @@ typedef struct pg_attribute_packed() MapSuperblockData
 	BlockNumber next_free_phys_block_vm;
 	BlockNumber phys_capacity_vm;

+	/* logical block count cache */
 	BlockNumber logical_nblocks_main;
 	BlockNumber logical_nblocks_fsm;
 	BlockNumber logical_nblocks_vm;

+	/* crash-safety metadata */
 	XLogRecPtr	last_updated_lsn;
 	pg_crc32c	crc;
 } MapSuperblockData;
@@ -88,13 +91,10 @@ extern BlockNumber MapSuperblockGetLogicalNblocks(const MapSuperblock *super,
 extern void MapSuperblockSetLogicalNblocks(MapSuperblock *super, ForkNumber forknum,
 										   BlockNumber nblocks);

-extern void MapSuperblockPackPage(const MapSuperblock *super, char page[BLCKSZ]);
-extern void MapSuperblockUnpackPage(MapSuperblock *super, const char page[BLCKSZ]);
-
-extern bool MapSBlockRead(SMgrRelation reln, MapSuperblock *super);
-extern void MapSBlockWrite(SMgrRelation reln, const MapSuperblock *super,
-						   bool skipFsync);
-extern void MapSBlockInitNew(SMgrRelation reln, uint32 flags, XLogRecPtr lsn,
-							 bool skipFsync);
+/* 512-byte sector I/O helpers */
+extern void MapSuperblockPackSector(const MapSuperblock *super,
+									char sector[MAP_SUPERBLOCK_SIZE]);
+extern void MapSuperblockUnpackSector(MapSuperblock *super,
+									  const char sector[MAP_SUPERBLOCK_SIZE]);

 #endif							/* MAPSUPER_H */
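
The header comment in mapsuper.h describes the on-disk superblock as a 512-byte sector whose first 64 bytes hold the payload and whose remaining 448 bytes are zero. The sketch below shows that framing on raw bytes only. It is not MapSuperblockPackSector: the sketch_* names are hypothetical, the payload is treated as an opaque 64-byte buffer, and no CRC is computed or checked.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_SECTOR_SIZE	512		/* sector size from the header comment */
#define SKETCH_PAYLOAD_SIZE	64		/* payload size from the header comment */

/* Copy the payload to the front of the sector and zero the padding. */
static void
sketch_pack_sector(const unsigned char *payload, unsigned char *sector)
{
	memset(sector, 0, SKETCH_SECTOR_SIZE);
	memcpy(sector, payload, SKETCH_PAYLOAD_SIZE);
}

/* Recover the payload; the padding bytes are ignored. */
static void
sketch_unpack_sector(unsigned char *payload, const unsigned char *sector)
{
	memcpy(payload, sector, SKETCH_PAYLOAD_SIZE);
}

/* Return 1 if every byte after the payload is zero, else 0. */
static int
sketch_padding_is_zero(const unsigned char *sector)
{
	for (int i = SKETCH_PAYLOAD_SIZE; i < SKETCH_SECTOR_SIZE; i++)
	{
		if (sector[i] != 0)
			return 0;
	}
	return 1;
}

/* Self-check: pack over a dirty buffer, unpack, and compare. */
static int
sketch_roundtrip_ok(void)
{
	unsigned char in[SKETCH_PAYLOAD_SIZE];
	unsigned char out[SKETCH_PAYLOAD_SIZE];
	unsigned char sector[SKETCH_SECTOR_SIZE];

	for (int i = 0; i < SKETCH_PAYLOAD_SIZE; i++)
		in[i] = (unsigned char) (i + 1);
	memset(sector, 0xFF, sizeof(sector));	/* simulate stale contents */

	sketch_pack_sector(in, sector);
	sketch_unpack_sector(out, sector);

	return memcmp(in, out, sizeof(in)) == 0 &&
		sketch_padding_is_zero(sector);
}
```

Zeroing the whole sector before copying the payload means stale bytes in the caller's buffer never reach disk, which keeps the padding deterministic.
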
diff --git a/src/include/storage/mapsuper_internal.h b/src/include/storage/mapsuper_internal.h
new file mode 100644
index 0000000000..960469538f
--- /dev/null
+++ b/src/include/storage/mapsuper_internal.h
@@ -0,0 +1,157 @@
+/*-------------------------------------------------------------------------
+ *
+ * mapsuper_internal.h
+ *	  Shared-memory MAP superblock table internals.
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MAPSUPER_INTERNAL_H
+#define MAPSUPER_INTERNAL_H
+
+#include "storage/lwlock.h"
+#include "storage/mapsuper.h"
+#include "storage/relfilelocator.h"
+#include "storage/umfile.h"
+
+#define MAPSUPER_FLAG_VALID		0x01
+#define MAPSUPER_FLAG_DIRTY		0x02
+#define MAPSUPER_FLAG_CORRUPT	0x04
+
+#define MAPSUPER_RUNTIME_FLAG_EXTENDING_MAIN	0x08
+#define MAPSUPER_RUNTIME_FLAG_EXTENDING_FSM	0x10
+#define MAPSUPER_RUNTIME_FLAG_EXTENDING_VM	0x20
+
+typedef struct MapSuperTag
+{
+	RelFileLocator	rnode;
+} MapSuperTag;
+
+typedef struct MapSuperEntry
+{
+	MapSuperTag	key;
+	MapSuperblock super;
+	XLogRecPtr	page_lsn;
+	uint32		flags;
+	uint32		runtime_flags;
+	BlockNumber	reserved_next_free_main;
+	BlockNumber	reserved_next_free_fsm;
+	BlockNumber	reserved_next_free_vm;
+	BlockNumber	extending_target_main;
+	BlockNumber	extending_target_fsm;
+	BlockNumber	extending_target_vm;
+	int			next_free;
+	bool		in_use;
+	LWLock		lock;
+} MapSuperEntry;
+
+extern MapSuperEntry *MapSuperEntries;
+extern int	MapSuperCapacity;
+
+static inline MapSuperEntry *
+MapSuperEntryBySlot(int slot_id)
+{
+	Assert(slot_id >= 0 && slot_id < MapSuperCapacity);
+	return &MapSuperEntries[slot_id];
+}
+
+extern void MapSBlockReportCorrupt(RelFileLocator rnode, const char *reason);
+extern bool MapForkHasMappedState(ForkNumber forknum);
+extern BlockNumber MapNormalizeForkBlockCount(ForkNumber forknum,
+											  BlockNumber raw);
+
+static inline BlockNumber
+MapSuperGetReservedNextFree(const MapSuperEntry *entry, ForkNumber forknum)
+{
+	Assert(entry != NULL);
+
+	switch (forknum)
+	{
+		case MAIN_FORKNUM:
+			return entry->reserved_next_free_main;
+		case FSM_FORKNUM:
+			return entry->reserved_next_free_fsm;
+		case VISIBILITYMAP_FORKNUM:
+			return entry->reserved_next_free_vm;
+		default:
+			elog(ERROR, "unsupported fork number for reservation frontier: %d",
+				 forknum);
+	}
+
+	pg_unreachable();
+}
+
+static inline void
+MapSuperSetReservedNextFree(MapSuperEntry *entry, ForkNumber forknum,
+							BlockNumber blkno)
+{
+	Assert(entry != NULL);
+
+	blkno = MapNormalizeForkBlockCount(forknum, blkno);
+
+	switch (forknum)
+	{
+		case MAIN_FORKNUM:
+			entry->reserved_next_free_main = blkno;
+			break;
+		case FSM_FORKNUM:
+			entry->reserved_next_free_fsm = blkno;
+			break;
+		case VISIBILITYMAP_FORKNUM:
+			entry->reserved_next_free_vm = blkno;
+			break;
+		default:
+			elog(ERROR, "unsupported fork number for reservation frontier: %d",
+				 forknum);
+	}
+}
+
+static inline void
+MapSuperMaybeBumpReservedNextFree(MapSuperEntry *entry, ForkNumber forknum,
+								  BlockNumber blkno)
+{
+	BlockNumber	current;
+
+	Assert(entry != NULL);
+
+	blkno = MapNormalizeForkBlockCount(forknum, blkno);
+	current = MapSuperGetReservedNextFree(entry, forknum);
+	if (current < blkno)
+		MapSuperSetReservedNextFree(entry, forknum, blkno);
+}
+
+static inline void
+MapSuperResetReservedNextFrees(MapSuperEntry *entry)
+{
+	Assert(entry != NULL);
+
+	MapSuperSetReservedNextFree(entry, MAIN_FORKNUM,
+								MapSuperblockGetNextFreePhysBlock(&entry->super,
+																  MAIN_FORKNUM));
+	MapSuperSetReservedNextFree(entry, FSM_FORKNUM,
+								MapSuperblockGetNextFreePhysBlock(&entry->super,
+																  FSM_FORKNUM));
+	MapSuperSetReservedNextFree(entry, VISIBILITYMAP_FORKNUM,
+								MapSuperblockGetNextFreePhysBlock(&entry->super,
+																  VISIBILITYMAP_FORKNUM));
+}
+
+extern bool MapSuperFindEntryLocked(RelFileLocator rnode, LWLockMode mode,
+									MapSuperEntry **entry);
+extern bool MapSuperFindEntryTryLocked(RelFileLocator rnode, LWLockMode mode,
+									   MapSuperEntry **entry);
+extern MapSuperEntry *MapSuperEnsureEntryLocked(RelFileLocator rnode);
+extern void MapSuperDeleteEntry(RelFileLocator rnode);
+extern bool MapSuperForkExists(const MapSuperblock *super,
+							   ForkNumber forknum);
+extern void MapSBlockBumpPhysicalState(UmbraFileContext *map_ctx,
+									   RelFileLocator rnode,
+									   ForkNumber forknum,
+									   BlockNumber nblocks,
+									   bool bump_next_free,
+									   bool bump_capacity,
+									   XLogRecPtr map_lsn);
+extern void MapSuperTableShmemRequest(void);
+extern void MapSuperTableShmemInit(void);
+extern void MapSuperTableShmemAttach(void);
+
+#endif							/* MAPSUPER_INTERNAL_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 8d06d69b51..47dbf12643 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -119,6 +119,12 @@ extern void smgrcopyrelationmetadata(SMgrRelation src, SMgrRelation dst,
 extern void smgrsyncrelationmetadata(SMgrRelation reln);
 extern void smgrunlinkrelationmetadata(RelFileLocatorBackend rlocator,
 									   bool isRedo);
+extern bool smgrcreatedballowswallog(void);
+extern void smgrcheckpointdatabasetablespaces(Oid dbid, int ntablespaces,
+											  const Oid *tablespace_ids);
+extern void smgrinvalidatedatabasetablespaces(Oid dbid, int ntablespaces,
+											  const Oid *tablespace_ids);
+extern void smgrinvalidatedatabase(Oid dbid);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 						 BlockNumber *old_nblocks,
 						 BlockNumber *nblocks);
diff --git a/src/include/storage/subsystemlist.h b/src/include/storage/subsystemlist.h
index 9ad619080b..3b9bd4e9ee 100644
--- a/src/include/storage/subsystemlist.h
+++ b/src/include/storage/subsystemlist.h
@@ -42,6 +42,9 @@ PG_SHMEM_SUBSYSTEM(MultiXactShmemCallbacks)
 PG_SHMEM_SUBSYSTEM(BufferManagerShmemCallbacks)
 PG_SHMEM_SUBSYSTEM(StrategyCtlShmemCallbacks)
 PG_SHMEM_SUBSYSTEM(BufTableShmemCallbacks)
+#ifdef USE_UMBRA
+PG_SHMEM_SUBSYSTEM(MapShmemCallbacks)
+#endif

 /* lock manager */
 PG_SHMEM_SUBSYSTEM(LockManagerShmemCallbacks)
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index 88290500bc..559a8eea6c 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -39,6 +39,9 @@ typedef enum SyncRequestHandler
 	SYNC_HANDLER_COMMIT_TS,
 	SYNC_HANDLER_MULTIXACT_OFFSET,
 	SYNC_HANDLER_MULTIXACT_MEMBER,
+#ifdef USE_UMBRA
+	SYNC_HANDLER_UMBRA,
+#endif
 	SYNC_HANDLER_NONE,
 } SyncRequestHandler;

diff --git a/src/include/storage/umbra.h b/src/include/storage/umbra.h
index 2fb3c2f75e..b41fae75ea 100644
--- a/src/include/storage/umbra.h
+++ b/src/include/storage/umbra.h
@@ -17,6 +17,7 @@
 #include "storage/block.h"
 #include "storage/relfilelocator.h"
 #include "storage/smgr.h"
+#include "storage/sync.h"
 #include "storage/um_defs.h"

 extern bool UmMetadataExists(SMgrRelation reln);
@@ -25,17 +26,25 @@ extern BlockNumber UmMetadataNblocks(SMgrRelation reln);
 extern void UmMetadataRead(SMgrRelation reln, BlockNumber blkno, void *buffer);
 extern void UmMetadataWrite(SMgrRelation reln, BlockNumber blkno,
 							const void *buffer, bool skipFsync);
+extern void UmMetadataWriteSuperblock(RelFileLocatorBackend rlocator,
+									  const void *sector, bool skipFsync);
 extern void UmMetadataExtend(SMgrRelation reln, BlockNumber blkno,
 							 const void *buffer, bool skipFsync);
 extern void UmMetadataImmediateSync(SMgrRelation reln);
 extern void UmMetadataUnlink(RelFileLocatorBackend rlocator, bool isRedo);
+extern void UmInvalidateDatabase(Oid dbid);

 extern void uminit(void);
 extern void umopen(SMgrRelation reln);
 extern void umclose(SMgrRelation reln, ForkNumber forknum);
 extern void umdestroy(SMgrRelation reln);
 extern bool umisinternalfork(ForkNumber forknum);
+extern bool umcreatedballowswallog(void);
 extern void umcreaterelationmetadata(SMgrRelation reln);
+extern void umcheckpointdatabasetablespaces(Oid dbid, int ntablespaces,
+											const Oid *tablespace_ids);
+extern void uminvalidatedatabasetablespaces(Oid dbid, int ntablespaces,
+											const Oid *tablespace_ids);
 extern void umcopyrelationmetadata(SMgrRelation src, SMgrRelation dst,
 								   char relpersistence);
 extern void umsyncrelationmetadata(SMgrRelation reln);
@@ -69,5 +78,8 @@ extern void umimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void umregistersync(SMgrRelation reln, ForkNumber forknum);
 extern int	umfd(SMgrRelation reln, ForkNumber forknum,
 				 BlockNumber blocknum, uint32 *off);
+extern int umsyncfiletag(const FileTag *ftag, char *path);
+extern int umunlinkfiletag(const FileTag *ftag, char *path);
+extern bool umfiletagmatches(const FileTag *ftag, const FileTag *candidate);

 #endif							/* UMBRA_H */
diff --git a/src/include/storage/umfile.h b/src/include/storage/umfile.h
index 56936aa697..8b7400140d 100644
--- a/src/include/storage/umfile.h
+++ b/src/include/storage/umfile.h
@@ -1,22 +1,20 @@
 /*-------------------------------------------------------------------------
  *
  * umfile.h
- *	  Umbra backend-local file/context helpers.
+ *    Umbra file/segment manager (backend-local).
  *
  * This layer owns backend-local file contexts keyed by RelFileLocatorBackend.
- * It is the low-level file access boundary beneath Umbra metadata and mapping
- * code.
- *
- * src/include/storage/umfile.h
+ * It provides low-level physical file/segment handling for Umbra forks.
  *
  *-------------------------------------------------------------------------
  */
+
 #ifndef UMFILE_H
 #define UMFILE_H

+#include "storage/fd.h"
 #include "storage/aio_types.h"
 #include "storage/block.h"
-#include "storage/fd.h"
 #include "storage/relfilelocator.h"
 #include "storage/um_defs.h"

@@ -34,68 +32,88 @@ typedef enum UmFileExistsMode
 	UMFILE_EXISTS_SPARSE
 } UmFileExistsMode;

-extern void umfile_init(void);
-
+/*
+ * Backend-local context registry.
+ *
+ * umfile owns physical file contexts keyed by RelFileLocatorBackend. smgr and
+ * MAP may borrow a context, but umfile is the only owner.
+ */
 extern UmbraFileContext *umfile_ctx_lookup(RelFileLocatorBackend rlocator);
 extern UmbraFileContext *umfile_ctx_acquire(RelFileLocatorBackend rlocator);
-extern UmbraFileContext *umfile_ctx_create_temporary(RelFileLocatorBackend rlocator);
-extern void umfile_ctx_destroy_temporary(UmbraFileContext *ctx);
-extern void umfile_ctx_release(RelFileLocatorBackend rlocator);
 extern void umfile_ctx_forget(RelFileLocatorBackend rlocator);
 extern void umfile_ctx_close_fork(UmbraFileContext *ctx, ForkNumber forknum);
+extern UmbraFileContext *umfile_ctx_create_temporary(RelFileLocatorBackend rlocator);
+extern void umfile_ctx_destroy_temporary(UmbraFileContext *ctx);

+/*
+ * Low-level context I/O helpers for Umbra MAP subsystem.
+ *
+ * These provide direct physical addressing against fork files without going
+ * through smgr mapping translation.
+ */
 extern bool umfile_ctx_fork_exists(UmbraFileContext *ctx, ForkNumber forknum,
 								   UmFileExistsMode mode);
-extern BlockNumber umfile_ctx_get_nblocks(UmbraFileContext *ctx,
-										  ForkNumber forknum,
+extern BlockNumber umfile_ctx_get_nblocks(UmbraFileContext *ctx, ForkNumber forknum,
 										  UmFileNblocksMode mode);
-extern void umfile_ctx_read(UmbraFileContext *ctx, ForkNumber forknum,
-							BlockNumber blkno, char *buffer, int nbytes);
-extern void umfile_ctx_write(UmbraFileContext *ctx, ForkNumber forknum,
-							 BlockNumber blkno, const char *buffer,
-							 int nbytes, bool skipFsync);
-extern void umfile_ctx_extend(UmbraFileContext *ctx, ForkNumber forknum,
-							  BlockNumber blkno, const char *buffer);
-extern void umfile_ctx_unlinkfork(RelFileLocatorBackend rlocator,
-								  ForkNumber forknum, bool isRedo);
+extern void umfile_ctx_read(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blkno,
+							char *buffer, int nbytes);
+extern void umfile_ctx_write(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blkno,
+							 const char *buffer, int nbytes, bool skipFsync);
+extern void umfile_ctx_extend(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blkno,
+							  const char *buffer);
+extern void umfile_ctx_prefetch(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blkno);
+extern bool umfile_ctx_block_exists(UmbraFileContext *ctx, ForkNumber forknum,
+									BlockNumber blkno);
+extern bool umfile_ctx_segment_exists(UmbraFileContext *ctx, ForkNumber forknum,
+									  BlockNumber segno);
+extern void umfile_ctx_register_dirty(UmbraFileContext *ctx, ForkNumber forknum,
+									  BlockNumber blkno, bool skipFsync,
+									  bool isTempRelation);
+extern void umfile_ctx_unlinkfork(RelFileLocatorBackend rlocator, ForkNumber forkNum,
+								  bool isRedo);

+/* lifecycle */
+extern void umfile_init(void);
+
+/* smgr-equivalent operations (physical file semantics) */
+extern void umfile_create(UmbraFileContext *ctx, ForkNumber forknum, bool isRedo);
 extern bool umfile_exists(UmbraFileContext *ctx, ForkNumber forknum,
 						  UmFileExistsMode mode);
 extern bool umfile_open_or_create(UmbraFileContext *ctx, ForkNumber forknum,
 								  bool isRedo, bool *created);
+extern void umfile_unlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
+extern void umfile_extend(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
+						  const void *buffer, bool skipFsync);
+extern void umfile_zeroextend(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
+							  int nblocks, bool skipFsync);
+extern bool umfile_prefetch(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum, int nblocks);
+extern uint32 umfile_maxcombine(ForkNumber forknum, BlockNumber blocknum);
+extern void umfile_readv(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
+						 void **buffers, BlockNumber nblocks);
+extern void umfile_startreadv(PgAioHandle *ioh, UmbraFileContext *ctx, ForkNumber forknum,
+							  BlockNumber blocknum, void **buffers, BlockNumber nblocks);
+/*
+ * Start an async read using physical addressing, while preserving the logical
+ * identity (block number) for error reporting and reopen semantics.
+ *
+ * This is used by Umbra's MAP translation: the file/offset are based on the
+ * physical block number, but smgr target identity remains logical.
+ */
+extern void umfile_startreadv_physical(PgAioHandle *ioh, UmbraFileContext *ctx,
+						   ForkNumber forknum,
+						   BlockNumber logical_blocknum,
+						   BlockNumber physical_blocknum,
+						   void **buffers, BlockNumber nblocks);
+extern void umfile_writev(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
+						  const void **buffers, BlockNumber nblocks, bool skipFsync);
+extern void umfile_writeback(UmbraFileContext *ctx, ForkNumber forknum,
+							 BlockNumber blocknum, BlockNumber nblocks);
 extern BlockNumber umfile_nblocks(UmbraFileContext *ctx, ForkNumber forknum,
 								  UmFileNblocksMode mode);
-extern void umfile_readv(UmbraFileContext *ctx, ForkNumber forknum,
-						 BlockNumber blocknum, void **buffers,
-						 BlockNumber nblocks);
-extern void umfile_writev(UmbraFileContext *ctx, ForkNumber forknum,
-						  BlockNumber blocknum, const void **buffers,
-						  BlockNumber nblocks, bool skipFsync);
-extern void umfile_extend(UmbraFileContext *ctx, ForkNumber forknum,
-						  BlockNumber blocknum, const void *buffer,
-						  bool skipFsync);
-extern void umfile_zeroextend(UmbraFileContext *ctx, ForkNumber forknum,
-							  BlockNumber blocknum, int nblocks,
-							  bool skipFsync);
 extern void umfile_truncate(UmbraFileContext *ctx, ForkNumber forknum,
 							BlockNumber old_blocks, BlockNumber nblocks);
 extern void umfile_immedsync(UmbraFileContext *ctx, ForkNumber forknum);
 extern void umfile_registersync(UmbraFileContext *ctx, ForkNumber forknum);
-extern void umfile_unlink(RelFileLocatorBackend rlocator, ForkNumber forknum,
-						  bool isRedo);
-
-/* Metadata-only convenience wrappers over the generic umfile surface. */
-extern bool umfile_metadata_exists(UmbraFileContext *ctx);
-extern bool umfile_metadata_open_or_create(UmbraFileContext *ctx,
-										   bool isRedo, bool *created);
-extern BlockNumber umfile_metadata_nblocks(UmbraFileContext *ctx);
-extern void umfile_metadata_read(UmbraFileContext *ctx, BlockNumber blkno,
-								 void *buffer);
-extern void umfile_metadata_write(UmbraFileContext *ctx, BlockNumber blkno,
-								  const void *buffer);
-extern void umfile_metadata_extend(UmbraFileContext *ctx, BlockNumber blkno,
-								   const void *buffer);
-extern void umfile_metadata_immedsync(UmbraFileContext *ctx);
-extern void umfile_metadata_unlink(RelFileLocatorBackend rlocator, bool isRedo);
+extern int umfile_fd(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum, uint32 *off);

 #endif							/* UMFILE_H */
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 36d789720a..0cbdf133ca 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -61,6 +61,9 @@ tests += {
       't/050_redo_segment_missing.pl',
       't/051_effective_wal_level.pl',
       't/052_checkpoint_segment_missing.pl',
+      't/053_umbra_map_superblock_watermark.pl',
+      't/054_umbra_map_fork_policy.pl',
+      't/063_umbra_mainfork_head_unlink_checkpoint.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/053_umbra_map_superblock_watermark.pl b/src/test/recovery/t/053_umbra_map_superblock_watermark.pl
new file mode 100644
index 0000000000..5f254146d4
--- /dev/null
+++ b/src/test/recovery/t/053_umbra_map_superblock_watermark.pl
@@ -0,0 +1,104 @@
+# Verify MAP superblock watermarks don't regress across crash restart.
+#
+# This test is UMBRA-specific. In md mode there is no MAP fork, so skip.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+sub u32le_from_hex
+{
+	my ($hex, $offset) = @_;
+	my $chunk = substr($hex, $offset * 2, 8);
+	my @b = ($chunk =~ /../g);
+
+	return hex($b[0]) +
+	  (hex($b[1]) << 8) +
+	  (hex($b[2]) << 16) +
+	  (hex($b[3]) << 24);
+}
+
+my $node = PostgreSQL::Test::Cluster->new('master');
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq{
+autovacuum = off
+});
+$node->start();
+
+$node->safe_psql(
+	'postgres', q{
+CREATE TABLE map_super_t(a int, b text);
+INSERT INTO map_super_t
+SELECT g, repeat('x', 400) FROM generate_series(1, 20000) g;
+CHECKPOINT;
+});
+
+my $map_super_hex = $node->safe_psql(
+	'postgres',
+	q{SELECT encode(pg_read_binary_file(pg_relation_filepath('map_super_t') || '_map', 0, 64, true), 'hex');}
+);
+
+my $logical_expected_1 = $node->safe_psql(
+	'postgres',
+	q{SELECT pg_relation_size('map_super_t') / current_setting('block_size')::int;}
+);
+
+my $magic_1 = u32le_from_hex($map_super_hex, 0);
+my $version_1 = u32le_from_hex($map_super_hex, 4);
+my $blcksz_1 = u32le_from_hex($map_super_hex, 8);
+my $next_free_main_1 = u32le_from_hex($map_super_hex, 16);
+my $phys_capacity_main_1 = u32le_from_hex($map_super_hex, 20);
+my $logical_main_1 = u32le_from_hex($map_super_hex, 40);
+
+is($magic_1, 0x554D4252, 'superblock magic matches UMBR');
+is($version_1, 1, 'superblock version matches');
+is($blcksz_1, $node->safe_psql('postgres', 'SHOW block_size;'), 'superblock block size matches');
+cmp_ok($logical_main_1, '==', $logical_expected_1,
+	'logical_nblocks_main matches relation size in blocks');
+cmp_ok($next_free_main_1, '>=', $logical_main_1,
+	'next_free_phys_block_main not behind logical_nblocks_main');
+cmp_ok($phys_capacity_main_1, '>=', $next_free_main_1,
+	'phys_capacity_main not behind next_free_phys_block_main');
+
+$node->safe_psql(
+	'postgres', q{
+INSERT INTO map_super_t
+SELECT g, repeat('y', 400) FROM generate_series(20001, 40000) g;
+CHECKPOINT;
+});
+
+$node->stop('immediate');
+$node->start();
+
+$map_super_hex = $node->safe_psql(
+	'postgres',
+	q{SELECT encode(pg_read_binary_file(pg_relation_filepath('map_super_t') || '_map', 0, 64, true), 'hex');}
+);
+
+my $logical_expected_2 = $node->safe_psql(
+	'postgres',
+	q{SELECT pg_relation_size('map_super_t') / current_setting('block_size')::int;}
+);
+
+my $next_free_main_2 = u32le_from_hex($map_super_hex, 16);
+my $phys_capacity_main_2 = u32le_from_hex($map_super_hex, 20);
+my $logical_main_2 = u32le_from_hex($map_super_hex, 40);
+
+cmp_ok($logical_main_2, '==', $logical_expected_2,
+	'logical_nblocks_main survives crash restart');
+cmp_ok($logical_main_2, '>=', $logical_main_1,
+	'logical_nblocks_main does not regress');
+cmp_ok($next_free_main_2, '>=', $next_free_main_1,
+	'next_free_phys_block_main does not regress');
+cmp_ok($phys_capacity_main_2, '>=', $phys_capacity_main_1,
+	'phys_capacity_main does not regress');
+cmp_ok($phys_capacity_main_2, '>=', $next_free_main_2,
+	'phys_capacity_main remains ahead of next_free_phys_block_main');
+
+done_testing();
diff --git a/src/test/recovery/t/054_umbra_map_fork_policy.pl b/src/test/recovery/t/054_umbra_map_fork_policy.pl
new file mode 100644
index 0000000000..99616152bb
--- /dev/null
+++ b/src/test/recovery/t/054_umbra_map_fork_policy.pl
@@ -0,0 +1,62 @@
+# Verify UMBRA MAP fork policy and drop lifecycle behavior.
+#
+# In UMBRA mode:
+# - permanent relations should have MAP fork
+# - unlogged/temp relations should not have MAP fork
+# - dropped permanent relation's MAP fork should disappear after checkpoint
+#
+# In md mode, skip this test.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+my $node = PostgreSQL::Test::Cluster->new('master');
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq{
+autovacuum = off
+});
+$node->start();
+
+my $perm_map_exists = $node->safe_psql(
+	'postgres', q{
+CREATE TABLE umb_perm_t(a int);
+SELECT COALESCE(encode(pg_read_binary_file(pg_relation_filepath('umb_perm_t') || '_map', 0, 1, true), 'hex'), '') <> '';
+});
+
+my $unlogged_map_exists = $node->safe_psql(
+	'postgres', q{
+CREATE UNLOGGED TABLE umb_unlogged_t(a int);
+SELECT COALESCE(encode(pg_read_binary_file(pg_relation_filepath('umb_unlogged_t') || '_map', 0, 1, true), 'hex'), '') <> '';
+});
+
+my $temp_map_exists = $node->safe_psql(
+	'postgres', q{
+CREATE TEMP TABLE umb_temp_t(a int);
+SELECT COALESCE(encode(pg_read_binary_file(pg_relation_filepath('umb_temp_t') || '_map', 0, 1, true), 'hex'), '') <> '';
+});
+
+is($perm_map_exists, 't', 'permanent relation has MAP fork');
+is($unlogged_map_exists, 'f', 'unlogged relation has no MAP fork');
+is($temp_map_exists, 'f', 'temp relation has no MAP fork');
+
+my $perm_map_path =
+  $node->safe_psql('postgres',
+	q{SELECT pg_relation_filepath('umb_perm_t') || '_map';});
+
+$node->safe_psql('postgres', q{
+DROP TABLE umb_perm_t;
+CHECKPOINT;
+});
+
+ok($node->poll_query_until('postgres',
+	"SELECT COALESCE(encode(pg_read_binary_file('$perm_map_path', 0, 1, true), 'hex'), '') = '';", 't'),
+	'dropped permanent relation MAP fork disappears after checkpoint');
+
+done_testing();
diff --git a/src/test/recovery/t/063_umbra_mainfork_head_unlink_checkpoint.pl b/src/test/recovery/t/063_umbra_mainfork_head_unlink_checkpoint.pl
new file mode 100644
index 0000000000..a8dc86d728
--- /dev/null
+++ b/src/test/recovery/t/063_umbra_mainfork_head_unlink_checkpoint.pl
@@ -0,0 +1,60 @@
+# Verify UMBRA delayed unlink behavior for MAIN fork segment 0.
+#
+# In UMBRA mode for permanent relations:
+# - DROP first truncates MAIN seg0 to 0 bytes
+# - actual unlink of MAIN seg0 is delayed to checkpoint
+#
+# In md mode, skip this test.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+my $node = PostgreSQL::Test::Cluster->new('master');
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq{
+autovacuum = off
+});
+$node->start();
+
+$node->safe_psql(
+	'postgres', q{
+CREATE TABLE umb_head_unlink_t(id int, payload text);
+INSERT INTO umb_head_unlink_t
+SELECT g, repeat('z', 2000) FROM generate_series(1, 15000) g;
+});
+
+my $main_path = $node->safe_psql(
+	'postgres',
+	q{SELECT pg_relation_filepath('umb_head_unlink_t');}
+);
+
+cmp_ok(
+	$node->safe_psql(
+		'postgres',
+		"SELECT COALESCE((pg_stat_file('$main_path', true)).size, -1);"),
+	'>',
+	0,
+	'MAIN seg0 size is non-zero before DROP');
+
+$node->safe_psql('postgres', q{DROP TABLE umb_head_unlink_t;});
+
+ok($node->poll_query_until(
+		'postgres',
+		"SELECT COALESCE((pg_stat_file('$main_path', true)).size, -1) = 0;"),
+	'MAIN seg0 is truncated to 0 before checkpoint (delayed unlink stage)');
+
+$node->safe_psql('postgres', q{CHECKPOINT;});
+
+ok($node->poll_query_until(
+		'postgres',
+		"SELECT COALESCE((pg_stat_file('$main_path', true)).size, -1) = -1;"),
+	'MAIN seg0 is physically removed after checkpoint');
+
+done_testing();
-- 
2.50.1 (Apple Git-155)
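
Reader's note on the 053 test above: its Perl helper u32le_from_hex() pulls
little-endian 32-bit fields out of the first 64 bytes of the MAP superblock
and then checks ordering invariants between them. The same decoding and
checks can be sketched in standalone C. The byte offsets and the magic value
are copied from the test; the SuperWatermarks struct and every function name
below are hypothetical and are not part of the patch, and the little-endian
layout is an assumption inherited from the test rather than a documented
on-disk contract.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offsets as read by 053_umbra_map_superblock_watermark.pl. */
enum
{
	SB_OFF_MAGIC = 0,
	SB_OFF_VERSION = 4,
	SB_OFF_BLCKSZ = 8,
	SB_OFF_NEXT_FREE_MAIN = 16,
	SB_OFF_PHYS_CAPACITY_MAIN = 20,
	SB_OFF_LOGICAL_NBLOCKS_MAIN = 40,
	SB_PREFIX_LEN = 64			/* the test reads the first 64 bytes */
};

#define SB_MAGIC 0x554D4252u	/* magic value the test expects */

/* Hypothetical container for the decoded fields; not the patch's struct. */
typedef struct SuperWatermarks
{
	uint32_t	magic;
	uint32_t	version;
	uint32_t	blcksz;
	uint32_t	next_free_main;
	uint32_t	phys_capacity_main;
	uint32_t	logical_nblocks_main;
} SuperWatermarks;

/* Read a little-endian uint32 at byte offset off, independent of host order. */
static uint32_t
read_u32le(const unsigned char *buf, size_t off)
{
	assert(buf != NULL && off + 4 <= (size_t) SB_PREFIX_LEN);
	return (uint32_t) buf[off] |
		((uint32_t) buf[off + 1] << 8) |
		((uint32_t) buf[off + 2] << 16) |
		((uint32_t) buf[off + 3] << 24);
}

/* Inverse of read_u32le; only used to build sample input. */
static void
write_u32le(unsigned char *buf, size_t off, uint32_t v)
{
	assert(buf != NULL && off + 4 <= (size_t) SB_PREFIX_LEN);
	buf[off] = (unsigned char) (v & 0xFF);
	buf[off + 1] = (unsigned char) ((v >> 8) & 0xFF);
	buf[off + 2] = (unsigned char) ((v >> 16) & 0xFF);
	buf[off + 3] = (unsigned char) ((v >> 24) & 0xFF);
}

static SuperWatermarks
decode_super(const unsigned char *buf)
{
	SuperWatermarks w;

	w.magic = read_u32le(buf, SB_OFF_MAGIC);
	w.version = read_u32le(buf, SB_OFF_VERSION);
	w.blcksz = read_u32le(buf, SB_OFF_BLCKSZ);
	w.next_free_main = read_u32le(buf, SB_OFF_NEXT_FREE_MAIN);
	w.phys_capacity_main = read_u32le(buf, SB_OFF_PHYS_CAPACITY_MAIN);
	w.logical_nblocks_main = read_u32le(buf, SB_OFF_LOGICAL_NBLOCKS_MAIN);
	return w;
}

/* Single-snapshot checks: logical <= next_free <= phys_capacity. */
static bool
watermarks_consistent(const SuperWatermarks *w)
{
	return w->magic == SB_MAGIC &&
		w->next_free_main >= w->logical_nblocks_main &&
		w->phys_capacity_main >= w->next_free_main;
}

/* Cross-restart checks: no watermark may move backwards. */
static bool
watermarks_not_regressed(const SuperWatermarks *before,
						 const SuperWatermarks *after)
{
	return after->logical_nblocks_main >= before->logical_nblocks_main &&
		after->next_free_main >= before->next_free_main &&
		after->phys_capacity_main >= before->phys_capacity_main;
}
```

A caller would decode one 64-byte prefix after the first CHECKPOINT and one
after the crash restart, then require watermarks_consistent() on each and
watermarks_not_regressed() across the pair, which is the shape of the
cmp_ok() calls in the test.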

Mingwei Jia

i@nayishan.top

22 days ago

In reply to: Mingwei Jia (#4)

[RFC PATCH v2 RESEND 06/10] umbra: add patch 5 MAP access policy, translation, and materialization

---
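
Reader's note: the smgrDoPendingSyncs() hunk in this patch widens the
condition that picks flush+sync over log_newpage_range(). As a rough
illustration of that predicate and of the block arithmetic behind
wal_skip_threshold (a GUC measured in kilobytes), here is a standalone sketch.
The function and parameter names are hypothetical; in the patch the inputs come
from pendingsync->is_truncated, RelFileLocatorSkippingWAL(),
smgrpreparependingsync(), and the compile-time BLCKSZ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the condition in smgrDoPendingSyncs().
 * Returns true when the relation should be flushed and fsynced rather
 * than WAL-logged page by page at commit.
 */
static bool
needs_flush_and_sync(bool is_truncated,
					 bool skipping_wal,
					 bool require_storage_sync,
					 uint64_t total_blocks,
					 int wal_skip_threshold_kb,
					 uint32_t block_size)
{
	uint64_t	threshold_blocks;

	assert(wal_skip_threshold_kb >= 0 && block_size > 0);

	/* Same arithmetic as wal_skip_threshold * (uint64) 1024 / BLCKSZ. */
	threshold_blocks = (uint64_t) wal_skip_threshold_kb * 1024 / block_size;

	return is_truncated ||
		skipping_wal ||
		require_storage_sync ||
		total_blocks >= threshold_blocks;
}
```

With the stock settings of wal_skip_threshold = 2MB and 8kB blocks the
threshold works out to 256 blocks, so under this sketch a 255-block relation
with none of the three flags set would take the WAL-logging path, while any
one flag forces flush+sync regardless of size.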
src/backend/catalog/storage.c | 201 +-
src/backend/storage/buffer/bufmgr.c | 12 +-
src/backend/storage/map/Makefile | 1 +
src/backend/storage/map/map.c | 335 ++-
src/backend/storage/map/mapbuf.c | 14 +
src/backend/storage/map/mapclock.c | 6 +-
src/backend/storage/map/mapinflight.c | 402 +++
src/backend/storage/map/mapinit.c | 7 +-
src/backend/storage/map/mapsuper.c | 26 +-
src/backend/storage/map/meson.build | 1 +
src/backend/storage/smgr/md.c | 1 +
src/backend/storage/smgr/smgr.c | 295 ++-
src/backend/storage/smgr/umbra.c | 2337 ++++++++++++++---
src/backend/storage/smgr/umfile.c | 19 +-
src/backend/utils/cache/relcache.c | 12 +-
src/include/storage/aio_types.h | 3 +-
src/include/storage/map.h | 37 +-
src/include/storage/map_internal.h | 23 +
src/include/storage/smgr.h | 30 +-
src/include/storage/um_defs.h | 18 +-
src/include/storage/umbra.h | 112 +-
src/test/recovery/meson.build | 1 +
.../t/061_umbra_fsm_vm_map_translation.pl | 117 +
23 files changed, 3557 insertions(+), 453 deletions(-)
create mode 100644 src/backend/storage/map/mapinflight.c
create mode 100644 src/test/recovery/t/061_umbra_fsm_vm_map_translation.pl
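
Reader's note: the storage.c changes below track truncated relfilenodes in a
linked list tagged with the subtransaction nest level, reparenting entries on
subcommit and pruning them on subabort. The following standalone sketch shows
that bookkeeping in isolation. It is not the patch's code: it keys entries by
a plain unsigned id instead of RelFileLocator, uses malloc/free instead of
TopMemoryContext, and takes the nest level as an argument instead of calling
GetCurrentTransactionNestLevel(). All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct PendingTrunc
{
	unsigned	relid;			/* stand-in for RelFileLocator */
	int			nest_level;		/* xact nest level that owns the entry */
	struct PendingTrunc *next;
} PendingTrunc;

static PendingTrunc *pending_head = NULL;

/* Record a truncate; an existing entry keeps the outermost level seen. */
static void
pending_add(unsigned relid, int nest_level)
{
	PendingTrunc *p;

	for (p = pending_head; p != NULL; p = p->next)
	{
		if (p->relid == relid)
		{
			if (p->nest_level > nest_level)
				p->nest_level = nest_level;
			return;
		}
	}

	p = malloc(sizeof(*p));
	if (p == NULL)
		abort();
	p->relid = relid;
	p->nest_level = nest_level;
	p->next = pending_head;
	pending_head = p;
}

static bool
pending_contains(unsigned relid)
{
	const PendingTrunc *p;

	for (p = pending_head; p != NULL; p = p->next)
		if (p->relid == relid)
			return true;
	return false;
}

/* Subcommit: entries owned by this level now belong to the parent. */
static void
pending_subcommit(int nest_level)
{
	PendingTrunc *p;

	assert(nest_level > 1);
	for (p = pending_head; p != NULL; p = p->next)
		if (p->nest_level >= nest_level)
			p->nest_level = nest_level - 1;
}

/* Subabort: drop entries created at this level or deeper. */
static void
pending_subabort(int nest_level)
{
	PendingTrunc *p;
	PendingTrunc *prev = NULL;
	PendingTrunc *next;

	for (p = pending_head; p != NULL; p = next)
	{
		next = p->next;
		if (p->nest_level >= nest_level)
		{
			if (prev != NULL)
				prev->next = next;
			else
				pending_head = next;
			free(p);
		}
		else
			prev = p;
	}
}

/* End of top-level transaction: forget everything. */
static void
pending_clear(void)
{
	pending_subabort(0);
}
```

The point of the reparenting step is that a truncate recorded inside a
subtransaction that later commits must survive an abort of a sibling
subtransaction at the same depth, but must still vanish if the parent aborts.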

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 6b69329a52..be58c35191 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -74,8 +74,55 @@ typedef struct PendingRelSync
 	bool		is_truncated;	/* Has the file experienced truncation? */
 } PendingRelSync;

+typedef struct PendingRelTruncate
+{
+	RelFileLocator rlocator;
+	int			nestLevel;
+	struct PendingRelTruncate *next;
+} PendingRelTruncate;
+
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 static HTAB *pendingSyncHash = NULL;
+static HTAB *pendingSkipWalStateHash = NULL;
+static PendingRelTruncate *pendingTruncates = NULL;
+
+static void
+AddPendingTruncate(const RelFileLocator *rlocator)
+{
+	PendingRelTruncate *pending;
+	int			nestLevel = GetCurrentTransactionNestLevel();
+
+	for (pending = pendingTruncates; pending != NULL; pending = pending->next)
+	{
+		if (RelFileLocatorEquals(pending->rlocator, *rlocator))
+		{
+			if (pending->nestLevel > nestLevel)
+				pending->nestLevel = nestLevel;
+			return;
+		}
+	}
+
+	pending = (PendingRelTruncate *)
+		MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelTruncate));
+	pending->rlocator = *rlocator;
+	pending->nestLevel = nestLevel;
+	pending->next = pendingTruncates;
+	pendingTruncates = pending;
+}
+
+static void
+ClearPendingTruncates(void)
+{
+	PendingRelTruncate *pending;
+	PendingRelTruncate *next;
+
+	for (pending = pendingTruncates; pending != NULL; pending = next)
+	{
+		next = pending->next;
+		pfree(pending);
+	}
+	pendingTruncates = NULL;
+}

 /*
@@ -103,6 +150,21 @@ AddPendingSync(const RelFileLocator *rlocator)
 	pending = hash_search(pendingSyncHash, rlocator, HASH_ENTER, &found);
 	Assert(!found);
 	pending->is_truncated = false;
+
+	if (!pendingSkipWalStateHash)
+	{
+		HASHCTL		ctl;
+
+		ctl.keysize = sizeof(RelFileLocator);
+		ctl.entrysize = sizeof(RelFileLocator);
+		ctl.hcxt = TopTransactionContext;
+		pendingSkipWalStateHash = hash_create("pending skip-WAL state hash",
+											  16, &ctl,
+											  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	(void) hash_search(pendingSkipWalStateHash, rlocator, HASH_ENTER, &found);
+	smgrmarkskipwalpending(*rlocator);
 }

 /*
@@ -149,9 +211,7 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,

 	srel = smgropen(rlocator, procNumber);
 	smgrcreate(srel, MAIN_FORKNUM, false);
-
-	if (needs_wal)
-		smgrcreaterelationmetadata(srel);
+	smgrinitnewrelation(srel, needs_wal);
 	if (needs_wal)
 		log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);

@@ -297,6 +357,7 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 	BlockNumber old_blocks[MAX_FORKNUM];
 	BlockNumber blocks[MAX_FORKNUM];
 	int			nforks = 0;
+	XLogRecPtr	truncate_lsn = InvalidXLogRecPtr;
 	SMgrRelation reln;

 	/*
@@ -385,14 +446,11 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 	 *
 	 * (See also visibilitymap.c if changing this code.)
 	 */
-	START_CRIT_SECTION();
-
 	if (RelationNeedsWAL(rel))
 	{
 		/*
 		 * Make an XLOG entry reporting the file truncation.
 		 */
-		XLogRecPtr	lsn;
 		xl_smgr_truncate xlrec;

 		xlrec.blkno = nblocks;
@@ -402,8 +460,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 		XLogBeginInsert();
 		XLogRegisterData(&xlrec, sizeof(xlrec));

-		lsn = XLogInsert(RM_SMGR_ID,
-						 XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+		truncate_lsn = XLogInsert(RM_SMGR_ID,
+								  XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);

 		/*
 		 * Flush, because otherwise the truncation of the main relation might
@@ -413,14 +471,23 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 		 * contain entries for the non-existent heap pages, and standbys would
 		 * also never replay the truncation.
 		 */
-		XLogFlush(lsn);
+		XLogFlush(truncate_lsn);
 	}

+	/*
+	 * Apply storage-manager-specific truncate metadata updates before entering
+	 * the critical section. Umbra uses this to rewrite MAP metadata with the
+	 * truncate WAL LSN while MAP buffer reads and flushes are still legal.
+	 */
+	smgrpretruncate(RelationGetSmgr(rel), forks, nforks, old_blocks, blocks,
+					truncate_lsn);
+
 	/*
 	 * This will first remove any buffers from the buffer pool that should no
 	 * longer exist after truncation is complete, and then truncate the
 	 * corresponding files on disk.
 	 */
+	START_CRIT_SECTION();
 	smgrtruncate(RelationGetSmgr(rel), forks, nforks, old_blocks, blocks);

 	END_CRIT_SECTION();
@@ -453,6 +520,8 @@ RelationPreTruncate(Relation rel)
 {
 	PendingRelSync *pending;

+	AddPendingTruncate(&(RelationGetSmgr(rel)->smgr_rlocator.locator));
+
 	if (!pendingSyncHash)
 		return;

@@ -581,6 +650,30 @@ RelFileLocatorSkippingWAL(RelFileLocator rlocator)
 	return true;
 }

+/*
+ * RelFileLocatorWasTruncated
+ *		Check whether this transaction already marked the relfilenode truncated.
+ *
+ * A relfilenode can emit fresh page images for block 0 later in the same
+ * transaction after XLOG_SMGR_TRUNCATE has already been inserted. Recovery
+ * replays the truncate first and therefore sees no surviving old mapping, so
+ * producer-side first-born mapping logic must treat any still-visible local
+ * pre-truncate mapping as stale in that case.
+ */
+bool
+RelFileLocatorWasTruncated(RelFileLocator rlocator)
+{
+	PendingRelTruncate *pending;
+
+	for (pending = pendingTruncates; pending != NULL; pending = pending->next)
+	{
+		if (RelFileLocatorEquals(pending->rlocator, rlocator))
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * EstimatePendingSyncsSpace
  *		Estimate space needed to pass syncs to parallel workers.
@@ -752,12 +845,17 @@ smgrDoPendingSyncs(bool isCommit, bool isParallelWorker)
 	Assert(GetCurrentTransactionNestLevel() == 1);

 	if (!pendingSyncHash)
+	{
+		ClearPendingTruncates();
 		return;					/* no relation needs sync */
+	}

 	/* Abort -- just throw away all pending syncs */
 	if (!isCommit)
 	{
 		pendingSyncHash = NULL;
+		pendingSkipWalStateHash = NULL;
+		ClearPendingTruncates();
 		return;
 	}

@@ -767,6 +865,8 @@ smgrDoPendingSyncs(bool isCommit, bool isParallelWorker)
 	if (isParallelWorker)
 	{
 		pendingSyncHash = NULL;
+		pendingSkipWalStateHash = NULL;
+		ClearPendingTruncates();
 		return;
 	}

@@ -783,6 +883,7 @@ smgrDoPendingSyncs(bool isCommit, bool isParallelWorker)
 		BlockNumber nblocks[MAX_FORKNUM + 1];
 		uint64		total_blocks = 0;
 		SMgrRelation srel;
+		bool		require_storage_sync;

 		srel = smgropen(pendingsync->rlocator, INVALID_PROC_NUMBER);

@@ -798,6 +899,12 @@ smgrDoPendingSyncs(bool isCommit, bool isParallelWorker)
 		{
 			for (fork = 0; fork <= MAX_FORKNUM; fork++)
 			{
+				if (smgrisinternalfork(fork))
+				{
+					nblocks[fork] = InvalidBlockNumber;
+					continue;
+				}
+
 				if (smgrexists(srel, fork))
 				{
 					BlockNumber n = smgrnblocks(srel, fork);
@@ -812,6 +919,8 @@ smgrDoPendingSyncs(bool isCommit, bool isParallelWorker)
 			}
 		}

+		require_storage_sync = smgrpreparependingsync(srel);
+
 		/*
 		 * Sync file or emit WAL records for its contents.
 		 *
@@ -825,6 +934,18 @@ smgrDoPendingSyncs(bool isCommit, bool isParallelWorker)
 		 * main fork is longer than ever but FSM fork gets shorter.
 		 */
 		if (pendingsync->is_truncated ||
+			/*
+			 * New relfilenumbers that are still in PostgreSQL's mandatory
+			 * WAL-skipping state must reach disk via flush+sync, not via
+			 * log_newpage_range(). See "Skipping WAL for New RelFileLocator"
+			 * in src/backend/access/transam/README.
+			 */
+			RelFileLocatorSkippingWAL(pendingsync->rlocator) ||
+			/*
+			 * Some storage managers require flush+sync for their own durable
+			 * transition protocol even when the relation is small.
+			 */
+			require_storage_sync ||
 			total_blocks >= wal_skip_threshold * (uint64) 1024 / BLCKSZ)
 		{
 			/* allocate the initial array, or extend it, if needed */
@@ -866,12 +987,26 @@ smgrDoPendingSyncs(bool isCommit, bool isParallelWorker)
 	}

 	pendingSyncHash = NULL;
+	ClearPendingTruncates();

 	if (nrels > 0)
 	{
 		smgrdosyncall(srels, nrels);
+
+		if (pendingSkipWalStateHash)
+		{
+			HASH_SEQ_STATUS clear_scan;
+			RelFileLocator *rlocator;
+
+			hash_seq_init(&clear_scan, pendingSkipWalStateHash);
+			while ((rlocator = (RelFileLocator *) hash_seq_search(&clear_scan)) != NULL)
+				smgrclearskipwalpending(*rlocator);
+		}
+
 		pfree(srels);
 	}
+
+	pendingSkipWalStateHash = NULL;
 }

 /*
@@ -945,6 +1080,8 @@ PostPrepare_smgr(void)
 		/* must explicitly free the list entry */
 		pfree(pending);
 	}
+
+	ClearPendingTruncates();
 }

@@ -958,12 +1095,21 @@ AtSubCommit_smgr(void)
 {
 	int			nestLevel = GetCurrentTransactionNestLevel();
 	PendingRelDelete *pending;
+	PendingRelTruncate *pending_truncate;

 	for (pending = pendingDeletes; pending != NULL; pending = pending->next)
 	{
 		if (pending->nestLevel >= nestLevel)
 			pending->nestLevel = nestLevel - 1;
 	}
+
+	for (pending_truncate = pendingTruncates;
+		 pending_truncate != NULL;
+		 pending_truncate = pending_truncate->next)
+	{
+		if (pending_truncate->nestLevel >= nestLevel)
+			pending_truncate->nestLevel = nestLevel - 1;
+	}
 }

 /*
@@ -976,7 +1122,28 @@ AtSubCommit_smgr(void)
 void
 AtSubAbort_smgr(void)
 {
+	PendingRelTruncate *pending;
+	PendingRelTruncate *prev = NULL;
+	PendingRelTruncate *next;
+	int			nestLevel = GetCurrentTransactionNestLevel();
+
 	smgrDoPendingDeletes(false);
+
+	for (pending = pendingTruncates; pending != NULL; pending = next)
+	{
+		next = pending->next;
+
+		if (pending->nestLevel >= nestLevel)
+		{
+			if (prev)
+				prev->next = next;
+			else
+				pendingTruncates = next;
+			pfree(pending);
+		}
+		else
+			prev = pending;
+	}
 }

void
@@ -993,11 +1160,12 @@ smgr_redo(XLogReaderState *record)
 		xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record);
 		SMgrRelation reln;

-		reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
-		smgrcreate(reln, xlrec->forkNum, true);
-	}
-	else if (info == XLOG_SMGR_TRUNCATE)
-	{
+			reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
+			smgrcreate(reln, xlrec->forkNum, true);
+			smgrredocreatefork(reln, xlrec->forkNum, lsn);
+		}
+		else if (info == XLOG_SMGR_TRUNCATE)
+		{
 		xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
 		SMgrRelation reln;
 		Relation	rel;
@@ -1016,7 +1184,7 @@ smgr_redo(XLogReaderState *record)
 		 * log as best we can until the drop is seen.
 		 */
 		smgrcreate(reln, MAIN_FORKNUM, true);
-		smgrcreaterelationmetadata(reln);
+		smgrredocreatefork(reln, MAIN_FORKNUM, lsn);

 		/*
 		 * Before we perform the truncation, update minimum recovery point to
@@ -1077,6 +1245,7 @@ smgr_redo(XLogReaderState *record)
 		/* Do the real work to truncate relation forks */
 		if (nforks > 0)
 		{
+			smgrpretruncate(reln, forks, nforks, old_blocks, blocks, lsn);
 			START_CRIT_SECTION();
 			smgrtruncate(reln, forks, nforks, old_blocks, blocks);
 			END_CRIT_SECTION();
@@ -1087,7 +1256,7 @@ smgr_redo(XLogReaderState *record)
 		 * important because the just-truncated pages were likely marked as
 		 * all-free, and would be preferentially selected.
 		 */
-		if (need_fsm_vacuum)
+		if (need_fsm_vacuum && smgrneedsrecoveryfsmvacuum(reln))
 			FreeSpaceMapVacuumRange(rel, xlrec->blkno,
 									InvalidBlockNumber);

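The AtSubCommit_smgr() and AtSubAbort_smgr() hunks above apply the usual pending-list nesting rule to pendingTruncates: on subcommit an entry is handed to the parent level, and on subabort every entry at or below the aborting level is unlinked and freed. A minimal standalone sketch of that rule follows. The DemoPending type and demo_* helpers are hypothetical names for illustration, not part of the patch, and malloc/free stand in for palloc/pfree.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for PendingRelTruncate: only the fields the rule needs. */
typedef struct DemoPending
{
	int			id;
	int			nest_level;
	struct DemoPending *next;
} DemoPending;

/* Push a new entry at the head of the list and return the new head. */
DemoPending *
demo_push(DemoPending *head, int id, int nest_level)
{
	DemoPending *p = malloc(sizeof(DemoPending));

	if (p == NULL)
		abort();
	p->id = id;
	p->nest_level = nest_level;
	p->next = head;
	return p;
}

/* Subcommit: entries owned by the committing level move to its parent. */
void
demo_subcommit(DemoPending *head, int nest_level)
{
	DemoPending *p;

	for (p = head; p != NULL; p = p->next)
	{
		if (p->nest_level >= nest_level)
			p->nest_level = nest_level - 1;
	}
}

/* Subabort: unlink and free entries owned by the aborting level or deeper. */
DemoPending *
demo_subabort(DemoPending *head, int nest_level)
{
	DemoPending *p;
	DemoPending *prev = NULL;
	DemoPending *next;

	for (p = head; p != NULL; p = next)
	{
		next = p->next;
		if (p->nest_level >= nest_level)
		{
			if (prev)
				prev->next = next;
			else
				head = next;
			free(p);
		}
		else
			prev = p;
	}
	return head;
}

/* Count the entries that remain on the list. */
int
demo_count(const DemoPending *head)
{
	int			n = 0;

	for (; head != NULL; head = head->next)
		n++;
	return n;
}
```

With entries at levels 1, 2 and 3, a subcommit at level 3 followed by a subabort at level 2 leaves only the level-1 entry, which is the behavior the two loops in the patch are meant to provide.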
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 540f346d53..3bf2db8fdf 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5491,7 +5491,8 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 	{
 		if (smgrexists(src_rel, forkNum))
 		{
-			smgrcreate(dst_rel, forkNum, false);
+			if (!smgrisinternalfork(forkNum))
+				smgrcreate(dst_rel, forkNum, false);

 			/*
 			 * WAL log creation if the relation is persistent, or this is the
@@ -5500,9 +5501,12 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 			if (permanent || forkNum == INIT_FORKNUM)
 				log_smgrcreate(&dst_rlocator, forkNum);

-			/* Copy a fork's data, block by block. */
-			RelationCopyStorageUsingBuffer(src_rlocator, dst_rlocator, forkNum,
-										   permanent);
+			if (!smgrisinternalfork(forkNum))
+			{
+				/* Copy a fork's data, block by block. */
+				RelationCopyStorageUsingBuffer(src_rlocator, dst_rlocator, forkNum,
+											   permanent);
+			}
 		}
 	}

diff --git a/src/backend/storage/map/Makefile b/src/backend/storage/map/Makefile
index 08c3b69679..94ae1c1b72 100644
--- a/src/backend/storage/map/Makefile
+++ b/src/backend/storage/map/Makefile
@@ -18,6 +18,7 @@ OBJS = \
 	mapbuf.o \
 	mapflush.o \
 	mapclock.o \
+	mapinflight.o \
 	mapsuper.o

 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/map/map.c b/src/backend/storage/map/map.c
index 1c74aa94ef..bd839a3e9f 100644
--- a/src/backend/storage/map/map.c
+++ b/src/backend/storage/map/map.c
@@ -43,6 +43,8 @@ typedef struct MapTruncatePreloadState

static MapTruncatePreloadState MapTruncatePreload[MAX_FORKNUM + 1];

+#define MAP_PENDING_WAIT_RETRIES 10000
+#define MAP_PENDING_WAIT_USEC 1000

 typedef enum MapCachedLookupResult
 {
@@ -114,6 +116,20 @@ static MapCachedLookupResult MapTryLookupCachedPblknoInternal(RelFileLocator rno
 															  BlockNumber lblkno,
 															  bool adjust_usage,
 															  BlockNumber *pblkno);
+static bool MapTryReserveFreshPblknoInternal(UmbraFileContext *map_ctx,
+											 RelFileLocator rnode,
+											 ForkNumber forknum,
+											 BlockNumber lblkno,
+											 BlockNumber *new_pblkno,
+											 bool nowait);
+static bool MapWaitForForeignInflightToClear(RelFileLocator rnode,
+											 ForkNumber forknum,
+											 BlockNumber lblkno);
+bool MapReserveNextPblkno(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						  ForkNumber forknum, BlockNumber lblkno,
+						  BlockNumber *new_pblkno,
+						  bool nowait);
+
 void
 MapResetAllTruncatePreloads(void)
 {
@@ -531,13 +547,207 @@ done:
 	return run_blocks;
 }

+/*
+ * Reserve a brand-new in-flight physical target for lblkno.
+ *
+ * The shared in-flight bit for lblkno is claimed before the frontier is
+ * advanced. If that bit is already held, the claim fails and this helper
+ * returns false without consuming a frontier block. Callers that want a
+ * stable winner should fall back to a lookup after a false
+ * return.
+ */
+bool
+MapTryReserveFreshPblkno(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						 ForkNumber forknum, BlockNumber lblkno,
+						 BlockNumber *new_pblkno, bool nowait)
+{
+	return MapTryReserveFreshPblknoInternal(map_ctx, rnode, forknum, lblkno,
+											new_pblkno, nowait);
+}
+
+static bool
+MapTryReserveFreshPblknoInternal(UmbraFileContext *map_ctx, RelFileLocator rnode,
+								 ForkNumber forknum, BlockNumber lblkno,
+								 BlockNumber *new_pblkno, bool nowait)
+{
+	MapSuperEntry *entry;
+	BlockNumber		next;
+	uint32			flags;
+	bool			reserved = false;
+
+	Assert(new_pblkno != NULL);
+
+	if (!MapForkHasMappedState(forknum))
+		return false;
+
+	if (!MapSBlockEnsureLoaded(map_ctx, rnode))
+		return false;
+
+	if (!MapInflightTryClaim(map_ctx, rnode, forknum, lblkno))
+		return false;
+
+	PG_TRY();
+	{
+		if (nowait)
+		{
+			if (!MapSuperFindEntryTryLocked(rnode, LW_EXCLUSIVE, &entry))
+				goto reserve_done;
+		}
+		else
+		{
+			if (!MapSuperFindEntryLocked(rnode, LW_EXCLUSIVE, &entry))
+				goto reserve_done;
+		}
+
+		flags = entry->flags;
+		if ((flags & MAPSUPER_FLAG_CORRUPT) ||
+			!MapSuperblockHasValidIdentity(&entry->super) ||
+			((flags & MAPSUPER_FLAG_DIRTY) == 0 &&
+			 !MapSuperblockCheckCRC(&entry->super)))
+		{
+			LWLockRelease(&entry->lock);
+			if (!InRecovery)
+				MapSBlockReportCorrupt(rnode, "invalid identity or CRC");
+			goto reserve_done;
+		}
+
+		Assert(MapNormalizeForkBlockCount(forknum,
+										  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																			forknum)) <=
+			   MapSuperGetReservedNextFree(entry, forknum));
+		next = MapSuperGetReservedNextFree(entry, forknum);
+		if (next == InvalidBlockNumber - 1)
+		{
+			LWLockRelease(&entry->lock);
+			ereport(ERROR,
+					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+					 errmsg("cannot allocate more physical blocks for relation %u/%u/%u fork %d",
+							rnode.spcOid, rnode.dbOid, rnode.relNumber, forknum)));
+		}
+
+		MapSuperSetReservedNextFree(entry, forknum, next + 1);
+		Assert(MapNormalizeForkBlockCount(forknum,
+										  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																			forknum)) <=
+			   MapSuperGetReservedNextFree(entry, forknum));
+
+		*new_pblkno = next;
+		LWLockRelease(&entry->lock);
+		MapInflightFinishClaim(rnode, forknum, lblkno, next);
+		reserved = true;
+
+reserve_done:
+		;
+	}
+	PG_CATCH();
+	{
+		MapInflightRelease(rnode, forknum, lblkno);
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	if (!reserved)
+	{
+		MapInflightRelease(rnode, forknum, lblkno);
+		return false;
+	}
+
+	return true;
+}
+
+static bool
+MapWaitForForeignInflightToClear(RelFileLocator rnode,
+								 ForkNumber forknum,
+								 BlockNumber lblkno)
+{
+	int			wait_retries = 0;
+
+	for (;;)
+	{
+		if (!MapInflightBitIsSet(rnode, forknum, lblkno))
+		{
+			if (wait_retries > 0)
+				elog(LOG,
+					 "foreground remap waited for in-flight remap on relation %u/%u/%u fork %d block %u (%d retries, %d usec)",
+					 rnode.spcOid, rnode.dbOid, rnode.relNumber,
+					 forknum, lblkno,
+					 wait_retries,
+					 wait_retries * MAP_PENDING_WAIT_USEC);
+			return wait_retries > 0;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+		pg_usleep(MAP_PENDING_WAIT_USEC);
+		wait_retries++;
+
+		if (wait_retries >= MAP_PENDING_WAIT_RETRIES)
+			ereport(ERROR,
+					(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+					 errmsg("timed out waiting for in-flight remap of relation %u/%u/%u fork %d block %u",
+							rnode.spcOid, rnode.dbOid, rnode.relNumber,
+							forknum, lblkno)));
+	}
+}
+
+/*
+ * MapReserveNextPblkno - return a locally owned in-flight pblk for lblkno,
+ * creating it if needed.
+ *
+ * The shared MAP bit remains visible to other backends so they can observe
+ * contention, but the selected pblk stays owner-local and must not be
+ * borrowed by foreign callers. Foreign collisions are surfaced as false so
+ * callers can fail or fallback according to their ownership rules.
+ */
+bool
+MapReserveNextPblkno(UmbraFileContext *map_ctx, RelFileLocator rnode,
+					 ForkNumber forknum, BlockNumber lblkno,
+					 BlockNumber *new_pblkno, bool nowait)
+{
+	if (MapInflightLookupOwnedPblk(rnode, forknum, lblkno, new_pblkno))
+		return true;
+
+	if (MapTryReserveFreshPblknoInternal(map_ctx, rnode, forknum, lblkno,
+										 new_pblkno, nowait))
+		return true;
+
+	return MapInflightLookupOwnedPblk(rnode, forknum, lblkno, new_pblkno);
+}
+
+/*
+ * Reserve a fresh physical target without consulting or updating any current
+ * mapping entry.
+ *
+ * Callers use this when they already own first publication for lblkno and
+ * must not reuse a locally visible old mapping.
+ */
+bool
+MapReserveFreshPblkno(UmbraFileContext *map_ctx, RelFileLocator rnode,
+					  ForkNumber forknum, BlockNumber lblkno,
+					  BlockNumber *new_pblkno)
+{
+	if (MapReserveNextPblkno(map_ctx, rnode, forknum, lblkno,
+							 new_pblkno, false))
+	{
+		return true;
+	}
+
+	/*
+	 * "Fresh" callers are first-born owners. Seeing a foreign in-flight owner
+	 * here means the caller's ownership assumption was wrong, so surface the
+	 * conflict immediately and let the caller decide whether this is an
+	 * invariant violation or a fast-path fallback.
+	 */
+	return false;
+}
+

 /*
  * MapReadBuffer - read a map page into buffer
  *
  * Returns the slot_id of the buffer, with the buffer pinned.
  *
- * The caller owns the returned buffer pin.
+ * This function is extern because direct mapping publication needs to load
+ * and update MAP pages from outside map.c.
  */
 int
 MapReadBuffer(UmbraFileContext *map_ctx, RelFileLocator rnode,
@@ -629,9 +839,19 @@ MapReadBuffer(UmbraFileContext *map_ctx, RelFileLocator rnode,
 				continue;
 			}
 		}
+
+		if (buf->pending_count != 0)
+		{
+			LWLockRelease(&buf->buffer_lock);
+			MapUnpinBuffer(slot_id);
+			continue;
+		}
+
 		old_page_number = buf->page_number;
 		old_forknum = buf->forknum;
 		old_rnode = buf->rnode;
+		MemSet(buf->pending_bits, 0, sizeof(buf->pending_bits));
+
 		existing_slot_id = MapCacheInsert(rnode, forknum, map_blkno, slot_id);
 		if (existing_slot_id >= 0 && existing_slot_id != slot_id)
 			retry = true;
@@ -1043,7 +1263,7 @@ MapInvalidateDatabase(Oid dbid)
 }

 /*
- * MapGetLogicalBlockCount - return the persisted logical block count.
+ * MapGetLogicalBlockCount - get logical block count from the MAP superblock
  */
 BlockNumber
 MapGetLogicalBlockCount(UmbraFileContext *map_ctx, RelFileLocator rnode, ForkNumber forknum)
@@ -1137,3 +1357,114 @@ MapGetPhysicalBlockCount(UmbraFileContext *map_ctx, RelFileLocator rnode,

 	return max_pblkno + 1;
 }
+
+/*
+ * MapGetNewPblkno - allocate new physical block and return both old and new
+ *
+ * This queries the current mapping, reserves a new physical block, and returns
+ * both the old and new physical block numbers to the caller that owns the
+ * mapping transition.
+ *
+ * Parameters:
+ *   map_ctx: MAP fork context for file I/O operations
+ *   rnode: relation identifier
+ *   forknum: fork number
+ *   lblkno: logical block number
+ *   new_pblkno: output parameter for the new physical block number
+ *   old_pblkno: output parameter for the old physical block number
+ */
+void MapGetNewPblkno(UmbraFileContext *map_ctx, RelFileLocator rnode, ForkNumber forknum,
+				BlockNumber lblkno, BlockNumber *new_pblkno,
+				BlockNumber *old_pblkno)
+{
+	BlockNumber cur_pblkno;
+
+	Assert(new_pblkno != NULL);
+	Assert(old_pblkno != NULL);
+
+	for (;;)
+	{
+		if (!MapTryLookup(map_ctx, rnode, forknum, lblkno, &cur_pblkno))
+		{
+			*old_pblkno = InvalidBlockNumber;
+		}
+		else
+		{
+			*old_pblkno = cur_pblkno;
+		}
+
+		if (MapReserveNextPblkno(map_ctx, rnode, forknum, lblkno,
+								 new_pblkno, false))
+			break;
+
+		/*
+		 * A foreign in-flight owner controls the current mapping publication
+		 * decision. Wait for it to publish, then retry against committed MAP
+		 * state instead of guessing from buffers or page LSNs.
+		 */
+		if (!MapWaitForForeignInflightToClear(rnode, forknum, lblkno))
+			elog(ERROR,
+				 "failed to reserve physical block for relation %u/%u/%u fork %d blk %u",
+				 rnode.spcOid, rnode.dbOid, rnode.relNumber, forknum, lblkno);
+	}
+}
+
+/*
+ * MapSetMapping - set a mapping entry directly
+ *
+ * This function sets a mapping entry directly without allocating a new
+ * physical block. Callers must already own the mapping publication decision.
+ *
+ * Parameters:
+ *   map_ctx: MAP fork context for file I/O operations
+ *   rnode: relation identifier
+ *   forknum: fork number
+ *   lblkno: logical block number
+ *   new_pblkno: physical block number to set
+ *
+ * This function does NOT allocate a new physical block and does NOT advance
+ * superblock frontier/watermark state. Callers that own a fresh physical
+ * allocation must publish frontier changes explicitly.
+ */
+void
+MapSetMapping(UmbraFileContext *map_ctx, RelFileLocator rnode, ForkNumber forknum,
+			  BlockNumber lblkno, BlockNumber new_pblkno, XLogRecPtr map_lsn)
+{
+	BlockNumber  map_blkno;
+	int          slot_id;
+	int          entry_idx;
+	MapPage     *page;
+	MapBufferDesc *buf;
+
+	/* Convert (forknum, lblkno) to Umbra metadata-fork block number */
+	map_blkno = MapLblknoToMapBlkno(forknum, lblkno);
+	entry_idx = map_blkno % MAP_ENTRIES_PER_PAGE;
+	map_blkno = map_blkno / MAP_ENTRIES_PER_PAGE;
+
+	/* Read or allocate the map page */
+	slot_id = MapReadBuffer(map_ctx, rnode, forknum, map_blkno);
+	buf = &MapBuffers[slot_id];
+	page = MapGetPage(slot_id);
+
+	/* Acquire buffer lock for modifying map page content */
+	LWLockAcquire(&buf->buffer_lock, LW_EXCLUSIVE);
+
+	/* Set the mapping directly */
+	page->pblknos[entry_idx] = new_pblkno;
+
+	/* Record LSN while holding page lock to avoid content/LSN reordering. */
+	if (map_lsn == InvalidXLogRecPtr)
+	{
+		if (InRecovery)
+			map_lsn = GetXLogReplayRecPtr(NULL);
+		else
+			map_lsn = GetXLogWriteRecPtr();
+	}
+	MapMarkBufferDirty(map_ctx, buf, map_lsn);
+
+	/* Release buffer lock after modification */
+	LWLockRelease(&buf->buffer_lock);
+
+	/* Unpin the buffer */
+	MapUnpinBuffer(slot_id);
+}
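MapWaitForForeignInflightToClear() in the map.c hunks above is a bounded polling loop: with MAP_PENDING_WAIT_RETRIES = 10000 and MAP_PENDING_WAIT_USEC = 1000, the sleep budget is about 10 seconds before it raises an error. It also distinguishes "the bit was never set" (returns false, which MapGetNewPblkno() treats as a reservation failure) from "the bit was set and then cleared" (returns true, so the caller retries). The sketch below models just that control flow, under the assumption that the sleep and CHECK_FOR_INTERRUPTS() can be left out. DemoWaitResult, DemoBit and the demo_* functions are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* The three outcomes the patch encodes as false / true / ereport(ERROR). */
typedef enum DemoWaitResult
{
	DEMO_WAIT_NOT_NEEDED,		/* bit was clear on the first check */
	DEMO_WAIT_CLEARED,			/* bit was set, then cleared while polling */
	DEMO_WAIT_TIMED_OUT			/* retry budget exhausted */
} DemoWaitResult;

/* Hypothetical predicate standing in for MapInflightBitIsSet(). */
typedef bool (*demo_bit_is_set_fn) (void *arg);

/* Test double: reports "set" for a fixed number of polls, then "clear". */
typedef struct DemoBit
{
	int			polls_until_clear;
} DemoBit;

bool
demo_bit_is_set(void *arg)
{
	DemoBit    *bit = (DemoBit *) arg;

	if (bit->polls_until_clear == 0)
		return false;
	bit->polls_until_clear--;
	return true;
}

/* Poll until the bit clears or max_retries polls have seen it set. */
DemoWaitResult
demo_wait_for_clear(demo_bit_is_set_fn is_set, void *arg,
					int max_retries, int *retries_out)
{
	int			retries = 0;

	for (;;)
	{
		if (!is_set(arg))
		{
			*retries_out = retries;
			return retries > 0 ? DEMO_WAIT_CLEARED : DEMO_WAIT_NOT_NEEDED;
		}

		/* The patch sleeps MAP_PENDING_WAIT_USEC here; omitted in this sketch. */
		retries++;

		if (retries >= max_retries)
		{
			*retries_out = retries;
			return DEMO_WAIT_TIMED_OUT;
		}
	}
}
```

Separating the three outcomes makes it easier to see why a false return from the wait helper becomes an elog(ERROR) in the caller: the reservation failed even though no foreign owner was visible.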
diff --git a/src/backend/storage/map/mapbuf.c b/src/backend/storage/map/mapbuf.c
index cb8b59dfbc..4e36a9b79e 100644
--- a/src/backend/storage/map/mapbuf.c
+++ b/src/backend/storage/map/mapbuf.c
@@ -258,6 +258,8 @@ MapBackendExitCleanup(void)
 	 * progress even if current backend is leaving via ERROR/abort.
 	 */
 	MapAbortBufferIO();
+	MapInflightCleanupOwned();
+
 	if (MapPrivateRefCount == NULL)
 		return;

@@ -401,6 +403,18 @@ retry:
 		MapWaitIO(buf);
 		goto retry;
 	}
+
+	if (buf->pending_count != 0)
+	{
+		LWLockRelease(&buf->buffer_lock);
+		LWLockRelease(&buf->io_in_progress_lock);
+		CHECK_FOR_INTERRUPTS();
+		pg_usleep(1000L);
+		goto retry;
+	}
+
+	MemSet(buf->pending_bits, 0, sizeof(buf->pending_bits));
+
 	buf->page_number = -1;
 	buf->forknum = InvalidForkNumber;
 	memset(&buf->rnode, 0, sizeof(RelFileLocator));
diff --git a/src/backend/storage/map/mapclock.c b/src/backend/storage/map/mapclock.c
index 6fa62e1c1a..3ccdbb2310 100644
--- a/src/backend/storage/map/mapclock.c
+++ b/src/backend/storage/map/mapclock.c
@@ -308,7 +308,8 @@ MapClockGetBuffer(void)
 			local_buf_state = pg_atomic_read_u32(&buf->state);

 			if (MAPBUF_GET_REFCOUNT(local_buf_state) == 0 &&
-				MAPBUF_GET_USAGECOUNT(local_buf_state) == 0)
+				MAPBUF_GET_USAGECOUNT(local_buf_state) == 0 &&
+				buf->pending_count == 0)
 			{
 				/* Found a usable buffer */
 				pg_atomic_fetch_add_u32(&MapShared->num_allocs, 1);
@@ -344,7 +345,8 @@ MapClockGetBuffer(void)
 		 * If the buffer is pinned, we cannot use it.
 		 * If it has a non-zero usage_count, decrement it and continue.
 		 */
-		if (MAPBUF_GET_REFCOUNT(local_buf_state) == 0)
+		if (MAPBUF_GET_REFCOUNT(local_buf_state) == 0 &&
+			buf->pending_count == 0)
 		{
 			if (MAPBUF_GET_USAGECOUNT(local_buf_state) != 0)
 			{
diff --git a/src/backend/storage/map/mapinflight.c b/src/backend/storage/map/mapinflight.c
new file mode 100644
index 0000000000..c204a1fc03
--- /dev/null
+++ b/src/backend/storage/map/mapinflight.c
@@ -0,0 +1,402 @@
+/*-------------------------------------------------------------------------
+ *
+ * mapinflight.c
+ *	  MAP in-flight remap ownership tracking.
+ *
+ * Shared state is deliberately only a per-MAP-buffer bitmap. The bitmap
+ * serializes remaps of the same logical MAP entry across backends; the pblk
+ * reserved by the owner is backend-local, so other backends cannot borrow an
+ * uncommitted physical target.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "storage/map.h"
+#include "storage/map_internal.h"
+#include "utils/memutils.h"
+
+typedef struct MapInflightLocalEntry
+{
+	RelFileLocator rnode;
+	ForkNumber	forknum;
+	BlockNumber	lblkno;
+	BlockNumber	pblkno;
+	int			slot_id;
+	int			entry_idx;
+} MapInflightLocalEntry;
+
+static MemoryContext MapInflightLocalCxt = NULL;
+static MapInflightLocalEntry *MapInflightLocalEntries = NULL;
+static int	MapInflightLocalCount = 0;
+static int	MapInflightLocalCapacity = 0;
+
+static void MapInflightLocalEnsureContext(void);
+static void MapInflightLocalEnsureCapacity(int needed);
+static int	MapInflightLocalFind(RelFileLocator rnode, ForkNumber forknum,
+								 BlockNumber lblkno);
+static void MapInflightLocalForget(RelFileLocator rnode, ForkNumber forknum,
+								   BlockNumber lblkno);
+static void MapInflightLocalRememberPrepared(RelFileLocator rnode,
+											 ForkNumber forknum,
+											 BlockNumber lblkno,
+											 BlockNumber pblkno,
+											 int slot_id, int entry_idx);
+static void MapInflightDecode(ForkNumber forknum, BlockNumber lblkno,
+							  BlockNumber *map_blkno, int *entry_idx);
+static inline uint64 MapInflightEntryMask(int entry_idx);
+static bool MapInflightBufferTryClaim(RelFileLocator rnode, ForkNumber forknum,
+									  BlockNumber map_blkno, int slot_id,
+									  int entry_idx);
+static void MapInflightBufferRelease(int slot_id, int entry_idx);
+
+static void
+MapInflightLocalEnsureContext(void)
+{
+	if (MapInflightLocalCxt == NULL)
+	{
+		MapInflightLocalCxt = AllocSetContextCreate(TopMemoryContext,
+												   "MapInflightLocal",
+												   ALLOCSET_DEFAULT_SIZES);
+		MemoryContextAllowInCriticalSection(MapInflightLocalCxt, true);
+	}
+}
+
+void
+MapInflightBackendInit(void)
+{
+	MapInflightLocalEnsureContext();
+}
+
+static void
+MapInflightLocalEnsureCapacity(int needed)
+{
+	int			new_capacity;
+
+	MapInflightLocalEnsureContext();
+
+	if (MapInflightLocalCapacity >= needed)
+		return;
+
+	new_capacity = (MapInflightLocalCapacity == 0) ? 16 : MapInflightLocalCapacity;
+	while (new_capacity < needed)
+		new_capacity *= 2;
+
+	if (MapInflightLocalEntries == NULL)
+		MapInflightLocalEntries =
+			MemoryContextAlloc(MapInflightLocalCxt,
+							   sizeof(MapInflightLocalEntry) * new_capacity);
+	else
+		MapInflightLocalEntries =
+			repalloc(MapInflightLocalEntries,
+					 sizeof(MapInflightLocalEntry) * new_capacity);
+
+	MapInflightLocalCapacity = new_capacity;
+}
+
+static int
+MapInflightLocalFind(RelFileLocator rnode, ForkNumber forknum,
+					 BlockNumber lblkno)
+{
+	int			i;
+
+	for (i = 0; i < MapInflightLocalCount; i++)
+	{
+		MapInflightLocalEntry *entry = &MapInflightLocalEntries[i];
+
+		if (!RelFileLocatorEquals(entry->rnode, rnode))
+			continue;
+		if (entry->forknum != forknum)
+			continue;
+		if (entry->lblkno != lblkno)
+			continue;
+		return i;
+	}
+
+	return -1;
+}
+
+static void
+MapInflightLocalRememberPrepared(RelFileLocator rnode, ForkNumber forknum,
+								 BlockNumber lblkno, BlockNumber pblkno,
+								 int slot_id, int entry_idx)
+{
+	Assert(MapInflightLocalFind(rnode, forknum, lblkno) < 0);
+	Assert(MapInflightLocalCount < MapInflightLocalCapacity);
+
+	MapInflightLocalEntries[MapInflightLocalCount].rnode = rnode;
+	MapInflightLocalEntries[MapInflightLocalCount].forknum = forknum;
+	MapInflightLocalEntries[MapInflightLocalCount].lblkno = lblkno;
+	MapInflightLocalEntries[MapInflightLocalCount].pblkno = pblkno;
+	MapInflightLocalEntries[MapInflightLocalCount].slot_id = slot_id;
+	MapInflightLocalEntries[MapInflightLocalCount].entry_idx = entry_idx;
+	MapInflightLocalCount++;
+}
+
+static void
+MapInflightDecode(ForkNumber forknum, BlockNumber lblkno,
+				  BlockNumber *map_blkno, int *entry_idx)
+{
+	BlockNumber	map_entry_no;
+
+	Assert(map_blkno != NULL);
+	Assert(entry_idx != NULL);
+
+	map_entry_no = MapLblknoToMapBlkno(forknum, lblkno);
+	*entry_idx = map_entry_no % MAP_ENTRIES_PER_PAGE;
+	*map_blkno = map_entry_no / MAP_ENTRIES_PER_PAGE;
+	Assert(*entry_idx >= 0 && *entry_idx < MAP_ENTRIES_PER_PAGE);
+}
+
+static inline uint64
+MapInflightEntryMask(int entry_idx)
+{
+	Assert(entry_idx >= 0 && entry_idx < MAP_ENTRIES_PER_PAGE);
+	return UINT64CONST(1) << (entry_idx % MAP_PENDING_BITS_PER_WORD);
+}
+
+static bool
+MapInflightBufferTryClaim(RelFileLocator rnode, ForkNumber forknum,
+						  BlockNumber map_blkno, int slot_id, int entry_idx)
+{
+	MapBufferDesc *buf = &MapBuffers[slot_id];
+	int			word_idx = entry_idx / MAP_PENDING_BITS_PER_WORD;
+	uint64		mask = MapInflightEntryMask(entry_idx);
+	bool		claimed = false;
+
+	LWLockAcquire(&buf->buffer_lock, LW_EXCLUSIVE);
+	if (buf->page_number == map_blkno &&
+		buf->forknum == forknum &&
+		RelFileLocatorEquals(buf->rnode, rnode))
+	{
+		if ((buf->pending_bits[word_idx] & mask) == 0)
+		{
+			buf->pending_bits[word_idx] |= mask;
+			buf->pending_count++;
+			claimed = true;
+		}
+	}
+	LWLockRelease(&buf->buffer_lock);
+
+	return claimed;
+}
+
+static void
+MapInflightBufferRelease(int slot_id, int entry_idx)
+{
+	MapBufferDesc *buf = &MapBuffers[slot_id];
+	int			word_idx = entry_idx / MAP_PENDING_BITS_PER_WORD;
+	uint64		mask = MapInflightEntryMask(entry_idx);
+
+	LWLockAcquire(&buf->buffer_lock, LW_EXCLUSIVE);
+	Assert(buf->pending_bits[word_idx] & mask);
+	if ((buf->pending_bits[word_idx] & mask) != 0)
+	{
+		buf->pending_bits[word_idx] &= ~mask;
+		Assert(buf->pending_count > 0);
+		buf->pending_count--;
+	}
+	LWLockRelease(&buf->buffer_lock);
+
+	MapUnpinBuffer(slot_id);
+}
+
+static void
+MapInflightLocalForget(RelFileLocator rnode, ForkNumber forknum,
+					   BlockNumber lblkno)
+{
+	int			idx;
+
+	idx = MapInflightLocalFind(rnode, forknum, lblkno);
+	if (idx < 0)
+		return;
+
+	MapInflightLocalCount--;
+	if (idx != MapInflightLocalCount)
+		MapInflightLocalEntries[idx] = MapInflightLocalEntries[MapInflightLocalCount];
+}
+
+bool
+MapInflightLookupOwnedPblk(RelFileLocator rnode,
+						   ForkNumber forknum,
+						   BlockNumber lblkno,
+						   BlockNumber *pblkno)
+{
+	int			idx;
+
+	Assert(pblkno != NULL);
+
+	idx = MapInflightLocalFind(rnode, forknum, lblkno);
+	if (idx < 0)
+		return false;
+	if (MapInflightLocalEntries[idx].pblkno == InvalidBlockNumber)
+		return false;
+
+	*pblkno = MapInflightLocalEntries[idx].pblkno;
+	return true;
+}
+
+bool
+MapInflightTryClaimBarrier(UmbraFileContext *map_ctx,
+						   RelFileLocator rnode,
+						   ForkNumber forknum,
+						   BlockNumber lblkno,
+						   MapInflightBarrier *barrier)
+{
+	BlockNumber	map_blkno;
+	int			entry_idx;
+	int			slot_id;
+
+	Assert(barrier != NULL);
+	Assert(!barrier->valid);
+
+	if (MapInflightLocalFind(rnode, forknum, lblkno) >= 0)
+		elog(ERROR,
+			 "cannot claim write barrier while owning in-flight remap for relation %u/%u/%u fork %d block %u",
+			 rnode.spcOid, rnode.dbOid, rnode.relNumber, forknum, lblkno);
+
+	MapInflightDecode(forknum, lblkno, &map_blkno, &entry_idx);
+
+	Assert(map_ctx != NULL);
+	slot_id = MapReadBuffer(map_ctx, rnode, forknum, map_blkno);
+
+	if (!MapInflightBufferTryClaim(rnode, forknum, map_blkno, slot_id,
+								   entry_idx))
+	{
+		MapUnpinBuffer(slot_id);
+		return false;
+	}
+
+	barrier->valid = true;
+	barrier->slot_id = slot_id;
+	barrier->entry_idx = entry_idx;
+
+	return true;
+}
+
+void
+MapInflightReleaseBarrier(MapInflightBarrier *barrier)
+{
+	if (barrier == NULL || !barrier->valid)
+		return;
+
+	MapInflightBufferRelease(barrier->slot_id, barrier->entry_idx);
+	barrier->valid = false;
+	barrier->slot_id = -1;
+	barrier->entry_idx = -1;
+}
+
+bool
+MapInflightTryClaim(UmbraFileContext *map_ctx, RelFileLocator rnode,
+					ForkNumber forknum, BlockNumber lblkno)
+{
+	BlockNumber	map_blkno;
+	int			entry_idx;
+	int			slot_id;
+
+	if (MapInflightLocalFind(rnode, forknum, lblkno) >= 0)
+		return false;
+
+	/*
+	 * Ensure backend-local storage before publishing any shared in-flight
+	 * state, so the post-claim path cannot throw due to allocation.
+	 */
+	MapInflightLocalEnsureCapacity(MapInflightLocalCount + 1);
+	MapInflightDecode(forknum, lblkno, &map_blkno, &entry_idx);
+
+	Assert(map_ctx != NULL);
+	slot_id = MapReadBuffer(map_ctx, rnode, forknum, map_blkno);
+
+	if (!MapInflightBufferTryClaim(rnode, forknum, map_blkno, slot_id,
+								   entry_idx))
+	{
+		MapUnpinBuffer(slot_id);
+		return false;
+	}
+
+	MapInflightLocalRememberPrepared(rnode, forknum, lblkno,
+									 InvalidBlockNumber,
+									 slot_id, entry_idx);
+
+	return true;
+}
+
+void
+MapInflightFinishClaim(RelFileLocator rnode, ForkNumber forknum,
+					   BlockNumber lblkno, BlockNumber pblkno)
+{
+	int			local_idx;
+
+	Assert(pblkno != InvalidBlockNumber);
+
+	local_idx = MapInflightLocalFind(rnode, forknum, lblkno);
+	if (local_idx < 0)
+		elog(ERROR,
+			 "in-flight remap claim disappeared for relation %u/%u/%u fork %d block %u",
+			 rnode.spcOid, rnode.dbOid, rnode.relNumber, forknum, lblkno);
+
+	MapInflightLocalEntries[local_idx].pblkno = pblkno;
+}
+
+void
+MapInflightRelease(RelFileLocator rnode, ForkNumber forknum,
+				   BlockNumber lblkno)
+{
+	int			idx;
+	int			slot_id;
+	int			entry_idx;
+
+	idx = MapInflightLocalFind(rnode, forknum, lblkno);
+	if (idx < 0)
+		return;
+
+	slot_id = MapInflightLocalEntries[idx].slot_id;
+	entry_idx = MapInflightLocalEntries[idx].entry_idx;
+	MapInflightLocalForget(rnode, forknum, lblkno);
+	MapInflightBufferRelease(slot_id, entry_idx);
+}
+
+bool
+MapInflightBitIsSet(RelFileLocator rnode, ForkNumber forknum,
+					BlockNumber lblkno)
+{
+	BlockNumber	map_blkno;
+	int			entry_idx;
+	int			slot_id;
+	MapBufferDesc *buf;
+	int			word_idx;
+	uint64		mask;
+	bool		exists = false;
+
+	MapInflightDecode(forknum, lblkno, &map_blkno, &entry_idx);
+	slot_id = MapCacheLookup(rnode, forknum, map_blkno);
+	if (slot_id < 0)
+		return false;
+
+	buf = &MapBuffers[slot_id];
+	word_idx = entry_idx / MAP_PENDING_BITS_PER_WORD;
+	mask = MapInflightEntryMask(entry_idx);
+
+	MapPinBuffer(slot_id, false);
+	LWLockAcquire(&buf->buffer_lock, LW_SHARED);
+	if (buf->page_number == map_blkno &&
+		buf->forknum == forknum &&
+		RelFileLocatorEquals(buf->rnode, rnode))
+		exists = (buf->pending_bits[word_idx] & mask) != 0;
+	LWLockRelease(&buf->buffer_lock);
+	MapUnpinBuffer(slot_id);
+
+	return exists;
+}
+
+void
+MapInflightCleanupOwned(void)
+{
+	while (MapInflightLocalCount > 0)
+	{
+		MapInflightLocalEntry *entry = &MapInflightLocalEntries[0];
+
+		MapInflightRelease(entry->rnode, entry->forknum, entry->lblkno);
+	}
+}
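The shared side of mapinflight.c above comes down to a per-buffer bitmap plus a counter: a claim sets one bit and increments pending_count, a release clears the bit and decrements it, and the mapclock.c and mapbuf.c hunks refuse to evict a buffer while pending_count is nonzero. Below is a single-threaded sketch of that bookkeeping. The buffer_lock and buffer pin handling are deliberately omitted, the page size of 2048 entries is an arbitrary assumption, and every Demo*/demo_* name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ENTRIES_PER_PAGE 2048	/* assumed value, for illustration only */
#define DEMO_BITS_PER_WORD 64
#define DEMO_WORDS (DEMO_ENTRIES_PER_PAGE / DEMO_BITS_PER_WORD)

/* Hypothetical stand-in for the pending fields added to MapBufferDesc. */
typedef struct DemoMapBuffer
{
	uint64_t	pending_bits[DEMO_WORDS];
	int			pending_count;
} DemoMapBuffer;

/* Start with no in-flight entries. */
void
demo_init(DemoMapBuffer *buf)
{
	memset(buf, 0, sizeof(*buf));
}

/* Bit within its 64-bit word for a given entry index. */
uint64_t
demo_entry_mask(int entry_idx)
{
	return UINT64_C(1) << (entry_idx % DEMO_BITS_PER_WORD);
}

/* Claim the entry; returns false if some owner already holds it. */
bool
demo_try_claim(DemoMapBuffer *buf, int entry_idx)
{
	int			word_idx = entry_idx / DEMO_BITS_PER_WORD;
	uint64_t	mask = demo_entry_mask(entry_idx);

	if ((buf->pending_bits[word_idx] & mask) != 0)
		return false;

	buf->pending_bits[word_idx] |= mask;
	buf->pending_count++;
	return true;
}

/* Release a previously claimed entry. */
void
demo_release(DemoMapBuffer *buf, int entry_idx)
{
	int			word_idx = entry_idx / DEMO_BITS_PER_WORD;
	uint64_t	mask = demo_entry_mask(entry_idx);

	if ((buf->pending_bits[word_idx] & mask) != 0)
	{
		buf->pending_bits[word_idx] &= ~mask;
		buf->pending_count--;
	}
}

/* A buffer may be recycled only when nothing on it is in flight. */
bool
demo_evictable(const DemoMapBuffer *buf)
{
	return buf->pending_count == 0;
}
```

Entries 6 and 70 share the same bit position but live in different words, so both can be claimed at once, and the buffer only becomes evictable again after both are released.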
diff --git a/src/backend/storage/map/mapinit.c b/src/backend/storage/map/mapinit.c
index a0880113ed..c9ddd12ff0 100644
--- a/src/backend/storage/map/mapinit.c
+++ b/src/backend/storage/map/mapinit.c
@@ -66,7 +66,9 @@ MapBackendInit(void)
 		return;

 	MapRefreshBufferSlots();
-	MapEnsurePrivateRefCount();	initialized = true;
+	MapEnsurePrivateRefCount();
+	MapInflightBackendInit();
+	initialized = true;
 }

 static void
@@ -121,6 +123,9 @@ MapShmemInit(void *arg)
 		buf->forknum = InvalidForkNumber;
 		buf->page_number = -1;
 		buf->page_lsn = 0;
+		buf->pending_count = 0;
+		MemSet(buf->pending_bits, 0, sizeof(buf->pending_bits));
+
 		LWLockInitialize(&buf->buffer_lock, LWTRANCHE_MAP_BUFFER_CONTENT);
 		LWLockInitialize(&buf->io_in_progress_lock, LWTRANCHE_MAP_BUFFER_CONTENT);
 	}
diff --git a/src/backend/storage/map/mapsuper.c b/src/backend/storage/map/mapsuper.c
index cf8bde182e..ad4a6f6bdb 100644
--- a/src/backend/storage/map/mapsuper.c
+++ b/src/backend/storage/map/mapsuper.c
@@ -838,7 +838,6 @@ MapSuperSetExtendingTarget(MapSuperEntry *entry, ForkNumber forknum,

-
 static bool
 MapSuperPrepareEntryForUpdate(UmbraFileContext *map_ctx, RelFileLocator rnode,
 							  XLogRecPtr map_lsn, const char *missing_errmsg,
@@ -871,12 +870,14 @@ MapSuperPrepareEntryForUpdate(UmbraFileContext *map_ctx, RelFileLocator rnode,
 				entry->super = disk_super;
 				entry->page_lsn = MapSuperblockGetLastUpdatedLSN(&disk_super);
 				entry->flags = MAPSUPER_FLAG_VALID;
+				MapSuperResetReservedNextFrees(entry);
 			}
 			else
 			{
 				MapSuperblockInit(&entry->super, 0);
 				entry->page_lsn = InvalidXLogRecPtr;
 				entry->flags = MAPSUPER_FLAG_VALID | MAPSUPER_FLAG_CORRUPT;
+				MapSuperResetReservedNextFrees(entry);
 			}
 		}
 	}
@@ -897,8 +898,21 @@ MapSuperPrepareEntryForUpdate(UmbraFileContext *map_ctx, RelFileLocator rnode,
 		 */
 		MapSuperblockInit(&entry->super, 0);
 		entry->flags = MAPSUPER_FLAG_VALID;
+		MapSuperResetReservedNextFrees(entry);
 	}

+	Assert(MapNormalizeForkBlockCount(MAIN_FORKNUM,
+									  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																		MAIN_FORKNUM)) <=
+		   MapSuperGetReservedNextFree(entry, MAIN_FORKNUM));
+	Assert(MapNormalizeForkBlockCount(FSM_FORKNUM,
+									  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																		FSM_FORKNUM)) <=
+		   MapSuperGetReservedNextFree(entry, FSM_FORKNUM));
+	Assert(MapNormalizeForkBlockCount(VISIBILITYMAP_FORKNUM,
+									  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																		VISIBILITYMAP_FORKNUM)) <=
+		   MapSuperGetReservedNextFree(entry, VISIBILITYMAP_FORKNUM));
 	*entry_p = entry;
 	return true;
 }
@@ -1001,11 +1015,15 @@ MapSBlockBumpPhysicalState(UmbraFileContext *map_ctx, RelFileLocator rnode,
 	current_capacity = MapSuperblockGetPhysCapacity(&entry->super, forknum);
 	current_next = MapNormalizeForkBlockCount(forknum, current_next);
 	current_capacity = MapNormalizeForkBlockCount(forknum, current_capacity);
+	Assert(MapNormalizeForkBlockCount(forknum,
+									  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																		forknum)) <=
+		   MapSuperGetReservedNextFree(entry, forknum));

 	if (bump_next_free && current_next < nblocks)
 	{
 		MapSuperblockSetNextFreePhysBlock(&entry->super, forknum, nblocks);
-		if (InRecovery)
+		MapSuperMaybeBumpReservedNextFree(entry, forknum, nblocks);
 		changed = true;
 	}
 	if (bump_capacity && current_capacity < nblocks)
@@ -1028,6 +1046,10 @@ MapSBlockBumpPhysicalState(UmbraFileContext *map_ctx, RelFileLocator rnode,
 		entry->flags |= MAPSUPER_FLAG_DIRTY;
 	}

+	Assert(MapNormalizeForkBlockCount(forknum,
+									  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																		forknum)) <=
+		   MapSuperGetReservedNextFree(entry, forknum));
 	LWLockRelease(&entry->lock);
 }

diff --git a/src/backend/storage/map/meson.build b/src/backend/storage/map/meson.build
index 8747f0b714..bdaa0dd14a 100644
--- a/src/backend/storage/map/meson.build
+++ b/src/backend/storage/map/meson.build
@@ -6,5 +6,6 @@ backend_sources += files(
   'mapbuf.c',
   'mapflush.c',
   'mapclock.c',
+  'mapinflight.c',
   'mapsuper.c',
 )
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index dee29037b1..ffc4bb83e9 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1034,6 +1034,7 @@ mdstartreadv(PgAioHandle *ioh,
 							 reln,
 							 forknum,
 							 blocknum,
+							 blocknum,
 							 nblocks,
 							 false);
 	pgaio_io_register_callbacks(ioh, PGAIO_HCB_MD_READV, 0);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 631d09d4b4..1e3e0b08f8 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -4,9 +4,8 @@
  *	  public interface routines to storage manager switch.
  *
  * All file system operations on relations dispatch through these routines.
- * An SMgrRelation represents storage-manager state for a relation.  The
- * selected storage manager implementation owns any implementation-specific
- * state needed to service those operations.
+ * An SMgrRelation represents physical on-disk relation files that are open
+ * for reading and writing.
  *
  * When a relation is first accessed through the relation cache, the
  * corresponding SMgrRelation entry is opened by calling smgropen(), and the
@@ -93,6 +92,7 @@ typedef struct f_smgr
 {
 	void		(*smgr_init) (void);	/* may be NULL */
 	void		(*smgr_shutdown) (void);	/* may be NULL */
+	void		(*smgr_before_shmem_exit_cleanup) (void);	/* may be NULL */
 	void		(*smgr_open) (SMgrRelation reln);
 	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_destroy) (SMgrRelation reln);	/* may be NULL */
@@ -123,10 +123,15 @@ typedef struct f_smgr
 	void		(*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
 								   BlockNumber blocknum, BlockNumber nblocks);
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_pretruncate) (SMgrRelation reln, ForkNumber forknum,
+									 BlockNumber old_blocks, BlockNumber nblocks,
+									 XLogRecPtr truncate_lsn);
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber old_blocks, BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
+	int			(*smgr_fd) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
+	bool		(*smgr_is_internal_fork) (ForkNumber forknum);
 	void		(*smgr_create_relation_metadata) (SMgrRelation reln);
 	void		(*smgr_copy_relation_metadata) (SMgrRelation src,
 												SMgrRelation dst,
@@ -134,61 +139,30 @@ typedef struct f_smgr
 	void		(*smgr_sync_relation_metadata) (SMgrRelation reln);
 	void		(*smgr_unlink_relation_metadata) (RelFileLocatorBackend rlocator,
 												  bool isRedo);
+	void		(*smgr_setmapstate) (SMgrRelation reln, uint8 map_state);
 	bool		(*smgr_createdb_allows_wal_log) (void);
+	void		(*smgr_init_new_relation) (SMgrRelation reln, bool needs_wal);
+	void		(*smgr_redo_create_fork) (SMgrRelation reln, ForkNumber forknum,
+										  XLogRecPtr lsn);
 	void		(*smgr_checkpoint_database_tablespaces) (Oid dbid,
 														 int ntablespaces,
 														 const Oid *tablespace_ids);
 	void		(*smgr_invalidate_database_tablespaces) (Oid dbid,
 														 int ntablespaces,
 														 const Oid *tablespace_ids);
-	int			(*smgr_fd) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
+	void		(*smgr_mark_skip_wal_pending) (SMgrRelation reln);
+	void		(*smgr_clear_skip_wal_pending) (SMgrRelation reln);
+	bool		(*smgr_prepare_pendingsync) (SMgrRelation reln);
+	bool		(*smgr_needs_recovery_fsm_vacuum) (SMgrRelation reln);
 } f_smgr;

-#define SMGR_MD		0
-#ifdef USE_UMBRA
-#define SMGR_UMBRA	1
-#define SMGR_DEFAULT	SMGR_UMBRA
-#else
-#define SMGR_DEFAULT	SMGR_MD
-#endif
-
 static const f_smgr smgrsw[] = {
-	/* magnetic disk */
-	{
-		.smgr_init = mdinit,
-		.smgr_shutdown = NULL,
-		.smgr_open = mdopen,
-		.smgr_close = mdclose,
-		.smgr_destroy = NULL,
-		.smgr_create = mdcreate,
-		.smgr_exists = mdexists,
-		.smgr_unlink = mdunlink,
-		.smgr_extend = mdextend,
-		.smgr_zeroextend = mdzeroextend,
-		.smgr_prefetch = mdprefetch,
-		.smgr_maxcombine = mdmaxcombine,
-		.smgr_readv = mdreadv,
-		.smgr_startreadv = mdstartreadv,
-		.smgr_writev = mdwritev,
-		.smgr_writeback = mdwriteback,
-		.smgr_nblocks = mdnblocks,
-		.smgr_truncate = mdtruncate,
-		.smgr_immedsync = mdimmedsync,
-		.smgr_registersync = mdregistersync,
-		.smgr_create_relation_metadata = NULL,
-		.smgr_copy_relation_metadata = NULL,
-		.smgr_sync_relation_metadata = NULL,
-		.smgr_unlink_relation_metadata = NULL,
-		.smgr_createdb_allows_wal_log = NULL,
-		.smgr_checkpoint_database_tablespaces = NULL,
-		.smgr_invalidate_database_tablespaces = NULL,
-		.smgr_fd = mdfd,
-	},
 #ifdef USE_UMBRA
 	/* Umbra storage manager */
 	{
 		.smgr_init = uminit,
 		.smgr_shutdown = NULL,
+		.smgr_before_shmem_exit_cleanup = umbeforeshmemexitcleanup,
 		.smgr_open = umopen,
 		.smgr_close = umclose,
 		.smgr_destroy = umdestroy,
@@ -204,18 +178,64 @@ static const f_smgr smgrsw[] = {
 		.smgr_writev = umwritev,
 		.smgr_writeback = umwriteback,
 		.smgr_nblocks = umnblocks,
+		.smgr_pretruncate = umpretruncate,
 		.smgr_truncate = umtruncate,
 		.smgr_immedsync = umimmedsync,
 		.smgr_registersync = umregistersync,
+		.smgr_fd = umfd,
+		.smgr_is_internal_fork = umisinternalfork,
 		.smgr_create_relation_metadata = umcreaterelationmetadata,
 		.smgr_copy_relation_metadata = umcopyrelationmetadata,
 		.smgr_sync_relation_metadata = umsyncrelationmetadata,
 		.smgr_unlink_relation_metadata = umunlinkrelationmetadata,
+		.smgr_setmapstate = umsetmapstate,
 		.smgr_createdb_allows_wal_log = umcreatedballowswallog,
+		.smgr_init_new_relation = uminitnewrelation,
+		.smgr_redo_create_fork = umredocreatefork,
 		.smgr_checkpoint_database_tablespaces = umcheckpointdatabasetablespaces,
 		.smgr_invalidate_database_tablespaces = uminvalidatedatabasetablespaces,
-		.smgr_fd = umfd,
+		.smgr_mark_skip_wal_pending = ummarkskipwalpending,
+		.smgr_clear_skip_wal_pending = umclearskipwalpending,
+		.smgr_prepare_pendingsync = umpreparependingsync,
+		.smgr_needs_recovery_fsm_vacuum = umneedsrecoveryfsmvacuum,
 	},
+#else
+	/* magnetic disk */
+	{
+		.smgr_init = mdinit,
+		.smgr_shutdown = NULL,
+		.smgr_before_shmem_exit_cleanup = NULL,
+		.smgr_open = mdopen,
+		.smgr_close = mdclose,
+		.smgr_destroy = NULL,
+		.smgr_create = mdcreate,
+		.smgr_exists = mdexists,
+		.smgr_unlink = mdunlink,
+		.smgr_extend = mdextend,
+		.smgr_zeroextend = mdzeroextend,
+		.smgr_prefetch = mdprefetch,
+		.smgr_maxcombine = mdmaxcombine,
+		.smgr_readv = mdreadv,
+		.smgr_startreadv = mdstartreadv,
+		.smgr_writev = mdwritev,
+		.smgr_writeback = mdwriteback,
+		.smgr_nblocks = mdnblocks,
+		.smgr_pretruncate = NULL,
+		.smgr_truncate = mdtruncate,
+		.smgr_immedsync = mdimmedsync,
+		.smgr_registersync = mdregistersync,
+		.smgr_fd = mdfd,
+		.smgr_is_internal_fork = NULL,
+		.smgr_create_relation_metadata = NULL,
+		.smgr_copy_relation_metadata = NULL,
+		.smgr_sync_relation_metadata = NULL,
+		.smgr_unlink_relation_metadata = NULL,
+		.smgr_setmapstate = NULL,
+		.smgr_mark_skip_wal_pending = NULL,
+		.smgr_clear_skip_wal_pending = NULL,
+		.smgr_prepare_pendingsync = NULL,
+		.smgr_needs_recovery_fsm_vacuum = NULL,
+	}
 #endif
 };

@@ -231,6 +251,7 @@ static dlist_head unpinned_relns;

 /* local function prototypes */
 static void smgrshutdown(int code, Datum arg);
+static void smgrbeforeshmemexit(int code, Datum arg);
 static void smgrdestroy(SMgrRelation reln);

 static void smgr_aio_reopen(PgAioHandle *ioh);
@@ -243,6 +264,15 @@ const PgAioTargetInfo aio_smgr_target_info = {
 	.describe_identity = smgr_aio_describe_identity,
 };

+bool
+smgrisinternalfork(ForkNumber forknum)
+{
+	if (smgrsw[0].smgr_is_internal_fork == NULL)
+		return false;
+
+	return smgrsw[0].smgr_is_internal_fork(forknum);
+}
+

 /*
  * smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -271,6 +301,21 @@ smgrinit(void)
 	on_proc_exit(smgrshutdown, 0);
 }

+void
+smgrregistershutdowncleanup(void)
+{
+	static bool registered = false;
+
+	if (registered)
+		return;
+
+	if (smgrsw[0].smgr_before_shmem_exit_cleanup == NULL)
+		return;
+
+	before_shmem_exit(smgrbeforeshmemexit, 0);
+	registered = true;
+}
+
 /*
  * on_proc_exit hook for smgr cleanup during backend shutdown
  */
@@ -290,6 +335,15 @@ smgrshutdown(int code, Datum arg)
 	RESUME_INTERRUPTS();
 }

+static void
+smgrbeforeshmemexit(int code, Datum arg)
+{
+	(void) code;
+	(void) arg;
+
+	smgrsw[0].smgr_before_shmem_exit_cleanup();
+}
+
 /*
  * smgropen() -- Return an SMgrRelation object, creating it if need be.
  *
@@ -341,7 +395,7 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
 		reln->smgr_targblock = InvalidBlockNumber;
 		for (int i = 0; i <= MAX_FORKNUM; ++i)
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
-		reln->smgr_which = SMGR_DEFAULT;
+		reln->smgr_which = 0;	/* smgrsw[] has exactly one entry (Umbra or md.c) */
 		reln->smgr_private = NULL;

 		/* it is not pinned yet */
@@ -582,37 +636,56 @@ smgrsyncrelationmetadata(SMgrRelation reln)
 void
 smgrunlinkrelationmetadata(RelFileLocatorBackend rlocator, bool isRedo)
 {
-	if (smgrsw[SMGR_DEFAULT].smgr_unlink_relation_metadata)
-		smgrsw[SMGR_DEFAULT].smgr_unlink_relation_metadata(rlocator, isRedo);
+	if (smgrsw[0].smgr_unlink_relation_metadata)
+		smgrsw[0].smgr_unlink_relation_metadata(rlocator, isRedo);
+}
+
+void
+smgrsetmapstate(SMgrRelation reln, uint8 map_state)
+{
+	if (smgrsw[reln->smgr_which].smgr_setmapstate)
+		smgrsw[reln->smgr_which].smgr_setmapstate(reln, map_state);
 }

 bool
 smgrcreatedballowswallog(void)
 {
-	if (smgrsw[SMGR_DEFAULT].smgr_createdb_allows_wal_log)
-		return smgrsw[SMGR_DEFAULT].smgr_createdb_allows_wal_log();
+	if (smgrsw[0].smgr_createdb_allows_wal_log)
+		return smgrsw[0].smgr_createdb_allows_wal_log();

 	return true;
 }

+void
+smgrinitnewrelation(SMgrRelation reln, bool needs_wal)
+{
+	if (smgrsw[reln->smgr_which].smgr_init_new_relation)
+		smgrsw[reln->smgr_which].smgr_init_new_relation(reln, needs_wal);
+}
+
+void
+smgrredocreatefork(SMgrRelation reln, ForkNumber forknum, XLogRecPtr lsn)
+{
+	if (smgrsw[reln->smgr_which].smgr_redo_create_fork)
+		smgrsw[reln->smgr_which].smgr_redo_create_fork(reln, forknum, lsn);
+}
+
 void
 smgrcheckpointdatabasetablespaces(Oid dbid, int ntablespaces,
 								  const Oid *tablespace_ids)
 {
-	if (smgrsw[SMGR_DEFAULT].smgr_checkpoint_database_tablespaces)
-		smgrsw[SMGR_DEFAULT].smgr_checkpoint_database_tablespaces(dbid,
-																  ntablespaces,
-																  tablespace_ids);
+	if (smgrsw[0].smgr_checkpoint_database_tablespaces)
+		smgrsw[0].smgr_checkpoint_database_tablespaces(dbid, ntablespaces,
+													   tablespace_ids);
 }

 void
 smgrinvalidatedatabasetablespaces(Oid dbid, int ntablespaces,
 								  const Oid *tablespace_ids)
 {
-	if (smgrsw[SMGR_DEFAULT].smgr_invalidate_database_tablespaces)
-		smgrsw[SMGR_DEFAULT].smgr_invalidate_database_tablespaces(dbid,
-																  ntablespaces,
-																  tablespace_ids);
+	if (smgrsw[0].smgr_invalidate_database_tablespaces)
+		smgrsw[0].smgr_invalidate_database_tablespaces(dbid, ntablespaces,
+													   tablespace_ids);
 }

 void
@@ -620,6 +693,50 @@ smgrinvalidatedatabase(Oid dbid)
 {
 	smgrinvalidatedatabasetablespaces(dbid, 0, NULL);
 }
+
+
+
+
+
+
+void
+smgrmarkskipwalpending(RelFileLocator rlocator)
+{
+	SMgrRelation reln;
+
+	reln = smgropen(rlocator, INVALID_PROC_NUMBER);
+	if (smgrsw[reln->smgr_which].smgr_mark_skip_wal_pending)
+		smgrsw[reln->smgr_which].smgr_mark_skip_wal_pending(reln);
+}
+
+void
+smgrclearskipwalpending(RelFileLocator rlocator)
+{
+	SMgrRelation reln;
+
+	reln = smgropen(rlocator, INVALID_PROC_NUMBER);
+	if (smgrsw[reln->smgr_which].smgr_clear_skip_wal_pending)
+		smgrsw[reln->smgr_which].smgr_clear_skip_wal_pending(reln);
+}
+
+bool
+smgrpreparependingsync(SMgrRelation reln)
+{
+	if (smgrsw[reln->smgr_which].smgr_prepare_pendingsync)
+		return smgrsw[reln->smgr_which].smgr_prepare_pendingsync(reln);
+
+	return false;
+}
+
+bool
+smgrneedsrecoveryfsmvacuum(SMgrRelation reln)
+{
+	if (smgrsw[reln->smgr_which].smgr_needs_recovery_fsm_vacuum)
+		return smgrsw[reln->smgr_which].smgr_needs_recovery_fsm_vacuum(reln);
+
+	return true;
+}
+
 /*
  * smgrdosyncall() -- Immediately sync all forks of all given relations
  *
@@ -651,6 +768,9 @@ smgrdosyncall(SMgrRelation *rels, int nrels)

 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 		{
+			if (smgrisinternalfork(forknum))
+				continue;
+
 			if (smgrsw[which].smgr_exists(rels[i], forknum))
 				smgrsw[which].smgr_immedsync(rels[i], forknum);
 		}
@@ -708,7 +828,12 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)

 		/* Close the forks at smgr level */
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+		{
+			if (smgrisinternalfork(forknum))
+				continue;
+
 			smgrsw[which].smgr_close(rels[i], forknum);
+		}
 	}

 	/*
@@ -735,7 +860,12 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
 		int			which = rels[i]->smgr_which;

 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+		{
+			if (smgrisinternalfork(forknum))
+				continue;
+
 			smgrsw[which].smgr_unlink(rlocators[i], forknum, isRedo);
+		}

 		smgrunlinkrelationmetadata(rlocators[i], isRedo);
 	}
@@ -996,6 +1126,45 @@ smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
 	return InvalidBlockNumber;
 }

+void
+smgrbumpcachednblocks(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
+{
+	if (InRecovery &&
+		reln->smgr_cached_nblocks[forknum] == InvalidBlockNumber)
+	{
+		BlockNumber authoritative;
+
+		HOLD_INTERRUPTS();
+		authoritative = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
+		RESUME_INTERRUPTS();
+
+		if (authoritative > nblocks)
+			nblocks = authoritative;
+	}
+
+	if (reln->smgr_cached_nblocks[forknum] == InvalidBlockNumber ||
+		reln->smgr_cached_nblocks[forknum] < nblocks)
+		reln->smgr_cached_nblocks[forknum] = nblocks;
+}
+
+void
+smgrpretruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
+				BlockNumber *old_nblocks, BlockNumber *nblocks,
+				XLogRecPtr truncate_lsn)
+{
+	int			i;
+
+	if (smgrsw[reln->smgr_which].smgr_pretruncate == NULL)
+		return;
+
+	for (i = 0; i < nforks; i++)
+	{
+		smgrsw[reln->smgr_which].smgr_pretruncate(reln, forknum[i],
+												  old_nblocks[i], nblocks[i],
+												  truncate_lsn);
+	}
+}
+
 /*
  * smgrtruncate() -- Truncate the given forks of supplied relation to
  *					 each specified numbers of blocks
@@ -1177,7 +1346,8 @@ void
 pgaio_io_set_target_smgr(PgAioHandle *ioh,
 						 SMgrRelationData *smgr,
 						 ForkNumber forknum,
-						 BlockNumber blocknum,
+						 BlockNumber logical_blocknum,
+						 BlockNumber physical_blocknum,
 						 int nblocks,
 						 bool skip_fsync)
 {
@@ -1188,7 +1358,8 @@ pgaio_io_set_target_smgr(PgAioHandle *ioh,
 	/* backend is implied via IO owner */
 	sd->smgr.rlocator = smgr->smgr_rlocator.locator;
 	sd->smgr.forkNum = forknum;
-	sd->smgr.blockNum = blocknum;
+	sd->smgr.blockNum = logical_blocknum;
+	sd->smgr.physBlockNum = physical_blocknum;
 	sd->smgr.nblocks = nblocks;
 	sd->smgr.is_temp = SmgrIsTemp(smgr);
 	/* Temp relations should never be fsync'd */
@@ -1226,11 +1397,13 @@ smgr_aio_reopen(PgAioHandle *ioh)
 			pg_unreachable();
 			break;
 		case PGAIO_OP_READV:
-			od->read.fd = smgrfd(reln, sd->smgr.forkNum, sd->smgr.blockNum, &off);
+			od->read.fd = smgrfd(reln, sd->smgr.forkNum,
+								 sd->smgr.physBlockNum, &off);
 			Assert(off == od->read.offset);
 			break;
 		case PGAIO_OP_WRITEV:
-			od->write.fd = smgrfd(reln, sd->smgr.forkNum, sd->smgr.blockNum, &off);
+			od->write.fd = smgrfd(reln, sd->smgr.forkNum,
+								  sd->smgr.physBlockNum, &off);
 			Assert(off == od->write.offset);
 			break;
 	}
diff --git a/src/backend/storage/smgr/umbra.c b/src/backend/storage/smgr/umbra.c
index bbb870ab8e..2baf64defe 100644
--- a/src/backend/storage/smgr/umbra.c
+++ b/src/backend/storage/smgr/umbra.c
@@ -1,83 +1,270 @@
 /*-------------------------------------------------------------------------
  *
  * umbra.c
- *	  Umbra storage manager skeleton.
+ *    Umbra storage manager: MAP translation + physical segment manager.
  *
- * This file establishes Umbra as a separate smgr implementation from md.c. It
- * maintains relation-local metadata and MAP checkpoint/cache state while using
- * md.c for data-fork I/O and umfile for metadata-file I/O.
+ * Umbra implements a separate smgr that translates logical block numbers to
+ * physical block numbers for mapped data forks.
  *
- * src/backend/storage/smgr/umbra.c
+ * The mapping is stored in Umbra's internal metadata fork and cached in
+ * shared memory by the MAP subsystem (src/backend/storage/map/).
+ *
+ * Layering in this file is intentionally split:
+ *   1. access semantics: classify relation/fork access state
+ *   2. mapping facts: consume MAP lookups/logical EOF/frontier facts
+ *   3. execution: issue physical file I/O
+ *
+ * map.c owns facts and metadata storage actions. umbra.c owns runtime
+ * interpretation of those facts for reads, writes, and publication.
+ *
+ * For correctness, create/open establishes a steady-state base policy once:
+ *   - permanent mapped relations: REQUIRE_MAP
+ *   - temp/unlogged/direct relations: BYPASS_MAP
+ *
+ * INIT and MAP forks always use direct physical addressing.
  *
  *-------------------------------------------------------------------------
  */
+
 #include "postgres.h"

+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/pg_class.h"
+#include "catalog/storage.h"
 #include "common/relpath.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
 #include "storage/map.h"
 #include "storage/md.h"
 #include "storage/smgr.h"
-#include "storage/umfile.h"
 #include "storage/umbra.h"
+#include "storage/umfile.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/wait_event.h"

-typedef struct UmbraSmgrRelationState
+typedef struct UmbraAccessState
+{
+	UmbraMapPolicy policy;
+	bool		map_available;
+} UmbraAccessState;
+
+typedef struct UmbraRelationState
+{
+	UmbraFileContext *file_ctx;	/* cached borrow from umfile registry */
+	uint8		map_state;
+} UmbraRelationState;
+
+typedef struct UmbraAccessLookupState
+{
+	bool		have_logical_nblocks;
+	BlockNumber logical_nblocks;
+} UmbraAccessLookupState;
+
+typedef enum UmbraAccessResolveMode
+{
+	UMBRA_ACCESS_RESOLVE_READ,
+	UMBRA_ACCESS_RESOLVE_WRITE,
+	UMBRA_ACCESS_RESOLVE_WRITEBACK
+} UmbraAccessResolveMode;
+
+typedef enum UmbraAccessResolveResult
+{
+	UMBRA_ACCESS_RESOLVED_PBLK,
+	UMBRA_ACCESS_RESOLVED_ZERO,
+	UMBRA_ACCESS_RESOLVED_SKIP
+} UmbraAccessResolveResult;
+
+typedef struct UmbraMappedBirthResult
+{
+	BlockNumber pblkno;
+	bool		mapping_published;
+} UmbraMappedBirthResult;
+
+#define UMBRA_WRITE_BARRIER_WAIT_RETRIES 10000
+#define UMBRA_WRITE_BARRIER_WAIT_USEC 1000
+
+static inline UmbraRelationState *
+um_state_lookup(SMgrRelation reln)
+{
+	return (UmbraRelationState *) reln->smgr_private;
+}
+
+static inline UmbraRelationState *
+um_state_acquire(SMgrRelation reln)
+{
+	UmbraRelationState *state = (UmbraRelationState *) reln->smgr_private;
+
+	if (state == NULL)
+	{
+		state = MemoryContextAllocZero(TopMemoryContext, sizeof(*state));
+		state->file_ctx = umfile_ctx_acquire(reln->smgr_rlocator);
+		state->map_state = UMBRA_MAP_POLICY_UNKNOWN;
+		reln->smgr_private = state;
+	}
+	else if (state->file_ctx == NULL)
+		state->file_ctx = umfile_ctx_acquire(reln->smgr_rlocator);
+
+	return state;
+}
+
+static inline UmbraFileContext *
+um_ctx_acquire(SMgrRelation reln)
 {
-	UmbraFileContext *filectx;
-} UmbraSmgrRelationState;
+	return um_state_acquire(reln)->file_ctx;
+}
+
+static inline UmbraFileContext *
+um_ctx_lookup(SMgrRelation reln)
+{
+	UmbraRelationState *state = um_state_lookup(reln);
+
+	return state != NULL ? state->file_ctx : NULL;
+}
+
+static inline UmbraMapPolicy
+um_map_state_cached(SMgrRelation reln)
+{
+	UmbraRelationState *state = um_state_lookup(reln);
+
+	if (state == NULL)
+		return UMBRA_MAP_POLICY_UNKNOWN;
+
+	return (UmbraMapPolicy) state->map_state;
+}
+
+static inline void
+um_set_cached_map_state(SMgrRelation reln, UmbraMapPolicy map_state)
+{
+	UmbraRelationState *state;
+
+	Assert(map_state != UMBRA_MAP_POLICY_UNKNOWN);
+	state = um_state_acquire(reln);
+	state->map_state = (uint8) map_state;
+}
+
+static inline void
+um_state_destroy(SMgrRelation reln)
+{
+	UmbraRelationState *state = um_state_lookup(reln);
+
+	if (state == NULL)
+		return;

-static bool um_tracks_identity_metadata(ForkNumber forknum);
-static UmbraFileContext *um_relation_filectx(SMgrRelation reln);
-static void um_ensure_redo_metadata(SMgrRelation reln, ForkNumber forknum);
-static void um_identity_update_metadata(SMgrRelation reln, ForkNumber forknum,
-										BlockNumber nblocks, bool fork_exists);
-static void um_refresh_identity_metadata(SMgrRelation reln);
+	reln->smgr_private = NULL;
+	pfree(state);
+}
+
+/* Runtime access semantics. */
+static UmbraMapPolicy um_map_policy_for_access(SMgrRelation reln,
+												   ForkNumber forknum);
+static UmbraAccessState um_classify_access(SMgrRelation reln,
+										   ForkNumber forknum);
+static void um_report_unmapped_map_entry(SMgrRelation reln,
+										 ForkNumber forknum,
+										 const UmbraAccessState *access,
+										 BlockNumber lblkno);
+static BlockNumber umnblocks_for_access(SMgrRelation reln, ForkNumber forknum,
+										const UmbraAccessState *access);
+static bool um_lblk_precedes_logical_eof_for_access(SMgrRelation reln,
+													ForkNumber forknum,
+													const UmbraAccessState *access,
+													UmbraAccessLookupState *lookup_state,
+													BlockNumber lblkno);
+static bool um_is_logical_unmaterialized_for_access(SMgrRelation reln,
+													ForkNumber forknum,
+													const UmbraAccessState *access,
+													BlockNumber lblkno);
+static UmbraAccessResolveResult um_resolve_lblk_for_access(SMgrRelation reln,
+														   ForkNumber forknum,
+														   const UmbraAccessState *access,
+														   UmbraAccessLookupState *lookup_state,
+														   BlockNumber lblkno,
+														   UmbraAccessResolveMode mode,
+														   BlockNumber *pblkno);
+static BlockNumber um_resolve_mapped_read_run(SMgrRelation reln,
+											  ForkNumber forknum,
+											  const UmbraAccessState *access,
+											  UmbraAccessLookupState *lookup_state,
+											  BlockNumber blocknum,
+											  BlockNumber maxblocks,
+											  BlockNumber *start_pblk);
+static void um_complete_zero_readv(PgAioHandle *ioh, SMgrRelation reln,
+								   ForkNumber forknum, BlockNumber blocknum,
+								   void *buffer);
+static bool um_is_stale_post_truncate_lblk_for_access(SMgrRelation reln,
+													  ForkNumber forknum,
+													  const UmbraAccessState *access,
+													  UmbraAccessLookupState *lookup_state,
+													  BlockNumber lblkno);
+static bool um_is_stale_post_truncate_lblk_with_eof(ForkNumber forknum,
+													BlockNumber logical_nblocks,
+													BlockNumber lblkno);
+
+/* MAP facts and storage actions consumed by Umbra semantics. */
+static void um_ensure_datafork_batch_ready_for_access(SMgrRelation reln,
+													  ForkNumber forknum,
+													  const UmbraAccessState *access,
+													  BlockNumber pblkno,
+													  bool skipFsync);
+static void um_reserve_fresh_pblkno_for_access(SMgrRelation reln,
+											   ForkNumber forknum,
+											   const UmbraAccessState *access,
+											   BlockNumber lblkno,
+											   BlockNumber *new_pblkno);
+static bool um_fork_uses_map_translation(ForkNumber forknum);
+static bool um_mapped_exists_from_super(SMgrRelation reln, ForkNumber forknum);
+static UmbraMapPolicy um_open_map_state(SMgrRelation reln);
+static bool um_state_uses_map(UmbraMapPolicy state);
+static bool um_state_requires_durable_sync(UmbraMapPolicy state);
+static bool um_relation_requires_durable_sync(SMgrRelation reln);
+static UmbraMappedBirthResult um_publish_mapped_birth(SMgrRelation reln,
+													  ForkNumber forknum,
+													  const UmbraAccessState *access,
+													  BlockNumber lblkno,
+													  bool allow_wal_owned_firstborn);
 static void um_filetag_path(const FileTag *ftag, char *path);

 bool
 UmMetadataExists(SMgrRelation reln)
 {
-	return umfile_exists(um_relation_filectx(reln),
-						 UMBRA_METADATA_FORKNUM,
+	return umfile_exists(um_ctx_acquire(reln), UMBRA_METADATA_FORKNUM,
 						 UMFILE_EXISTS_DENSE);
 }

 bool
 UmMetadataOpenOrCreate(SMgrRelation reln, bool isRedo, bool *created)
 {
-	return umfile_open_or_create(um_relation_filectx(reln),
-								 UMBRA_METADATA_FORKNUM,
-								 isRedo,
-								 created);
+	return umfile_open_or_create(um_ctx_acquire(reln), UMBRA_METADATA_FORKNUM,
+								 isRedo, created);
 }

 BlockNumber
 UmMetadataNblocks(SMgrRelation reln)
 {
-	return umfile_nblocks(um_relation_filectx(reln),
-						  UMBRA_METADATA_FORKNUM,
+	return umfile_nblocks(um_ctx_acquire(reln), UMBRA_METADATA_FORKNUM,
 						  UMFILE_NBLOCKS_DENSE);
 }

 void
 UmMetadataRead(SMgrRelation reln, BlockNumber blkno, void *buffer)
 {
-	void	   *buffers[1];
-
-	buffers[0] = buffer;
-	umfile_readv(um_relation_filectx(reln), UMBRA_METADATA_FORKNUM, blkno,
-				 buffers, 1);
+	umfile_readv(um_ctx_acquire(reln), UMBRA_METADATA_FORKNUM, blkno,
+				 &buffer, 1);
 }

 void
 UmMetadataWrite(SMgrRelation reln, BlockNumber blkno, const void *buffer,
 				bool skipFsync)
 {
-	UmbraFileContext *ctx = um_relation_filectx(reln);
+	UmbraFileContext *ctx = um_ctx_acquire(reln);

 	umfile_ctx_write(ctx, UMBRA_METADATA_FORKNUM, blkno,
 					 buffer, BLCKSZ, skipFsync);
@@ -94,7 +281,8 @@ UmMetadataWriteSuperblock(RelFileLocatorBackend rlocator, const void *sector,

 	/*
 	 * Superblock checkpoint flush can run while holding MapSuperEntry->lock,
-	 * so it must not recurse through smgr/umopen.
+	 * so it must not reopen the relation via smgr/umopen and recurse into MAP
+	 * state lookup.
 	 */
 	umfile_ctx_write(ctx, UMBRA_METADATA_FORKNUM, MAP_BLOCK_SUPER,
 					 sector, MAP_SUPERBLOCK_SIZE, skipFsync);
@@ -107,15 +295,20 @@ void
 UmMetadataExtend(SMgrRelation reln, BlockNumber blkno, const void *buffer,
 				 bool skipFsync)
 {
-	umfile_extend(um_relation_filectx(reln), UMBRA_METADATA_FORKNUM, blkno,
+	umfile_extend(um_ctx_acquire(reln), UMBRA_METADATA_FORKNUM, blkno,
 				  buffer, skipFsync);
 }

 void
 UmMetadataImmediateSync(SMgrRelation reln)
 {
-	MapCheckpointRelation(reln->smgr_rlocator.locator);
-	umfile_immedsync(um_relation_filectx(reln), UMBRA_METADATA_FORKNUM);
+	umimmedsync(reln, UMBRA_METADATA_FORKNUM);
+}
+
+void
+UmMetadataRegisterSync(SMgrRelation reln)
+{
+	umimmedsync(reln, UMBRA_METADATA_FORKNUM);
 }

 void
@@ -124,66 +317,139 @@ UmMetadataUnlink(RelFileLocatorBackend rlocator, bool isRedo)
 	umfile_unlink(rlocator, UMBRA_METADATA_FORKNUM, isRedo);
 }

-void
-UmInvalidateDatabase(Oid dbid)
+static void
+um_ensure_datafork_batch_ready_for_access(SMgrRelation reln,
+										  ForkNumber forknum,
+										  const UmbraAccessState *access,
+										  BlockNumber pblkno,
+										  bool skipFsync)
 {
-	FileTag		tag;
-	RelFileLocator rlocator;
+	UmbraFileContext *ctx = um_ctx_acquire(reln);

-	MapInvalidateDatabase(dbid);
+	if (!access->map_available)
+		return;

-	rlocator.spcOid = 0;
-	rlocator.dbOid = dbid;
-	rlocator.relNumber = 0;
+	if (pblkno == InvalidBlockNumber)
+		return;

-	memset(&tag, 0, sizeof(tag));
-	tag.handler = SYNC_HANDLER_UMBRA;
-	tag.rlocator = rlocator;
-	tag.forknum = InvalidForkNumber;
-	tag.segno = InvalidBlockNumber;
+	if (pblkno + 1 == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("physical block overflow for relation %u/%u/%u fork %d",
+						reln->smgr_rlocator.locator.spcOid,
+						reln->smgr_rlocator.locator.dbOid,
+						reln->smgr_rlocator.locator.relNumber,
+						forknum)));

-	RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, true);
+	(void) MapSBlockEnsurePhysicalNblocks(ctx, reln->smgr_rlocator.locator,
+										  forknum, pblkno + 1, skipFsync);
 }

 void
-uminit(void)
+UmApplyReservedRangeRemap(SMgrRelation reln, ForkNumber forknum,
+						  BlockNumber firstblock, BlockNumber nblocks,
+						  const BlockNumber *pblknos,
+						  XLogRecPtr lsn, bool skipFsync)
 {
-	umfile_init();
-	MapBackendInit();
-}
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	BlockNumber max_lblkno;
+	BlockNumber max_pblkno = InvalidBlockNumber;

-void
-umopen(SMgrRelation reln)
-{
-	UmbraSmgrRelationState *state;
+	Assert(nblocks > 0);
+	Assert(pblknos != NULL);
+
+	max_lblkno = firstblock + nblocks - 1;
+	if (max_lblkno < firstblock)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("logical range overflow for relation %u/%u/%u fork %d",
+						reln->smgr_rlocator.locator.spcOid,
+						reln->smgr_rlocator.locator.dbOid,
+						reln->smgr_rlocator.locator.relNumber,
+						forknum)));
+
+	for (BlockNumber i = 0; i < nblocks; i++)
+	{
+		BlockNumber pblk = pblknos[i];
+
+		if (max_pblkno == InvalidBlockNumber || pblk > max_pblkno)
+			max_pblkno = pblk;
+	}
+
+	if (max_pblkno != InvalidBlockNumber)
+	{
+		(void) MapSBlockEnsurePhysicalNblocks(ctx, reln->smgr_rlocator.locator,
+											  forknum, max_pblkno + 1, skipFsync);
+	}

-	Assert(reln->smgr_private == NULL);
+	for (BlockNumber i = 0; i < nblocks; i++)
+	{
+		BlockNumber lblk = firstblock + i;
+		BlockNumber pblk = pblknos[i];

-	state = MemoryContextAllocZero(TopMemoryContext,
-								   sizeof(UmbraSmgrRelationState));
-	state->filectx = umfile_ctx_acquire(reln->smgr_rlocator);
-	reln->smgr_private = state;
+		MapSetMapping(ctx, reln->smgr_rlocator.locator, forknum, lblk, pblk, lsn);
+	}

-	mdopen(reln);
+	if (max_pblkno != InvalidBlockNumber)
+	{
+		MapSBlockBumpPhysicalNblocks(ctx, reln->smgr_rlocator.locator,
+									 forknum, max_pblkno + 1, lsn);
+		for (BlockNumber i = 0; i < nblocks; i++)
+			MapInflightRelease(reln->smgr_rlocator.locator, forknum,
+							   firstblock + i);
+	}
+	MapSBlockBumpLogicalNblocks(ctx, reln->smgr_rlocator.locator,
+								forknum, max_lblkno + 1, lsn);
 }

-void
-umclose(SMgrRelation reln, ForkNumber forknum)
+bool
+umapplyreservedrange(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber firstblock, BlockNumber nblocks,
+					 const BlockNumber *pblknos,
+					 XLogRecPtr lsn, bool skipFsync)
 {
-	mdclose(reln, forknum);
+	UmApplyReservedRangeRemap(reln, forknum, firstblock, nblocks,
+							  pblknos, lsn, skipFsync);
+	return true;
 }

-void
-umdestroy(SMgrRelation reln)
+/*
+ * Create and initialize MAP fork for a relation.
+ *
+ * Keep creation O(1): create/open the MAP fork and write only the superblock
+ * sector. Regular MAP pages are synthesized on first access and written
+ * lazily by the MAP layer.
+ */
+
+static void
+ummapcreate(SMgrRelation reln)
 {
-	UmbraSmgrRelationState *state = reln->smgr_private;
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	bool		newly_created;

-	if (state != NULL)
-	{
-		umfile_ctx_forget(reln->smgr_rlocator);
-		pfree(state);
-		reln->smgr_private = NULL;
-	}
+	/*
+	 * Open existing MAP fork or create new one. During redo, EEXIST is
+	 * acceptable and we reuse the existing file.
+	 */
+	UmMetadataOpenOrCreate(reln, true /* isRedo */, &newly_created);
+
+	/* Existing MAP fork does not need re-initialization. */
+	if (!newly_created)
+		return;
+
+	/*
+	 * Superblock identity is established locally here and reconstructed during
+	 * redo from relation creation, so no separate Umbra rmgr record is needed.
+	 */
+	Assert(um_map_state_cached(reln) != UMBRA_MAP_POLICY_UNKNOWN);
+	MapSBlockInit(ctx, reln->smgr_rlocator.locator, InvalidXLogRecPtr);
+
+	/*
+	 * Keep metadata fork durability aligned with main-fork create semantics:
+	 * the file is created now, while checkpoint/restartpoint owns syncing it.
+	 */
+	if (!SmgrIsTemp(reln))
+		UmMetadataRegisterSync(reln);
 }

 bool
@@ -199,49 +465,60 @@ umcreatedballowswallog(void)
 }

 void
-umcheckpointdatabasetablespaces(Oid dbid, int ntablespaces,
-								const Oid *tablespace_ids)
+uminitnewrelation(SMgrRelation reln, bool needs_wal)
 {
-	MapCheckpointDatabaseTablespaces(dbid, ntablespaces, tablespace_ids);
+	umsetmapstate(reln, needs_wal ?
+				  UMBRA_MAP_POLICY_REQUIRE_MAP :
+				  UMBRA_MAP_POLICY_BYPASS_MAP);
+	if (needs_wal)
+		umcreaterelationmetadata(reln);
 }

 void
-uminvalidatedatabasetablespaces(Oid dbid, int ntablespaces,
-								const Oid *tablespace_ids)
+umcreaterelationmetadata(SMgrRelation reln)
 {
-	MapInvalidateDatabaseTablespaces(dbid, ntablespaces, tablespace_ids);
+	ummapcreate(reln);
 }

 void
-umcreaterelationmetadata(SMgrRelation reln)
+umredocreatefork(SMgrRelation reln, ForkNumber forknum, XLogRecPtr lsn)
 {
-	UmbraFileContext *ctx = um_relation_filectx(reln);
-	bool		created = false;
+	UmbraFileContext *ctx = um_ctx_acquire(reln);

-	/*
-	 * smgrcreaterelationmetadata() is used both in normal create and redo
-	 * paths, so tolerate an already-existing metadata fork here.
-	 */
-	if (!UmMetadataOpenOrCreate(reln, true, &created))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create Umbra metadata fork for relation %u/%u/%u",
-						reln->smgr_rlocator.locator.spcOid,
-						reln->smgr_rlocator.locator.dbOid,
-						reln->smgr_rlocator.locator.relNumber)));
+	if (forknum == MAIN_FORKNUM)
+	{
+		umsetmapstate(reln, UMBRA_MAP_POLICY_REQUIRE_MAP);
+		umcreaterelationmetadata(reln);
+		return;
+	}

-	elog(DEBUG1, "umbra metadata open/create %u/%u/%u created=%s",
-		 reln->smgr_rlocator.locator.spcOid,
-		 reln->smgr_rlocator.locator.dbOid,
-		 reln->smgr_rlocator.locator.relNumber,
-		 created ? "true" : "false");
+	if (!UmbraForkUsesMapTranslation(forknum))
+	{
+		umsetmapstate(reln, UMBRA_MAP_POLICY_BYPASS_MAP);
+		return;
+	}

-	if (created)
-		MapSBlockInit(ctx, reln->smgr_rlocator.locator, InvalidXLogRecPtr);
-	else
-		(void) MapSBlockEnsureLoaded(ctx, reln->smgr_rlocator.locator);
+	umsetmapstate(reln, UMBRA_MAP_POLICY_REQUIRE_MAP);
+	if (!UmMetadataExists(reln))
+		umcreaterelationmetadata(reln);
+
+	if (UmbraForkIsAuxiliaryMapped(forknum))
+		MapSBlockSetLogicalNblocks(ctx, reln->smgr_rlocator.locator,
+								   forknum, 0, lsn);
+}
+
+void
+umcheckpointdatabasetablespaces(Oid dbid, int ntablespaces,
+								const Oid *tablespace_ids)
+{
+	MapCheckpointDatabaseTablespaces(dbid, ntablespaces, tablespace_ids);
+}

-	um_refresh_identity_metadata(reln);
+void
+uminvalidatedatabasetablespaces(Oid dbid, int ntablespaces,
+								const Oid *tablespace_ids)
+{
+	MapInvalidateDatabaseTablespaces(dbid, ntablespaces, tablespace_ids);
 }

 void
@@ -257,7 +534,7 @@ umcopyrelationmetadata(SMgrRelation src, SMgrRelation dst, char relpersistence)
 	if (!UmMetadataExists(src))
 		return;

-	umcreaterelationmetadata(dst);
+	ummapcreate(dst);

 	src_nblocks = UmMetadataNblocks(src);
 	dst_nblocks = UmMetadataNblocks(dst);
@@ -291,333 +568,795 @@ umunlinkrelationmetadata(RelFileLocatorBackend rlocator, bool isRedo)
 }

 void
-umcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+uminit(void)
 {
-	mdcreate(reln, forknum, isRedo);
-
-	/*
-	 * Redo for permanent relation creation reaches smgrcreate() directly, so
-	 * make sure the metadata fork exists before later recovery steps touch the
-	 * relation again.
-	 */
-	if (isRedo &&
-		forknum == MAIN_FORKNUM &&
-		!UmMetadataExists(reln))
-		umcreaterelationmetadata(reln);
-
-	if (forknum != MAIN_FORKNUM &&
-		um_tracks_identity_metadata(forknum) &&
-		UmMetadataExists(reln))
-		um_identity_update_metadata(reln, forknum, 0, true);
+	umfile_init();
+	MapBackendInit();
 }

-bool
-umexists(SMgrRelation reln, ForkNumber forknum)
+void
+umbeforeshmemexitcleanup(void)
 {
-	if (forknum == UMBRA_METADATA_FORKNUM)
-		return UmMetadataExists(reln);
-
-	return mdexists(reln, forknum);
+	MapBackendExitCleanup();
 }

 void
-umunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
+umopen(SMgrRelation reln)
 {
-	if (forknum == UMBRA_METADATA_FORKNUM ||
-		forknum == MAIN_FORKNUM ||
-		forknum == InvalidForkNumber)
-	{
-		MapInvalidateRelation(rlocator.locator);
-	}
+	(void) um_ctx_acquire(reln);

-	if (forknum == UMBRA_METADATA_FORKNUM)
-	{
-		UmMetadataUnlink(rlocator, isRedo);
+	if (um_map_state_cached(reln) != UMBRA_MAP_POLICY_UNKNOWN)
 		return;
-	}
-
-	if (forknum == MAIN_FORKNUM || forknum == InvalidForkNumber)
-		UmMetadataUnlink(rlocator, isRedo);

-	mdunlink(rlocator, forknum, isRedo);
+	um_set_cached_map_state(reln, um_open_map_state(reln));
 }

 void
-umextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 const void *buffer, bool skipFsync)
+umclose(SMgrRelation reln, ForkNumber forknum)
 {
-	um_ensure_redo_metadata(reln, forknum);
-	mdextend(reln, forknum, blocknum, buffer, skipFsync);
+	UmbraFileContext *ctx = um_ctx_lookup(reln);

-	if (um_tracks_identity_metadata(forknum) && UmMetadataExists(reln))
-		um_identity_update_metadata(reln, forknum, blocknum + 1, true);
+	if (ctx != NULL)
+		umfile_ctx_close_fork(ctx, forknum);
 }

 void
-umzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-			 int nblocks, bool skipFsync)
+umdestroy(SMgrRelation reln)
 {
-	BlockNumber	target_nblocks;
-
-	um_ensure_redo_metadata(reln, forknum);
-	mdzeroextend(reln, forknum, blocknum, nblocks, skipFsync);
-
-	if (!um_tracks_identity_metadata(forknum) || !UmMetadataExists(reln))
-		return;
-
-	target_nblocks = blocknum + (BlockNumber) nblocks;
-	if (target_nblocks < blocknum)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("Umbra identity mapping block count overflow")));
-
-	um_identity_update_metadata(reln, forknum, target_nblocks, true);
+	umfile_ctx_forget(reln->smgr_rlocator);
+	um_state_destroy(reln);
 }

-bool
-umprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		   int nblocks)
+void
+umsetmapstate(SMgrRelation reln, uint8 map_state)
 {
-	return mdprefetch(reln, forknum, blocknum, nblocks);
+	Assert(map_state != UMBRA_MAP_POLICY_UNKNOWN);
+	Assert(map_state <= UMBRA_MAP_POLICY_REQUIRE_MAP);
+	um_set_cached_map_state(reln, (UmbraMapPolicy) map_state);
 }

-uint32
-ummaxcombine(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+static bool
+um_fork_uses_map_translation(ForkNumber forknum)
 {
-	return mdmaxcombine(reln, forknum, blocknum);
+	return UmbraForkUsesMapTranslation(forknum);
 }

-void
-umreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		void **buffers, BlockNumber nblocks)
+/*
+ * MAIN fork is the only one that uses page-WAL-owned first-born remap.
+ *
+ * FSM/VM growth is much more structured: their extend/truncate producers are
+ * concentrated in a few helper paths, so we keep them on explicit mapping
+ * publication rather than tying first-born ownership to arbitrary page WAL.
+ */
+static bool
+um_mapped_exists_from_super(SMgrRelation reln, ForkNumber forknum)
 {
-	um_ensure_redo_metadata(reln, forknum);
-	mdreadv(reln, forknum, blocknum, buffers, nblocks);
-}
+	UmbraFileContext *ctx = um_ctx_acquire(reln);

-void
-umstartreadv(PgAioHandle *ioh, SMgrRelation reln, ForkNumber forknum,
-			 BlockNumber blocknum, void **buffers, BlockNumber nblocks)
-{
-	um_ensure_redo_metadata(reln, forknum);
-	mdstartreadv(ioh, reln, forknum, blocknum, buffers, nblocks);
+	Assert(um_fork_uses_map_translation(forknum));
+
+	if (!UmMetadataExists(reln))
+		return false;
+
+	return MapSBlockForkExists(ctx, reln->smgr_rlocator.locator, forknum);
 }

-void
-umwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 const void **buffers, BlockNumber nblocks, bool skipFsync)
+static UmbraMapPolicy
+um_open_map_state(SMgrRelation reln)
 {
-	um_ensure_redo_metadata(reln, forknum);
-	mdwritev(reln, forknum, blocknum, buffers, nblocks, skipFsync);
+	UmbraFileContext *ctx = um_ctx_acquire(reln);

-	if (InRecovery &&
-		um_tracks_identity_metadata(forknum) &&
-		UmMetadataExists(reln))
-		um_identity_update_metadata(reln, forknum, mdnblocks(reln, forknum),
-									true);
+	if (RelFileLocatorBackendIsTemp(reln->smgr_rlocator) ||
+		!UmMetadataExists(reln))
+		return UMBRA_MAP_POLICY_BYPASS_MAP;
+
+	if (MapSBlockIsSkipWalPending(ctx,
+								  reln->smgr_rlocator.locator))
+		return UMBRA_MAP_POLICY_SKIP_WAL_PENDING_MAP;
+
+	return UMBRA_MAP_POLICY_REQUIRE_MAP;
 }

-void
-umwriteback(SMgrRelation reln, ForkNumber forknum,
-			BlockNumber blocknum, BlockNumber nblocks)
+static bool
+um_state_uses_map(UmbraMapPolicy state)
 {
-	mdwriteback(reln, forknum, blocknum, nblocks);
+	return state == UMBRA_MAP_POLICY_REQUIRE_MAP;
 }

-BlockNumber
-umnblocks(SMgrRelation reln, ForkNumber forknum)
+static bool
+um_state_requires_durable_sync(UmbraMapPolicy state)
 {
-	/*
-	 * Keep md.c responsible for physical fork size queries. mdtruncate()
-	 * relies on a preceding mdnblocks() call to have opened active segments.
-	 */
-	return mdnblocks(reln, forknum);
+	return state == UMBRA_MAP_POLICY_REQUIRE_MAP ||
+		state == UMBRA_MAP_POLICY_SKIP_WAL_PENDING_MAP;
 }

-void
-umtruncate(SMgrRelation reln, ForkNumber forknum,
-		   BlockNumber old_blocks, BlockNumber nblocks)
+/*
+ * Commit-time durability is a separate question from ordinary access state.
+ *
+ * log_newpage_range() can describe only physical page images.  Once a relation
+ * either actively uses MAP translation or already owns a MAP fork on disk, its
+ * durable transition must go through the Umbra-aware flush+sync path instead
+ * of plain FPI-range logging.
+ */
+static bool
+um_relation_requires_durable_sync(SMgrRelation reln)
 {
-	mdtruncate(reln, forknum, old_blocks, nblocks);
+	UmbraMapPolicy state;
+
+	state = um_map_state_cached(reln);
+	Assert(state != UMBRA_MAP_POLICY_UNKNOWN);

-	if (um_tracks_identity_metadata(forknum) && UmMetadataExists(reln))
-		um_identity_update_metadata(reln, forknum, nblocks, true);
+	return um_state_requires_durable_sync(state) || UmMetadataExists(reln);
 }

-void
-umimmedsync(SMgrRelation reln, ForkNumber forknum)
+static UmbraMapPolicy
+um_map_policy_for_access(SMgrRelation reln, ForkNumber forknum)
 {
-	mdimmedsync(reln, forknum);
+	if (!um_fork_uses_map_translation(forknum))
+		return UMBRA_MAP_POLICY_BYPASS_MAP;

-	if (um_tracks_identity_metadata(forknum) && UmMetadataExists(reln))
-		UmMetadataImmediateSync(reln);
+	Assert(um_map_state_cached(reln) != UMBRA_MAP_POLICY_UNKNOWN);
+	return um_map_state_cached(reln);
 }

-void
-umregistersync(SMgrRelation reln, ForkNumber forknum)
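+/*
+ * um_classify_access() -- classify how this access may use the MAP fork.
+ *
+ * The cached per-relation policy decides the outcome:
+ * - forks without MAP translation, and relations in BYPASS_MAP or
+ *   SKIP_WAL_PENDING_MAP state, use direct mapping (lblkno == pblkno)
+ * - relations in REQUIRE_MAP state must resolve through the MAP fork;
+ *   callers raise ERROR when a required entry is missing
+ */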
+static UmbraAccessState
+um_classify_access(SMgrRelation reln, ForkNumber forknum)
 {
-	mdregistersync(reln, forknum);
-}
+	UmbraAccessState state;

-int
-umfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
-{
-	return mdfd(reln, forknum, blocknum, off);
+	state.policy = um_map_policy_for_access(reln, forknum);
+	state.map_available = um_state_uses_map(state.policy);
+
+	return state;
 }

-int
-umsyncfiletag(const FileTag *ftag, char *path)
+static void
+um_report_unmapped_map_entry(SMgrRelation reln, ForkNumber forknum,
+							 const UmbraAccessState *access,
+							 BlockNumber lblkno)
 {
-	File		fd;
-	int			ret;
-	int			save_errno;
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	BlockNumber logical_nblocks = InvalidBlockNumber;
+	BlockNumber map_blkno;
+	BlockNumber fork_page_idx;
+	int			entry_idx;
+	uint64		blkno64;

-	um_filetag_path(ftag, path);
+	Assert(access->map_available);

-	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
-	if (fd < 0)
-		return -1;
+	(void) MapSBlockTryGetLogicalNblocks(ctx,
+										 reln->smgr_rlocator.locator,
+										 forknum, &logical_nblocks);

-	ret = FileSync(fd, WAIT_EVENT_DATA_FILE_SYNC);
-	save_errno = errno;
+	fork_page_idx = lblkno / MAP_ENTRIES_PER_PAGE;
+	entry_idx = lblkno % MAP_ENTRIES_PER_PAGE;

-	FileClose(fd);
-	errno = save_errno;
-	return ret;
+	switch (forknum)
+	{
+		case FSM_FORKNUM:
+			blkno64 = (uint64) MAP_BLOCK_FIRST_GROUP +
+				(uint64) fork_page_idx * (uint64) MAP_GROUP_TOTAL_PAGES;
+			break;
+
+		case VISIBILITYMAP_FORKNUM:
+			blkno64 = (uint64) MAP_BLOCK_FIRST_GROUP +
+				(uint64) fork_page_idx * (uint64) MAP_GROUP_TOTAL_PAGES +
+				(uint64) MAP_GROUP_FSM_PAGES;
+			break;
+
+		case MAIN_FORKNUM:
+		{
+			uint64 group_page_idx = (uint64) fork_page_idx;
+			uint64 group_no = group_page_idx / (uint64) MAP_GROUP_MAIN_PAGES;
+
+			blkno64 = (uint64) MAP_BLOCK_FIRST_GROUP +
+				group_no * (uint64) MAP_GROUP_TOTAL_PAGES +
+				(uint64) MAP_GROUP_FSM_PAGES +
+				(uint64) MAP_GROUP_VM_PAGES +
+				(group_page_idx % (uint64) MAP_GROUP_MAIN_PAGES);
+			break;
+		}
+
+		default:
+			elog(ERROR, "unsupported fork number %d in map miss report",
+				 (int) forknum);
+			pg_unreachable();
+	}
+
+	map_blkno = (BlockNumber) blkno64;
+
+	ereport(ERROR,
+			(errcode(ERRCODE_DATA_CORRUPTED),
+			 errmsg("MAP entry is unmapped: rel=%u/%u/%u fork=%d lblk=%u logical_nblocks=%u map_page=%u entry_idx=%d",
+					reln->smgr_rlocator.locator.spcOid,
+					reln->smgr_rlocator.locator.dbOid,
+					reln->smgr_rlocator.locator.relNumber,
+					forknum,
+					lblkno,
+					logical_nblocks,
+					map_blkno,
+					entry_idx)));
 }
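
For anyone checking the MAP page arithmetic in um_report_unmapped_map_entry() above, here is a minimal standalone sketch of the same lblk -> (MAP page, entry) computation. It is not code from the patch: the SK_* constants, SkFork, SkMapPos, and sk_map_pos() are hypothetical names, and the numeric values are assumptions picked only to make the example concrete (the real MAP_ENTRIES_PER_PAGE, MAP_BLOCK_FIRST_GROUP, and MAP_GROUP_* values are not visible in this part of the diff). It also assumes one FSM page and one VM page per group, which is what the FSM/VM cases above appear to imply.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout constants; the patch's real values may differ. */
enum
{
	SK_ENTRIES_PER_PAGE = 2048,	/* assumed mapping entries per MAP page */
	SK_FIRST_GROUP = 1,			/* assumed first MAP page after the superblock */
	SK_GROUP_FSM_PAGES = 1,
	SK_GROUP_VM_PAGES = 1,
	SK_GROUP_MAIN_PAGES = 4,
	SK_GROUP_TOTAL_PAGES = SK_GROUP_FSM_PAGES + SK_GROUP_VM_PAGES +
		SK_GROUP_MAIN_PAGES
};

typedef enum { SK_FORK_MAIN, SK_FORK_FSM, SK_FORK_VM } SkFork;

typedef struct
{
	uint64_t	page;			/* MAP fork page holding the entry */
	uint32_t	entry;			/* entry index within that page */
} SkMapPos;

/* Locate the MAP page and entry slot that describe logical block lblk. */
static SkMapPos
sk_map_pos(SkFork fork, uint32_t lblk)
{
	SkMapPos	pos;
	uint64_t	page_idx = lblk / SK_ENTRIES_PER_PAGE;

	pos.entry = lblk % SK_ENTRIES_PER_PAGE;
	pos.page = SK_FIRST_GROUP;

	switch (fork)
	{
		case SK_FORK_FSM:
			/* FSM page is the first page of each group */
			pos.page += page_idx * SK_GROUP_TOTAL_PAGES;
			break;
		case SK_FORK_VM:
			/* VM page follows the FSM page(s) of the group */
			pos.page += page_idx * SK_GROUP_TOTAL_PAGES + SK_GROUP_FSM_PAGES;
			break;
		case SK_FORK_MAIN:
			/* MAIN pages follow the FSM and VM pages of their group */
			pos.page += (page_idx / SK_GROUP_MAIN_PAGES) * SK_GROUP_TOTAL_PAGES +
				SK_GROUP_FSM_PAGES + SK_GROUP_VM_PAGES +
				page_idx % SK_GROUP_MAIN_PAGES;
			break;
	}
	return pos;
}
```

Under these assumed constants, MAIN block 0 resolves to MAP page 3, entry 0, and MAIN block 8192 is the first block of the second group, at MAP page 9.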

-int
-umunlinkfiletag(const FileTag *ftag, char *path)
+/*
+ * Build identity MAP metadata for relations that stayed on direct lblk==pblk
+ * access during a skip-WAL window and now need durable mapped state.
+ */
+void
+UmRebuildMapAndSuperblockForSkipWAL(SMgrRelation reln)
 {
-	um_filetag_path(ftag, path);
-	return unlink(path);
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+
+	/*
+	 * Rebuild assumes the relation stayed on direct lblk==pblk access during
+	 * the skip-WAL window. No running path may consume MAP state before this
+	 * durable-transition rebuild runs.
+	 */
+	Assert(RelFileLocatorSkippingWAL(reln->smgr_rlocator.locator));
+
+	MapInvalidateRelation(reln->smgr_rlocator.locator);
+	Assert(UmMetadataExists(reln));
+
+	for (ForkNumber forknum = MAIN_FORKNUM; forknum <= VISIBILITYMAP_FORKNUM; forknum++)
+	{
+		BlockNumber nblocks;
+
+		if (!UmbraForkUsesMapTranslation(forknum))
+			continue;
+
+		if (!umfile_exists(ctx, forknum, UMFILE_EXISTS_DENSE))
+		{
+			MapSBlockSetLogicalNblocks(ctx, reln->smgr_rlocator.locator,
+									   forknum, 0, InvalidXLogRecPtr);
+			continue;
+		}
+
+		nblocks = umfile_nblocks(ctx, forknum, UMFILE_NBLOCKS_DENSE);
+
+		for (BlockNumber lblk = 0; lblk < nblocks; lblk++)
+			MapSetMapping(ctx, reln->smgr_rlocator.locator, forknum,
+						  lblk, lblk, InvalidXLogRecPtr);
+
+		if (nblocks > 0)
+		{
+			MapSBlockBumpNextFreePhysBlock(ctx, reln->smgr_rlocator.locator,
+										   forknum, nblocks, InvalidXLogRecPtr);
+			MapSBlockBumpPhysicalNblocks(ctx, reln->smgr_rlocator.locator,
+										 forknum, nblocks, InvalidXLogRecPtr);
+		}
+		MapSBlockSetLogicalNblocks(ctx, reln->smgr_rlocator.locator,
+								   forknum, nblocks, InvalidXLogRecPtr);
+	}
 }

-bool
-umfiletagmatches(const FileTag *ftag, const FileTag *candidate)
+void
+ummarkskipwalpending(SMgrRelation reln)
 {
-	if (ftag->forknum == InvalidForkNumber &&
-		ftag->segno == InvalidBlockNumber &&
-		ftag->rlocator.spcOid == 0 &&
-		ftag->rlocator.relNumber == 0)
-		return ftag->rlocator.dbOid == candidate->rlocator.dbOid;
+	UmbraFileContext *ctx = um_ctx_acquire(reln);

-	if (ftag->forknum == InvalidForkNumber &&
-		ftag->segno == InvalidBlockNumber)
-		return RelFileLocatorEquals(ftag->rlocator, candidate->rlocator);
+	if (!UmMetadataExists(reln))
+		ummapcreate(reln);

-	if (ftag->segno == InvalidBlockNumber)
-		return RelFileLocatorEquals(ftag->rlocator, candidate->rlocator) &&
-			ftag->forknum == candidate->forknum;
+	MapSBlockSetSkipWalPending(ctx, reln->smgr_rlocator.locator,
+							   true, InvalidXLogRecPtr);
+	um_set_cached_map_state(reln, UMBRA_MAP_POLICY_SKIP_WAL_PENDING_MAP);
+}

-	return RelFileLocatorEquals(ftag->rlocator, candidate->rlocator) &&
-		ftag->forknum == candidate->forknum &&
-		ftag->segno == candidate->segno;
+void
+umclearskipwalpending(SMgrRelation reln)
+{
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+
+	if (!UmMetadataExists(reln))
+		return;
+
+	MapSBlockSetSkipWalPending(ctx, reln->smgr_rlocator.locator,
+							   false, InvalidXLogRecPtr);
+	um_set_cached_map_state(reln, UMBRA_MAP_POLICY_REQUIRE_MAP);
+	/*
+	 * Clearing the durable-transition flag must itself be durable before the
+	 * transaction can be considered to have left the skip-WAL state.  If we
+	 * only dirty the shared superblock copy here, a crash before checkpoint
+	 * would resurrect SKIP_WAL_PENDING from disk on restart.
+	 */
+	umimmedsync(reln, UMBRA_METADATA_FORKNUM);
 }

-static UmbraFileContext *
-um_relation_filectx(SMgrRelation reln)
+static bool
+um_lblk_precedes_logical_eof_for_access(SMgrRelation reln, ForkNumber forknum,
+										const UmbraAccessState *access,
+										UmbraAccessLookupState *lookup_state,
+										BlockNumber lblkno)
 {
-	UmbraSmgrRelationState *state = reln->smgr_private;
+	BlockNumber logical_nblocks;

-	if (state == NULL)
-		return umfile_ctx_acquire(reln->smgr_rlocator);
+	if (!access->map_available)
+		return false;

-	if (state->filectx == NULL)
-		state->filectx = umfile_ctx_acquire(reln->smgr_rlocator);
+	if (!um_fork_uses_map_translation(forknum))
+		return false;
+
+	if (lookup_state != NULL && lookup_state->have_logical_nblocks)
+		logical_nblocks = lookup_state->logical_nblocks;
+	else
+	{
+		logical_nblocks = umnblocks_for_access(reln, forknum, access);
+		if (lookup_state != NULL)
+		{
+			lookup_state->logical_nblocks = logical_nblocks;
+			lookup_state->have_logical_nblocks = true;
+		}
+	}

-	return state->filectx;
+	return lblkno < logical_nblocks;
 }

 static bool
-um_tracks_identity_metadata(ForkNumber forknum)
+um_is_logical_unmaterialized_for_access(SMgrRelation reln, ForkNumber forknum,
+										const UmbraAccessState *access,
+										BlockNumber lblkno)
 {
-	return forknum == MAIN_FORKNUM ||
-		forknum == FSM_FORKNUM ||
-		forknum == VISIBILITYMAP_FORKNUM;
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	BlockNumber pblkno;
+
+	if (!um_lblk_precedes_logical_eof_for_access(reln, forknum, access, NULL,
+												 lblkno))
+		return false;
+
+	return !MapTryLookup(ctx, reln->smgr_rlocator.locator,
+						 forknum, lblkno, &pblkno);
 }

 static void
-um_ensure_redo_metadata(SMgrRelation reln, ForkNumber forknum)
+um_reserve_fresh_pblkno_for_access(SMgrRelation reln, ForkNumber forknum,
+								   const UmbraAccessState *access,
+								   BlockNumber lblkno,
+								   BlockNumber *new_pblkno)
 {
-	Assert(reln != NULL);
+	UmbraFileContext *ctx = um_ctx_acquire(reln);

-	if (!InRecovery ||
-		RelFileLocatorBackendIsTemp(reln->smgr_rlocator) ||
-		!um_tracks_identity_metadata(forknum) ||
-		UmMetadataExists(reln))
+	Assert(new_pblkno != NULL);
+
+	if (!access->map_available)
+	{
+		*new_pblkno = lblkno;
 		return;
+	}

-	/*
-	 * Redo can materialize a new data fork via mdwritev()/mdextend() without a
-	 * preceding smgrcreate() callback, for example during CREATE DATABASE
-	 * WAL-log replay. Ensure metadata exists before MAP state is consulted or
-	 * checkpointed for that relation.
-	 */
-	elog(DEBUG1, "umbra redo ensure metadata %u/%u/%u fork=%d",
-		 reln->smgr_rlocator.locator.spcOid,
-		 reln->smgr_rlocator.locator.dbOid,
-		 reln->smgr_rlocator.locator.relNumber,
-		 forknum);
-	umcreaterelationmetadata(reln);
+	if (!MapReserveFreshPblkno(ctx, reln->smgr_rlocator.locator,
+							   forknum, lblkno, new_pblkno))
+		elog(ERROR,
+			 "failed to reserve fresh physical block for relation %u/%u/%u fork %d blk %u",
+			 reln->smgr_rlocator.locator.spcOid,
+			 reln->smgr_rlocator.locator.dbOid,
+			 reln->smgr_rlocator.locator.relNumber,
+			 forknum, lblkno);
 }

-static void
-um_identity_update_metadata(SMgrRelation reln, ForkNumber forknum,
-							BlockNumber nblocks, bool fork_exists)
+static UmbraAccessResolveResult
+um_resolve_lblk_for_read(SMgrRelation reln, ForkNumber forknum,
+						 const UmbraAccessState *access,
+						 UmbraAccessLookupState *lookup_state,
+						 BlockNumber lblkno)
 {
-	UmbraFileContext *ctx = um_relation_filectx(reln);
-	BlockNumber logical_nblocks;
+	Assert(lookup_state != NULL);

-	Assert(reln != NULL);
-	Assert(um_tracks_identity_metadata(forknum));
-	Assert(UmMetadataExists(reln));
+	if (!lookup_state->have_logical_nblocks)
+	{
+		lookup_state->logical_nblocks =
+			umnblocks_for_access(reln, forknum, access);
+		lookup_state->have_logical_nblocks = true;
+	}

-	if (!MapSBlockEnsureLoaded(ctx, reln->smgr_rlocator.locator))
-		elog(ERROR, "could not load MAP superblock for relation %u/%u/%u",
-			 reln->smgr_rlocator.locator.spcOid,
-			 reln->smgr_rlocator.locator.dbOid,
-			 reln->smgr_rlocator.locator.relNumber);
+	if (InRecovery && UmbraForkIsAuxiliaryMapped(forknum) &&
+		lookup_state->logical_nblocks != InvalidBlockNumber &&
+		um_is_stale_post_truncate_lblk_with_eof(forknum,
+												lookup_state->logical_nblocks,
+												lblkno))
+	{
+		return UMBRA_ACCESS_RESOLVED_ZERO;
+	}
+
+	if (lookup_state->logical_nblocks != InvalidBlockNumber &&
+		lblkno < lookup_state->logical_nblocks)
+	{
+		return UMBRA_ACCESS_RESOLVED_ZERO;
+	}
+
+	um_report_unmapped_map_entry(reln, forknum, access, lblkno);
+	pg_unreachable();
+}
+
+static UmbraAccessResolveResult
+um_resolve_lblk_for_write(SMgrRelation reln, ForkNumber forknum,
+						  const UmbraAccessState *access,
+						  UmbraAccessLookupState *lookup_state,
+						  BlockNumber lblkno, BlockNumber *pblkno)
+{
+	(void) lookup_state;
+	(void) pblkno;
+
+	um_report_unmapped_map_entry(reln, forknum, access, lblkno);
+	pg_unreachable();
+}
+
+static UmbraAccessResolveResult
+um_resolve_lblk_for_writeback(SMgrRelation reln, ForkNumber forknum,
+							  const UmbraAccessState *access,
+							  UmbraAccessLookupState *lookup_state,
+							  BlockNumber lblkno, BlockNumber *pblkno)
+{
+	if (um_lblk_precedes_logical_eof_for_access(reln, forknum, access,
+												lookup_state, lblkno))
+	{
+		if (MapInflightLookupOwnedPblk(reln->smgr_rlocator.locator,
+									   forknum, lblkno, pblkno))
+			return UMBRA_ACCESS_RESOLVED_PBLK;
+		return UMBRA_ACCESS_RESOLVED_SKIP;
+	}
+
+	if (um_is_stale_post_truncate_lblk_for_access(reln, forknum, access,
+												  lookup_state, lblkno))
+		return UMBRA_ACCESS_RESOLVED_SKIP;
+
+	um_report_unmapped_map_entry(reln, forknum, access, lblkno);
+	pg_unreachable();
+}
+
+static UmbraAccessResolveResult
+um_resolve_lblk_for_access(SMgrRelation reln, ForkNumber forknum,
+						   const UmbraAccessState *access,
+						   UmbraAccessLookupState *lookup_state,
+						   BlockNumber lblkno,
+						   UmbraAccessResolveMode mode,
+						   BlockNumber *pblkno)
+{
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	bool		found;
+
+	Assert(pblkno != NULL);
+
+	if (!access->map_available)
+	{
+		*pblkno = lblkno;
+		return UMBRA_ACCESS_RESOLVED_PBLK;
+	}

-	if (!fork_exists && forknum != MAIN_FORKNUM)
-		logical_nblocks = InvalidBlockNumber;
+	if (mode == UMBRA_ACCESS_RESOLVE_READ)
+	{
+		found = MapTryLookup(ctx, reln->smgr_rlocator.locator,
+							 forknum, lblkno, pblkno);
+	}
 	else
-		logical_nblocks = nblocks;
+	{
+		found = MapTryLookup(ctx, reln->smgr_rlocator.locator,
+							 forknum, lblkno, pblkno);
+	}
+
+	if (found)
+		return UMBRA_ACCESS_RESOLVED_PBLK;

-	MapSBlockSetLogicalNblocks(ctx, reln->smgr_rlocator.locator,
-							   forknum, logical_nblocks,
-							   InvalidXLogRecPtr);
+	switch (mode)
+	{
+		case UMBRA_ACCESS_RESOLVE_READ:
+			return um_resolve_lblk_for_read(reln, forknum, access,
+											lookup_state, lblkno);
+		case UMBRA_ACCESS_RESOLVE_WRITE:
+			return um_resolve_lblk_for_write(reln, forknum, access,
+											 lookup_state, lblkno, pblkno);
+		case UMBRA_ACCESS_RESOLVE_WRITEBACK:
+			return um_resolve_lblk_for_writeback(reln, forknum, access,
+												 lookup_state, lblkno, pblkno);
+	}
+
+	pg_unreachable();
+}

-	if (fork_exists || forknum == MAIN_FORKNUM)
+/*
+ * Resolve the longest read prefix beginning at blocknum whose translated
+ * physical blocks form one contiguous run within a single segment.
+ */
+static BlockNumber
+um_resolve_mapped_read_run(SMgrRelation reln, ForkNumber forknum,
+						   const UmbraAccessState *access,
+						   UmbraAccessLookupState *lookup_state,
+						   BlockNumber blocknum, BlockNumber maxblocks,
+						   BlockNumber *start_pblk)
+{
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	BlockNumber	run_blocks;
+	BlockNumber	pblk;
+
+	Assert(access != NULL);
+	Assert(start_pblk != NULL);
+	Assert(maxblocks > 0);
+
+	run_blocks = MapTryLookupPblkRun(ctx, reln->smgr_rlocator.locator,
+									 forknum, blocknum, maxblocks,
+									 start_pblk);
+	if (run_blocks > 0)
+		return run_blocks;
+
+	if (um_resolve_lblk_for_access(reln, forknum, access, lookup_state,
+								   blocknum, UMBRA_ACCESS_RESOLVE_READ,
+								   &pblk) == UMBRA_ACCESS_RESOLVED_PBLK)
 	{
+		*start_pblk = pblk;
+		return 1;
+	}
+
+	return 0;
+}
+
+static UmbraMappedBirthResult
+um_publish_mapped_birth(SMgrRelation reln, ForkNumber forknum,
+						const UmbraAccessState *access,
+						BlockNumber lblkno, bool allow_wal_owned_firstborn)
+{
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	UmbraMappedBirthResult result;
+	BlockNumber old_pblkno;
+
+	Assert(access->map_available);
+	(void) allow_wal_owned_firstborn;
+
+	result.mapping_published = false;
+
+	MapGetNewPbkno(ctx, reln->smgr_rlocator.locator, forknum, lblkno,
+				   &result.pblkno, &old_pblkno);
+	Assert(old_pblkno == InvalidBlockNumber);
+
+	MapSetMapping(ctx, reln->smgr_rlocator.locator, forknum, lblkno,
+				  result.pblkno, InvalidXLogRecPtr);
+	result.mapping_published = true;
+
+	if (result.mapping_published)
 		MapSBlockBumpNextFreePhysBlock(ctx, reln->smgr_rlocator.locator,
-									   forknum, nblocks,
+									   forknum, result.pblkno + 1,
 									   InvalidXLogRecPtr);
-		MapSBlockBumpPhysicalNblocks(ctx, reln->smgr_rlocator.locator,
-									 forknum, nblocks,
-									 InvalidXLogRecPtr);
-	}
+
+	MapInflightRelease(reln->smgr_rlocator.locator, forknum, lblkno);
+	return result;
 }

+/*
+ * Auxiliary mapped forks (FSM/VM) can observe stale logical blocks during
+ * replay after truncate/VACUUM maintenance has already shrunk the
+ * authoritative logical EOF.  Those callers historically expect EOF-like
+ * semantics, not a hard mapped-fork corruption error.
+ *
+ * Recovery reads are synchronous, but they still come through the AIO/bufmgr
+ * pipeline.  Complete the read locally as a zero page so the normal shared
+ * buffer completion callbacks still run and mark the buffer valid, without
+ * issuing any physical I/O against an unmapped/stale block.
+ */
 static void
-um_refresh_identity_metadata(SMgrRelation reln)
+um_complete_zero_readv(PgAioHandle *ioh, SMgrRelation reln,
+					   ForkNumber forknum, BlockNumber blocknum,
+					   void *buffer)
 {
-	ForkNumber	forknum;
+	Assert(ioh != NULL);
+	Assert(buffer != NULL);
+	Assert(!INTERRUPTS_CAN_BE_PROCESSED());
+	Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+	Assert(pgaio_my_backend->handed_out_io == ioh);

-	Assert(UmMetadataExists(reln));
+	memset(buffer, 0, BLCKSZ);
+
+	pgaio_io_set_target_smgr(ioh, reln, forknum,
+							 blocknum,
+							 blocknum,
+							 1, false);
+	pgaio_io_register_callbacks(ioh, PGAIO_HCB_MD_READV, 0);

-	for (forknum = MAIN_FORKNUM; forknum <= VISIBILITYMAP_FORKNUM; forknum++)
+	/*
+	 * Mirror the minimal pgaio_io_stage() state transitions needed to invoke
+	 * the normal smgr + buffer read completion callbacks, but do not start an
+	 * actual readv against any file.
+	 */
+	ioh->op = PGAIO_OP_READV;
+	ioh->result = 0;
+	ioh->state = PGAIO_HS_DEFINED;
+	pgaio_my_backend->handed_out_io = NULL;
+
+	pgaio_io_call_stage(ioh);
+
+	ioh->state = PGAIO_HS_STAGED;
+	pgaio_io_prepare_submit(ioh);
+
+	START_CRIT_SECTION();
+	pgaio_io_process_completion(ioh, BLCKSZ);
+	END_CRIT_SECTION();
+}
+
+/*
+ * PG18 truncation order is:
+ *   1. RelationTruncate() emits truncate WAL
+ *   2. smgrpretruncate() lets Umbra preload MAP pages before entering the
+ *      critical section
+ *   3. smgrtruncate() later drops old shared buffers and performs the
+ *      truncate-time metadata update
+ *
+ * That means a backend can briefly attempt to flush a stale dirty buffer for
+ * a block that has already been truncated away logically, but has not yet
+ * been removed from shared buffers.  Such a write must be ignored, not turned
+ * into a new mapping owner.  Only blocks still inside the authoritative
+ * logical EOF are allowed to demand a mapping.
+ */
+static bool
+um_is_stale_post_truncate_lblk_for_access(SMgrRelation reln,
+										  ForkNumber forknum,
+										  const UmbraAccessState *access,
+										  UmbraAccessLookupState *lookup_state,
+										  BlockNumber lblkno)
+{
+	BlockNumber logical_nblocks;
+
+	if (!access->map_available)
+		return false;
+
+	if (!um_fork_uses_map_translation(forknum))
+		return false;
+
+	/*
+	 * MAIN fork remains strict during recovery: missing mappings there are
+	 * corruption signals. Auxiliary mapped forks (FSM/VM) can legitimately
+	 * walk just beyond logical EOF during truncate/vacuum maintenance, both
+	 * in normal execution and during replay.
+	 */
+	if (InRecovery && !UmbraForkIsAuxiliaryMapped(forknum))
+		return false;
+
+	/*
+	 * Use the same authoritative logical EOF that the rest of the system sees.
+	 * Umbra should have only one logical-size source of truth; this stale
+	 * post-truncate path must not invent a second one by scanning MAP pages.
+	 */
+	if (lookup_state != NULL && lookup_state->have_logical_nblocks)
+		logical_nblocks = lookup_state->logical_nblocks;
+	else
 	{
-		bool		fork_exists;
-		BlockNumber nblocks;
+		logical_nblocks = umnblocks_for_access(reln, forknum, access);
+		if (lookup_state != NULL)
+		{
+			lookup_state->logical_nblocks = logical_nblocks;
+			lookup_state->have_logical_nblocks = true;
+		}
+	}

-		if (!um_tracks_identity_metadata(forknum))
-			continue;
+	return um_is_stale_post_truncate_lblk_with_eof(forknum,
+												   logical_nblocks,
+												   lblkno);
+}
+
+static bool
+um_is_stale_post_truncate_lblk_with_eof(ForkNumber forknum,
+										BlockNumber logical_nblocks,
+										BlockNumber lblkno)
+{
+	if (!um_fork_uses_map_translation(forknum))
+		return false;
+
+	if (logical_nblocks == InvalidBlockNumber)
+		return false;
+
+	return lblkno >= logical_nblocks;
+}
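
The stale post-truncate rule described above, combined with the writeback resolution earlier in this hunk, can be summarized as a small decision function. The following is only a sketch of how an unmapped lblk appears to be classified on the writeback path, under simplifying assumptions: the logical EOF is already known (not InvalidBlockNumber), the fork uses MAP translation, and the in-flight lookup is reduced to a boolean. SkResolve and sk_resolve_unmapped_writeback() are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum
{
	SK_RESOLVED_PBLK,			/* write goes to a known physical block */
	SK_RESOLVED_SKIP,			/* write is silently dropped */
	SK_RESOLVED_ERROR			/* missing mapping is treated as corruption */
} SkResolve;

/*
 * Classify a writeback request for a logical block that has no MAP entry.
 * logical_nblocks is the authoritative logical EOF (assumed valid here).
 */
static SkResolve
sk_resolve_unmapped_writeback(uint32_t logical_nblocks, uint32_t lblk,
							  bool has_inflight_pblk,
							  bool in_recovery, bool is_aux_fork)
{
	/* Inside logical EOF: use an in-flight reservation if one exists. */
	if (lblk < logical_nblocks)
		return has_inflight_pblk ? SK_RESOLVED_PBLK : SK_RESOLVED_SKIP;

	/*
	 * At or beyond logical EOF: a stale buffer left over from truncate is
	 * ignored, except that MAIN stays strict during recovery.
	 */
	if (!(in_recovery && !is_aux_fork))
		return SK_RESOLVED_SKIP;

	return SK_RESOLVED_ERROR;
}
```

The point of the sketch is the asymmetry: outside recovery any fork may skip a stale write beyond EOF, while during recovery only the auxiliary forks (FSM/VM) get that tolerance.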
+
+BlockNumber
+umphysicalblock(SMgrRelation reln, ForkNumber forknum, BlockNumber lblkno)
+{
+	UmbraAccessState access;
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	BlockNumber pblkno;
+
+	access = um_classify_access(reln, forknum);
+	if (!access.map_available)
+		return lblkno;
+
+	if (MapTryLookup(ctx, reln->smgr_rlocator.locator,
+					 forknum, lblkno, &pblkno))
+		return pblkno;
+
+	um_report_unmapped_map_entry(reln, forknum, &access, lblkno);
+	pg_unreachable();
+}
+
+void
+UmMapGetNewPbkno(SMgrRelation reln, ForkNumber forknum,
+				 BlockNumber lblkno, BlockNumber *new_pblkno,
+				 BlockNumber *old_pblkno)
+{
+	UmbraAccessState access;
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+
+	Assert(new_pblkno != NULL);
+	Assert(old_pblkno != NULL);
+
+	access = um_classify_access(reln, forknum);
+	if (!access.map_available)
+	{
+		*old_pblkno = lblkno;
+		*new_pblkno = lblkno;
+		return;
+	}
+
+	MapGetNewPbkno(ctx, reln->smgr_rlocator.locator, forknum,
+				   lblkno, new_pblkno, old_pblkno);
+}

-		fork_exists = mdexists(reln, forknum);
-		nblocks = fork_exists ? mdnblocks(reln, forknum) : 0;
-		um_identity_update_metadata(reln, forknum, nblocks, fork_exists);
+void
+UmMapReserveFreshPbkno(SMgrRelation reln, ForkNumber forknum,
+					   BlockNumber lblkno, BlockNumber *new_pblkno)
+{
+	UmbraAccessState access;
+
+	access = um_classify_access(reln, forknum);
+	um_reserve_fresh_pblkno_for_access(reln, forknum, &access,
+									   lblkno, new_pblkno);
+}
+
+bool
+UmMapAccessAvailable(SMgrRelation reln, ForkNumber forknum)
+{
+	UmbraAccessState access;
+
+	access = um_classify_access(reln, forknum);
+	return access.map_available;
+}
+
+bool
+UmMapTryLookupPblkno(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber lblkno, BlockNumber *pblkno)
+{
+	UmbraAccessState access;
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+
+	access = um_classify_access(reln, forknum);
+	if (!access.map_available)
+	{
+		*pblkno = lblkno;
+		return true;
 	}
+
+	return MapTryLookup(ctx, reln->smgr_rlocator.locator,
+						forknum, lblkno, pblkno);
+}
+
+bool
+UmMapIsLogicalUnmaterialized(SMgrRelation reln, ForkNumber forknum,
+							 BlockNumber lblkno)
+{
+	UmbraAccessState access;
+
+	access = um_classify_access(reln, forknum);
+	return um_is_logical_unmaterialized_for_access(reln, forknum, &access,
+												   lblkno);
+}
+
+void
+UmMapSetMapping(SMgrRelation reln, ForkNumber forknum,
+				BlockNumber lblkno, BlockNumber new_pblkno,
+				XLogRecPtr map_lsn)
+{
+	UmbraAccessState access;
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+
+	access = um_classify_access(reln, forknum);
+	if (!access.map_available)
+		return;
+
+	MapSetMapping(ctx, reln->smgr_rlocator.locator,
+				  forknum, lblkno, new_pblkno, map_lsn);
 }

 static void
@@ -636,3 +1375,953 @@ um_filetag_path(const FileTag *ftag, char *path)
 		snprintf(path, MAXPGPATH, "%s.%llu",
 				 base.str, (unsigned long long) ftag->segno);
 }
+
+void
+umcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	bool		created = false;
+
+	if (!umfile_open_or_create(ctx, forknum, isRedo, &created))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create or open relation %u/%u/%u fork %d",
+						reln->smgr_rlocator.locator.spcOid,
+						reln->smgr_rlocator.locator.dbOid,
+						reln->smgr_rlocator.locator.relNumber,
+						forknum)));
+
+	if (created &&
+		UmbraForkIsAuxiliaryMapped(forknum) &&
+		UmMetadataExists(reln))
+	{
+		XLogRecPtr	map_lsn;
+
+		map_lsn = InRecovery ? GetXLogReplayRecPtr(NULL) : GetXLogWriteRecPtr();
+		MapSBlockSetLogicalNblocks(ctx, reln->smgr_rlocator.locator,
+								   forknum, 0, map_lsn);
+	}
+}
+
+bool
+umexists(SMgrRelation reln, ForkNumber forknum)
+{
+	UmbraMapPolicy policy;
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+
+	if (!um_fork_uses_map_translation(forknum))
+		return umfile_exists(ctx, forknum, UMFILE_EXISTS_DENSE);
+
+	policy = um_map_policy_for_access(reln, forknum);
+	if (policy != UMBRA_MAP_POLICY_REQUIRE_MAP)
+		return umfile_exists(ctx, forknum, UMFILE_EXISTS_DENSE);
+
+	return um_mapped_exists_from_super(reln, forknum);
+}
+
+void
+umunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
+{
+	/*
+	 * Keep MAIN and MAP fork deletion ordered so the mapping lifecycle tracks
+	 * the data fork during DROP processing and relfilenode reuse.
+	 */
+	if (forknum == InvalidForkNumber)
+	{
+		MapInvalidateRelation(rlocator.locator);
+
+		umfile_unlink(rlocator, MAIN_FORKNUM, isRedo);
+		UmMetadataUnlink(rlocator, isRedo);
+
+		for (ForkNumber other = FSM_FORKNUM; other <= INIT_FORKNUM; other++)
+			umfile_unlink(rlocator, other, isRedo);
+		return;
+	}
+
+	if (forknum == UMBRA_METADATA_FORKNUM)
+		MapInvalidateRelation(rlocator.locator);
+	umfile_unlink(rlocator, forknum, isRedo);
+}
+
+void
+umextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 const void *buffer, bool skipFsync)
+{
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	UmbraAccessState access;
+	BlockNumber	pblkno;
+	BlockNumber	logical_nblocks;
+	BlockNumber materialized_nblocks = 0;
+	BlockNumber	max_pblkno = InvalidBlockNumber;
+	bool		mapping_committed = false;
+
+	access = um_classify_access(reln, forknum);
+
+	/* Non-mapped forks (and MAP fork itself) extend physically. */
+	if (!access.map_available)
+	{
+		pblkno = blocknum;
+		umfile_extend(ctx, forknum, pblkno, buffer, skipFsync);
+		return;
+	}
+
+	/*
+	 * smgrextend() contract allows blocknum beyond current EOF and requires
+	 * intervening space to read as zeros.  For mapped forks, we must explicitly
+	 * create mappings and materialize zero pages for the gap.
+	 */
+	logical_nblocks = umnblocks_for_access(reln, forknum, &access);
+	(void) MapSBlockTryGetPhysicalNblocks(ctx, reln->smgr_rlocator.locator,
+										  forknum, &materialized_nblocks);
+	if (blocknum > logical_nblocks)
+	{
+		umzeroextend(reln, forknum, logical_nblocks,
+					 (int) (blocknum - logical_nblocks),
+					 skipFsync);
+	}
+
+	/*
+	 * For mapped data forks, smgrextend is the point where an unmapped logical
+	 * block gets its first physical block. Existing mappings can appear during
+	 * redo/replay and should be reused.
+	 */
+	if (!MapTryLookup(ctx, reln->smgr_rlocator.locator, forknum, blocknum, &pblkno))
+	{
+		UmbraMappedBirthResult birth;
+
+		birth = um_publish_mapped_birth(reln, forknum, &access, blocknum, true);
+		pblkno = birth.pblkno;
+		mapping_committed = birth.mapping_published;
+
+		/*
+		 * Reservation/publication can make the mapping visible before the data
+		 * fork is physically materialized. Extend when the chosen pblk is still
+		 * beyond the materialized physical EOF; otherwise just write the page.
+		 */
+		/*
+		 * Page checksum stays keyed by the logical block identity; callers
+		 * reaching smgrextend() have already set it using the logical blkno.
+		 */
+		if (pblkno >= materialized_nblocks)
+		{
+			umfile_extend(ctx, forknum, pblkno, buffer, skipFsync);
+			max_pblkno = pblkno;
+		}
+		else
+		{
+			const void *single_buffer[1];
+
+			single_buffer[0] = buffer;
+			umfile_writev(ctx, forknum, pblkno, single_buffer, 1, skipFsync);
+		}
+	}
+	else
+	{
+		if (pblkno >= materialized_nblocks)
+		{
+			umfile_extend(ctx, forknum, pblkno, buffer, skipFsync);
+			max_pblkno = pblkno;
+		}
+		else
+		{
+			const void *single_buffer[1];
+
+			single_buffer[0] = buffer;
+			umfile_writev(ctx, forknum, pblkno, single_buffer, 1, skipFsync);
+		}
+	}
+
+	if (mapping_committed)
+		MapSBlockBumpLogicalNblocks(ctx, reln->smgr_rlocator.locator,
+										forknum, blocknum + 1,
+										InvalidXLogRecPtr);
+	if (max_pblkno != InvalidBlockNumber)
+		MapSBlockBumpPhysicalNblocks(ctx, reln->smgr_rlocator.locator,
+										 forknum, max_pblkno + 1,
+										 InvalidXLogRecPtr);
+}
+
+void
+umzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+			 int nblocks, bool skipFsync)
+{
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	UmbraAccessState access;
+	BlockNumber	run_start_pblk;
+	BlockNumber	max_pblkno;
+	int			run_len;
+
+	if (nblocks <= 0)
+		return;
+
+	access = um_classify_access(reln, forknum);
+
+	/* Direct physical path for non-mapped forks. */
+	if (!access.map_available)
+	{
+		umfile_zeroextend(ctx, forknum, blocknum, nblocks,
+						  skipFsync);
+		return;
+	}
+
+	/*
+	 * Per-block path for single-block, recovery, or callers that encountered
+	 * pre-existing/pending MAP ownership in the requested range.
+	 *
+	 * For mapped forks we must materialize each newly-mapped physical page as
+	 * a zero page, otherwise a later read through MAP would hit EOF/short read.
+	 *
+	 * Map allocator hands out sequential pblknos, so we can batch contiguous
+	 * physical ranges with umfile_zeroextend().
+	 */
+	run_start_pblk = InvalidBlockNumber;
+	max_pblkno = InvalidBlockNumber;
+	run_len = 0;
+
+	for (int i = 0; i < nblocks; i++)
+	{
+		BlockNumber lblk = blocknum + (BlockNumber) i;
+		BlockNumber pblk;
+
+		if (!MapTryLookup(ctx, reln->smgr_rlocator.locator, forknum, lblk, &pblk))
+		{
+			UmbraMappedBirthResult birth;
+
+			birth = um_publish_mapped_birth(reln, forknum, &access, lblk, false);
+			pblk = birth.pblkno;
+		}
+
+		if (max_pblkno == InvalidBlockNumber || pblk > max_pblkno)
+			max_pblkno = pblk;
+
+		if (run_len == 0)
+		{
+			run_start_pblk = pblk;
+			run_len = 1;
+		}
+		else if (pblk == run_start_pblk + (BlockNumber) run_len)
+		{
+			run_len++;
+		}
+		else
+		{
+			umfile_zeroextend(ctx, forknum, run_start_pblk,
+							  run_len, skipFsync);
+			run_start_pblk = pblk;
+			run_len = 1;
+		}
+	}
+
+	if (run_len > 0)
+		umfile_zeroextend(ctx, forknum, run_start_pblk, run_len,
+						  skipFsync);
+
+	if (max_pblkno != InvalidBlockNumber)
+		MapSBlockBumpPhysicalNblocks(ctx, reln->smgr_rlocator.locator,
+									 forknum, max_pblkno + 1,
+									 InvalidXLogRecPtr);
+
+	MapSBlockBumpLogicalNblocks(ctx, reln->smgr_rlocator.locator,
+								forknum, blocknum + (BlockNumber) nblocks,
+								InvalidXLogRecPtr);
+}
+
+bool
+umprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, int nblocks)
+{
+	UmbraAccessState access;
+	UmbraAccessLookupState lookup_state = {0};
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+
+	access = um_classify_access(reln, forknum);
+
+	if (!access.map_available)
+		return umfile_prefetch(ctx, forknum, blocknum, nblocks);
+
+	for (int i = 0; i < nblocks; i++)
+	{
+		BlockNumber lblk = blocknum + (BlockNumber) i;
+		BlockNumber pblk;
+
+		if (um_resolve_lblk_for_access(reln, forknum, &access, &lookup_state,
+									   lblk, UMBRA_ACCESS_RESOLVE_READ,
+									   &pblk) != UMBRA_ACCESS_RESOLVED_PBLK)
+			continue;
+
+		if (!umfile_prefetch(ctx, forknum, pblk, 1))
+			return false;
+	}
+	return true;
+}
+
+uint32
+ummaxcombine(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+	UmbraAccessState access;
+	UmbraAccessLookupState lookup_state = {0};
+	BlockNumber	pblk;
+	BlockNumber	run_blocks;
+
+	access = um_classify_access(reln, forknum);
+
+	/*
+	 * For mapped forks we can only combine a read while the translated physical
+	 * blocks remain contiguous and stay inside one segment.
+	 */
+	if (access.map_available)
+	{
+		if (InRecovery && UmbraForkIsAuxiliaryMapped(forknum))
+			return 1;
+
+		run_blocks = um_resolve_mapped_read_run(reln, forknum, &access,
+												 &lookup_state, blocknum,
+												 Min((BlockNumber) io_max_combine_limit,
+													 (BlockNumber) umfile_maxcombine(forknum,
+																			 blocknum)),
+												 &pblk);
+		return Max((BlockNumber) 1, run_blocks);
+	}
+	return umfile_maxcombine(forknum, blocknum);
+}
+
+void
+umreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		void **buffers, BlockNumber nblocks)
+{
+	UmbraAccessState access;
+	UmbraAccessLookupState lookup_state = {0};
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	bool		aux_recovery_read;
+
+	access = um_classify_access(reln, forknum);
+
+	if (!access.map_available)
+	{
+		umfile_readv(ctx, forknum, blocknum, buffers, nblocks);
+		return;
+	}
+
+	aux_recovery_read = InRecovery && UmbraForkIsAuxiliaryMapped(forknum);
+
+	for (BlockNumber i = 0; i < nblocks;)
+	{
+		BlockNumber	pblk;
+		BlockNumber	run_blocks;
+
+		run_blocks = um_resolve_mapped_read_run(reln, forknum, &access,
+												 &lookup_state, blocknum + i,
+												 aux_recovery_read ? 1 :
+												 (nblocks - i),
+												 &pblk);
+		if (run_blocks == 0)
+		{
+			memset(buffers[i], 0, BLCKSZ);
+			i++;
+			continue;
+		}
+
+		/*
+		 * Preserve md-style sync-read semantics for mapped forks by routing the
+		 * translated physical block through umfile_readv().
+		 *
+		 * Callers such as VM/FSM redo use RBM_ZERO_ON_ERROR and expect
+		 * InRecovery/zero_damaged_pages handling on short reads.  A direct
+		 * physical read would bypass that behavior and fail before bufmgr gets a
+		 * chance to zero the page.
+		 */
+		if (aux_recovery_read)
+			um_ensure_datafork_batch_ready_for_access(reln, forknum, &access,
+													  pblk, true /* skipFsync */ );
+
+		umfile_readv(ctx, forknum, pblk, &buffers[i], run_blocks);
+		i += run_blocks;
+	}
+}
+
+static void
+um_startreadv_direct_physical(PgAioHandle *ioh, SMgrRelation reln,
+							  UmbraFileContext *ctx, ForkNumber forknum,
+							  BlockNumber blocknum, void **buffers,
+							  BlockNumber nblocks)
+{
+	pgaio_io_set_target_smgr(ioh, reln, forknum,
+							 blocknum /* logical */,
+							 blocknum /* physical */,
+							 nblocks, false);
+	pgaio_io_register_callbacks(ioh, PGAIO_HCB_MD_READV, 0);
+	umfile_startreadv(ioh, ctx, forknum, blocknum, buffers, nblocks);
+}
+
+static BlockNumber
+um_startreadv_lookup_mapped(PgAioHandle *ioh, SMgrRelation reln,
+							ForkNumber forknum, BlockNumber blocknum,
+							void **buffers, BlockNumber nblocks,
+							const UmbraAccessState *access,
+							UmbraAccessLookupState *lookup_state,
+							bool aux_recovery_read,
+							BlockNumber *pblk)
+{
+	BlockNumber	run_blocks;
+
+	Assert(pblk != NULL);
+
+	run_blocks = um_resolve_mapped_read_run(reln, forknum, access, lookup_state,
+											 blocknum,
+											 aux_recovery_read ? 1 : nblocks,
+											 pblk);
+	if (run_blocks > 0)
+		return run_blocks;
+
+	ioh->handle_data_len = 1;
+	um_complete_zero_readv(ioh, reln, forknum, blocknum, buffers[0]);
+	return 0;
+}
+
+static void
+um_startreadv_mapped_physical(PgAioHandle *ioh, SMgrRelation reln,
+							  UmbraFileContext *ctx, ForkNumber forknum,
+							  BlockNumber blocknum, void **buffers,
+							  BlockNumber nblocks, BlockNumber pblk,
+							  bool aux_recovery_read,
+							  const UmbraAccessState *access)
+{
+	if (aux_recovery_read)
+	{
+		uint64		ensured_bytes = 0;
+
+		if (!umfile_ctx_block_exists(ctx, forknum, pblk))
+		{
+			um_ensure_datafork_batch_ready_for_access(reln, forknum, access,
+													  pblk, true /* skipFsync */ );
+			ensured_bytes = BLCKSZ;
+		}
+		(void) ensured_bytes;
+	}
+
+	pgaio_io_set_target_smgr(ioh, reln, forknum,
+							 blocknum /* logical */,
+							 pblk /* physical */,
+							 nblocks, false);
+	pgaio_io_register_callbacks(ioh, PGAIO_HCB_MD_READV, 0);
+	umfile_startreadv_physical(ioh, ctx, forknum,
+							   blocknum /* logical */,
+							   pblk /* physical */,
+							   buffers, nblocks);
+}
+
+void
+umstartreadv(PgAioHandle *ioh, SMgrRelation reln, ForkNumber forknum,
+			 BlockNumber blocknum, void **buffers, BlockNumber nblocks)
+{
+	UmbraAccessState access;
+	UmbraAccessLookupState lookup_state = {0};
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	BlockNumber	pblk;
+	bool		aux_recovery_read;
+
+	access = um_classify_access(reln, forknum);
+
+	if (!access.map_available)
+	{
+		um_startreadv_direct_physical(ioh, reln, ctx, forknum,
+									  blocknum, buffers, nblocks);
+		return;
+	}
+
+	/*
+	 * See ummaxcombine(): callers may only combine a prefix whose translated
+	 * physical blocks form one contiguous run.  If the mapping changed since
+	 * that check, shrink this I/O to the currently valid prefix and let
+	 * WaitReadBuffers() retry the remainder.
+	 */
+	aux_recovery_read = InRecovery && UmbraForkIsAuxiliaryMapped(forknum);
+
+	{
+		BlockNumber	run_blocks;
+
+		run_blocks = um_startreadv_lookup_mapped(ioh, reln, forknum, blocknum,
+												 buffers, nblocks, &access,
+												 &lookup_state,
+												 aux_recovery_read, &pblk);
+		if (run_blocks == 0)
+			return;
+
+		if (run_blocks < nblocks)
+			ioh->handle_data_len = run_blocks;
+		nblocks = run_blocks;
+	}
+
+	/*
+	 * Start I/O using physical addressing but preserve logical identity for
+	 * error reporting and reopen semantics.
+	 */
+	um_startreadv_mapped_physical(ioh, reln, ctx, forknum, blocknum,
+								  buffers, nblocks, pblk,
+								  aux_recovery_read, &access);
+}
+
+static void
+um_claim_write_barrier(SMgrRelation reln, ForkNumber forknum,
+					   UmbraFileContext *ctx, BlockNumber lblkno,
+					   MapInflightBarrier *barrier)
+{
+	int			wait_retries = 0;
+
+	Assert(barrier != NULL);
+	Assert(!barrier->valid);
+
+	for (;;)
+	{
+		if (MapInflightTryClaimBarrier(ctx, reln->smgr_rlocator.locator,
+									   forknum, lblkno, barrier))
+		{
+			if (wait_retries > 0)
+				elog(LOG,
+					 "storage write waited for in-flight remap on relation %u/%u/%u fork %d block %u (%d retries, %d usec)",
+					 reln->smgr_rlocator.locator.spcOid,
+					 reln->smgr_rlocator.locator.dbOid,
+					 reln->smgr_rlocator.locator.relNumber,
+					 forknum, lblkno,
+					 wait_retries,
+					 wait_retries * UMBRA_WRITE_BARRIER_WAIT_USEC);
+			return;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+		pg_usleep(UMBRA_WRITE_BARRIER_WAIT_USEC);
+		wait_retries++;
+
+		if (wait_retries >= UMBRA_WRITE_BARRIER_WAIT_RETRIES)
+			ereport(ERROR,
+					(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+					 errmsg("timed out waiting for in-flight remap of relation %u/%u/%u fork %d block %u",
+							reln->smgr_rlocator.locator.spcOid,
+							reln->smgr_rlocator.locator.dbOid,
+							reln->smgr_rlocator.locator.relNumber,
+							forknum, lblkno)));
+	}
+}
+
+static void
+um_release_write_barriers(MapInflightBarrier *barriers, BlockNumber nbarriers)
+{
+	for (BlockNumber i = 0; i < nbarriers; i++)
+		MapInflightReleaseBarrier(&barriers[i]);
+}
+
+static void
+um_flush_write_barrier_run(UmbraFileContext *ctx, ForkNumber forknum,
+						   BlockNumber run_start_pblk,
+						   const void **buffers,
+						   BlockNumber run_start_idx,
+						   BlockNumber *run_blocks,
+						   bool skipFsync,
+						   MapInflightBarrier *barriers)
+{
+	if (*run_blocks == 0)
+		return;
+
+	umfile_writev(ctx, forknum, run_start_pblk, &buffers[run_start_idx],
+				  *run_blocks, skipFsync);
+	um_release_write_barriers(barriers, *run_blocks);
+	*run_blocks = 0;
+}
+
+void
+umwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+	UmbraAccessState access;
+	UmbraAccessLookupState lookup_state = {0};
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	BlockNumber		materialized_nblocks = 0;
+	BlockNumber		max_extended_pblk = InvalidBlockNumber;
+	BlockNumber		run_start_pblk = InvalidBlockNumber;
+	BlockNumber		run_start_idx = 0;
+	BlockNumber		run_blocks = 0;
+	MapInflightBarrier run_barriers[PG_IOV_MAX];
+	MapInflightBarrier pending_barrier = {0};
+
+	access = um_classify_access(reln, forknum);
+	if (!access.map_available)
+	{
+		umfile_writev(ctx, forknum, blocknum, buffers, nblocks,
+					  skipFsync);
+		return;
+	}
+
+	(void) MapSBlockTryGetPhysicalNblocks(ctx, reln->smgr_rlocator.locator,
+										  forknum, &materialized_nblocks);
+
+	PG_TRY();
+	{
+		for (BlockNumber i = 0; i < nblocks; i++)
+		{
+			BlockNumber lblk = blocknum + i;
+			BlockNumber pblk;
+			UmbraAccessResolveResult resolve PG_USED_FOR_ASSERTS_ONLY;
+			bool		can_extend_run;
+
+			pending_barrier.valid = false;
+			pending_barrier.slot_id = -1;
+			pending_barrier.entry_idx = -1;
+
+			/*
+			 * The barrier serializes physical writes with concurrent remap
+			 * publication for the same logical block.  Claim before lookup so
+			 * a later relocation cannot publish a new mapping while this
+			 * write is still targeting the old physical page.
+			 */
+			um_claim_write_barrier(reln, forknum, ctx, lblk, &pending_barrier);
+
+			resolve = um_resolve_lblk_for_access(reln, forknum, &access,
+												 &lookup_state,
+												 lblk,
+												 UMBRA_ACCESS_RESOLVE_WRITE,
+												 &pblk);
+			Assert(resolve == UMBRA_ACCESS_RESOLVED_PBLK);
+
+			/*
+			 * Checksum identity stays logical (lblk); callers reaching smgrwritev()
+			 * have already set it before Umbra translates to physical blocks.
+			 */
+
+			can_extend_run =
+				(run_blocks > 0) &&
+				(pblk == run_start_pblk + run_blocks) &&
+				(run_blocks < (BlockNumber) lengthof(run_barriers)) &&
+				((run_start_pblk % ((BlockNumber) RELSEG_SIZE)) + run_blocks <
+				 ((BlockNumber) RELSEG_SIZE));
+
+			if (pblk < materialized_nblocks)
+			{
+				if (run_blocks == 0)
+				{
+					run_start_pblk = pblk;
+					run_start_idx = i;
+				}
+				else if (!can_extend_run)
+				{
+					um_flush_write_barrier_run(ctx, forknum, run_start_pblk,
+											   buffers, run_start_idx,
+											   &run_blocks, skipFsync,
+											   run_barriers);
+					run_start_pblk = pblk;
+					run_start_idx = i;
+				}
+
+				Assert(run_blocks < (BlockNumber) lengthof(run_barriers));
+				run_barriers[run_blocks] = pending_barrier;
+				pending_barrier.valid = false;
+				run_blocks++;
+				continue;
+			}
+
+			um_flush_write_barrier_run(ctx, forknum, run_start_pblk,
+									   buffers, run_start_idx,
+									   &run_blocks, skipFsync, run_barriers);
+			run_start_pblk = InvalidBlockNumber;
+
+			umfile_extend(ctx, forknum, pblk, buffers[i], skipFsync);
+			MapInflightReleaseBarrier(&pending_barrier);
+
+			if (max_extended_pblk == InvalidBlockNumber || pblk > max_extended_pblk)
+				max_extended_pblk = pblk;
+			if (materialized_nblocks < pblk + 1)
+				materialized_nblocks = pblk + 1;
+		}
+
+		um_flush_write_barrier_run(ctx, forknum, run_start_pblk,
+								   buffers, run_start_idx,
+								   &run_blocks, skipFsync, run_barriers);
+	}
+	PG_CATCH();
+	{
+		MapInflightReleaseBarrier(&pending_barrier);
+		um_release_write_barriers(run_barriers, run_blocks);
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	if (max_extended_pblk != InvalidBlockNumber)
+		MapSBlockBumpPhysicalNblocks(ctx, reln->smgr_rlocator.locator,
+									 forknum, max_extended_pblk + 1,
+									 InvalidXLogRecPtr);
+
+	if (nblocks > 0)
+	{
+		BlockNumber logical_nblocks PG_USED_FOR_ASSERTS_ONLY;
+
+		logical_nblocks = umnblocks_for_access(reln, forknum, &access);
+		Assert(logical_nblocks >= blocknum + nblocks);
+	}
+}
+
+void
+umwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+			BlockNumber nblocks)
+{
+	UmbraAccessState access;
+	UmbraAccessLookupState lookup_state = {0};
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	BlockNumber		run_start_pblk = InvalidBlockNumber;
+	BlockNumber		run_blocks = 0;
+
+	access = um_classify_access(reln, forknum);
+	if (!access.map_available)
+	{
+		umfile_writeback(ctx, forknum, blocknum, nblocks);
+		return;
+	}
+
+	for (BlockNumber i = 0; i < nblocks; i++)
+	{
+		BlockNumber lblk = blocknum + i;
+		BlockNumber pblk;
+		UmbraAccessResolveResult resolve;
+		bool		can_extend_run;
+
+		resolve = um_resolve_lblk_for_access(reln, forknum, &access,
+											 &lookup_state,
+											 lblk,
+											 UMBRA_ACCESS_RESOLVE_WRITEBACK,
+											 &pblk);
+		if (resolve == UMBRA_ACCESS_RESOLVED_SKIP)
+		{
+			if (run_blocks > 0)
+			{
+				umfile_writeback(ctx, forknum, run_start_pblk, run_blocks);
+				run_start_pblk = InvalidBlockNumber;
+				run_blocks = 0;
+			}
+			continue;
+		}
+
+		Assert(resolve == UMBRA_ACCESS_RESOLVED_PBLK);
+
+		can_extend_run =
+			(run_blocks > 0) &&
+			(pblk == run_start_pblk + run_blocks);
+
+		if (run_blocks == 0)
+		{
+			run_start_pblk = pblk;
+			run_blocks = 1;
+		}
+		else if (can_extend_run)
+		{
+			run_blocks++;
+		}
+		else
+		{
+			umfile_writeback(ctx, forknum, run_start_pblk, run_blocks);
+			run_start_pblk = pblk;
+			run_blocks = 1;
+		}
+	}
+
+	if (run_blocks > 0)
+		umfile_writeback(ctx, forknum, run_start_pblk, run_blocks);
+}
+
+BlockNumber
+umnblocks(SMgrRelation reln, ForkNumber forknum)
+{
+	UmbraAccessState access;
+
+	access = um_classify_access(reln, forknum);
+	return umnblocks_for_access(reln, forknum, &access);
+}
+
+BlockNumber
+umnblocks_cached(SMgrRelation reln, ForkNumber forknum)
+{
+	return reln->smgr_cached_nblocks[forknum];
+}
+
+static BlockNumber
+umnblocks_for_access(SMgrRelation reln, ForkNumber forknum,
+					 const UmbraAccessState *access)
+{
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	BlockNumber nblocks;
+
+	if (!access->map_available)
+		return umfile_nblocks(ctx, forknum, UMFILE_NBLOCKS_DENSE);
+
+	if (MapSBlockTryGetLogicalNblocks(ctx, reln->smgr_rlocator.locator,
+									  forknum, &nblocks))
+		return nblocks;
+
+	ereport(ERROR,
+			(errcode(ERRCODE_DATA_CORRUPTED),
+			 errmsg("missing or invalid MAP superblock for relation %u/%u/%u fork %d",
+					reln->smgr_rlocator.locator.spcOid,
+					reln->smgr_rlocator.locator.dbOid,
+					reln->smgr_rlocator.locator.relNumber,
+					forknum)));
+}
+
+void
+umpretruncate(SMgrRelation reln, ForkNumber forknum,
+			  BlockNumber old_blocks, BlockNumber nblocks,
+			  XLogRecPtr truncate_lsn)
+{
+	UmbraAccessState access;
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+
+	(void) old_blocks;
+	(void) truncate_lsn;
+	access = um_classify_access(reln, forknum);
+
+	if (um_fork_uses_map_translation(forknum) &&
+		(access.policy == UMBRA_MAP_POLICY_REQUIRE_MAP ||
+		 access.map_available))
+		MapPreloadTruncatePages(ctx, reln->smgr_rlocator.locator,
+								forknum, nblocks);
+}
+
+void
+umtruncate(SMgrRelation reln, ForkNumber forknum,
+		   BlockNumber old_blocks, BlockNumber nblocks)
+{
+	UmbraAccessState access;
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+
+	access = um_classify_access(reln, forknum);
+	if (um_fork_uses_map_translation(forknum) &&
+		(access.policy == UMBRA_MAP_POLICY_REQUIRE_MAP ||
+		 access.map_available))
+	{
+		XLogRecPtr	map_lsn;
+
+		map_lsn = InRecovery ?
+			GetXLogReplayRecPtr(NULL) : GetXLogWriteRecPtr();
+
+		MapTruncate(ctx, reln->smgr_rlocator.locator,
+					forknum, nblocks, map_lsn);
+		MapSBlockSetLogicalNblocks(ctx, reln->smgr_rlocator.locator,
+								   forknum, nblocks, map_lsn);
+		MapReleasePreloadedTruncatePages(reln->smgr_rlocator.locator, forknum);
+		return;
+	}
+
+	/* Non-mapped forks (and MAP fork itself) truncate physically. */
+	umfile_truncate(ctx, forknum, old_blocks, nblocks);
+}
+
+void
+umimmedsync(SMgrRelation reln, ForkNumber forknum)
+{
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+
+	if (forknum == UMBRA_METADATA_FORKNUM)
+		MapCheckpointRelation(reln->smgr_rlocator.locator);
+
+	umfile_immedsync(ctx, forknum);
+}
+
+void
+umregistersync(SMgrRelation reln, ForkNumber forknum)
+{
+	umimmedsync(reln, forknum);
+}
+
+bool
+umpreparependingsync(SMgrRelation reln)
+{
+	if (RelFileLocatorSkippingWAL(reln->smgr_rlocator.locator))
+		UmRebuildMapAndSuperblockForSkipWAL(reln);
+
+	return um_relation_requires_durable_sync(reln);
+}
+
+bool
+umneedsrecoveryfsmvacuum(SMgrRelation reln)
+{
+	(void) reln;
+
+	/*
+	 * FSM is not WAL-logged. During replay, Umbra already publishes the
+	 * truncate result through mapped metadata, and auxiliary stale reads are
+	 * handled with EOF-like semantics. Re-running the generic post-truncate
+	 * FSM vacuum step provides only tidy-up value, while forcing recovery to
+	 * walk stale upper-tree pages through the MAIN-oriented buffer model.
+	 *
+	 * Skip that replay-only cleanup and let later foreground FSM maintenance
+	 * refresh upper-level slots naturally.
+	 */
+	return false;
+}
+
+int
+umfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+	/*
+	 * smgrfd() is only used by the AIO reopen path, after the issuer has
+	 * already resolved logical identity to a concrete physical target.
+	 * Interpret blocknum here as a physical block number and reopen the
+	 * corresponding segment/offset directly.
+	 */
+	return umfile_fd(um_ctx_acquire(reln), forknum, blocknum, off);
+}
+
+int
+umsyncfiletag(const FileTag *ftag, char *path)
+{
+	File		fd;
+	int			ret;
+	int			save_errno;
+
+	um_filetag_path(ftag, path);
+
+	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+	if (fd < 0)
+		return -1;
+
+	ret = FileSync(fd, WAIT_EVENT_DATA_FILE_SYNC);
+	save_errno = errno;
+
+	FileClose(fd);
+	errno = save_errno;
+	return ret;
+}
+
+int
+umunlinkfiletag(const FileTag *ftag, char *path)
+{
+	um_filetag_path(ftag, path);
+	return unlink(path);
+}
+
+bool
+umfiletagmatches(const FileTag *ftag, const FileTag *candidate)
+{
+	/*
+	 * Database-scope filter (DROP DATABASE / MOVE DATABASE paths).
+	 */
+	if (ftag->forknum == InvalidForkNumber &&
+		ftag->segno == InvalidBlockNumber &&
+		ftag->rlocator.spcOid == 0 &&
+		ftag->rlocator.relNumber == 0)
+		return ftag->rlocator.dbOid == candidate->rlocator.dbOid;
+
+	/*
+	 * Relation-scope filter: wildcard fork/segment.
+	 */
+	if (ftag->forknum == InvalidForkNumber &&
+		ftag->segno == InvalidBlockNumber)
+		return RelFileLocatorEquals(ftag->rlocator, candidate->rlocator);
+
+	/*
+	 * Fork-scope filter: wildcard segment.
+	 */
+	if (ftag->segno == InvalidBlockNumber)
+		return RelFileLocatorEquals(ftag->rlocator, candidate->rlocator) &&
+			ftag->forknum == candidate->forknum;
+
+	/* Exact file match. */
+	return RelFileLocatorEquals(ftag->rlocator, candidate->rlocator) &&
+		ftag->forknum == candidate->forknum &&
+		ftag->segno == candidate->segno;
+}
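
A note on the run batching used by umzeroextend(), umwritev(), and
umwriteback() in the umbra.c changes above: each of them translates logical
blocks one at a time and then groups adjacent physical blocks so that one
file-level call covers the whole group. The standalone sketch below shows
that grouping in isolation. It is not code from the patch: BlockNo, PblkRun,
SKETCH_SEG_BLOCKS, and coalesce_pblk_runs() are hypothetical names, and the
toy segment size is an assumption chosen so the boundary case is easy to see.
The segment check mirrors the can_extend_run condition in umwritev();
umzeroextend() and umwriteback() as posted check only adjacency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the PostgreSQL definitions. */
typedef uint32_t BlockNo;
#define SKETCH_SEG_BLOCKS 8u	/* toy segment size, far below RELSEG_SIZE */

typedef struct PblkRun
{
	BlockNo		start;		/* first physical block of the run */
	BlockNo		len;		/* number of contiguous blocks in the run */
} PblkRun;

/*
 * Coalesce pblks[0..n) into runs of consecutive physical blocks.  A run is
 * closed when the next block is not adjacent to it, or when appending the
 * block would make the run cross a segment boundary.  Returns the number of
 * runs stored in out[], which must have room for n entries.
 */
static size_t
coalesce_pblk_runs(const BlockNo *pblks, size_t n, PblkRun *out)
{
	size_t		nruns = 0;

	assert(n == 0 || (pblks != NULL && out != NULL));

	for (size_t i = 0; i < n; i++)
	{
		if (nruns > 0)
		{
			PblkRun    *cur = &out[nruns - 1];
			int			adjacent = (pblks[i] == cur->start + cur->len);
			int			same_seg =
				(cur->start % SKETCH_SEG_BLOCKS) + cur->len < SKETCH_SEG_BLOCKS;

			if (adjacent && same_seg)
			{
				cur->len++;
				continue;
			}
		}

		/* Start a new run at this physical block. */
		out[nruns].start = pblks[i];
		out[nruns].len = 1;
		nruns++;
	}

	return nruns;
}
```

With the toy segment size of 8, the input 14, 15, 16, 17 yields two runs,
because block 16 begins a new segment even though it is adjacent to 15.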
diff --git a/src/backend/storage/smgr/umfile.c b/src/backend/storage/smgr/umfile.c
index 17145405cf..63afc8546c 100644
--- a/src/backend/storage/smgr/umfile.c
+++ b/src/backend/storage/smgr/umfile.c
@@ -1620,7 +1620,6 @@ umfile_zeroextend(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknu
 		nblocks -= numblocks;
 		blocknum += numblocks;
 	}
-
 }

 bool
@@ -1832,7 +1831,6 @@ umfile_readv(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
 		buffers += nblocks_this_segment;
 		blocknum += nblocks_this_segment;
 	}
-
 }

 void
@@ -1897,12 +1895,10 @@ umfile_startreadv_physical(PgAioHandle *ioh, UmbraFileContext *ctx,
 	 * Umbra MAP translation enforces single-block I/O via ummaxcombine().
 	 */
 	Assert(nblocks >= 1);
-	{
-		v = umfile_getseg(ctx, ctx->rlocator,
-						  forknum, physical_blocknum, false /* skipFsync */,
-						  UM_EXTENSION_FAIL,
-						  RelFileLocatorBackendIsTemp(ctx->rlocator));
-	}
+	v = umfile_getseg(ctx, ctx->rlocator,
+					  forknum, physical_blocknum, false /* skipFsync */,
+					  UM_EXTENSION_FAIL,
+					  RelFileLocatorBackendIsTemp(ctx->rlocator));

 	seekpos = (off_t) BLCKSZ * (physical_blocknum % ((BlockNumber) RELSEG_SIZE));
 	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
@@ -1926,10 +1922,8 @@ umfile_startreadv_physical(PgAioHandle *ioh, UmbraFileContext *ctx,
 	 * Preserve logical identity for AIO completion reporting and reopen.
 	 * The started I/O uses physical addressing (file/seekpos).
 	 */
-	{
-		ret = FileStartReadV(ioh, v->umfd_vfd, iovcnt, seekpos,
-							 WAIT_EVENT_DATA_FILE_READ);
-	}
+	ret = FileStartReadV(ioh, v->umfd_vfd, iovcnt, seekpos,
+						 WAIT_EVENT_DATA_FILE_READ);
 	if (ret != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -2026,7 +2020,6 @@ umfile_writev(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
 		buffers += nblocks_this_segment;
 		blocknum += nblocks_this_segment;
 	}
-
 }

 void
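
For readers following the physical addressing in
umfile_startreadv_physical() above: the seekpos line turns a physical block
number into a byte offset inside one segment file. The sketch below shows
that arithmetic on its own. It is not code from the patch: SegPos,
pblk_to_segpos(), and the SKETCH_ constants are hypothetical, the constants
assume the default 8 kB block size and 1 GB segment size, and the segment
number derivation assumes umfile_getseg() divides by RELSEG_SIZE the way
md.c does, which the diff excerpt does not show.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants; real builds take these from configure options. */
#define SKETCH_BLCKSZ 8192u
#define SKETCH_RELSEG_SIZE 131072u	/* blocks per 1 GB segment at 8 kB pages */

typedef struct SegPos
{
	uint32_t	segno;		/* which segment file holds the block */
	uint64_t	seekpos;	/* byte offset inside that segment file */
} SegPos;

/* Split a physical block number into segment number and byte offset. */
static SegPos
pblk_to_segpos(uint32_t pblk)
{
	SegPos		pos;

	pos.segno = pblk / SKETCH_RELSEG_SIZE;
	pos.seekpos = (uint64_t) SKETCH_BLCKSZ * (pblk % SKETCH_RELSEG_SIZE);

	/* Same invariant as the Assert that follows the seekpos line. */
	assert(pos.seekpos < (uint64_t) SKETCH_BLCKSZ * SKETCH_RELSEG_SIZE);
	return pos;
}
```

The logical block number plays no part in this calculation; in the patch it
is carried alongside only for completion reporting and reopen.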
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index e19f0d3e51..1b9cc0ecf3 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3873,7 +3873,17 @@ RelationSetNewRelfilenumber(Relation relation, char persistence)
 		SMgrRelation srel;

 		srel = RelationCreateStorage(newrlocator, persistence, true);
-		smgrclose(srel);
+
+		/*
+		 * Keep the newly created storage handle as the relation's default
+		 * smgr binding. The creator has already seeded the correct MAP policy
+		 * on this handle, and subsequent writes in the same command should not
+		 * fall back to shape-based reopening.
+		 */
+		if (relation->rd_smgr != NULL)
+			RelationCloseSmgr(relation);
+		relation->rd_smgr = srel;
+		smgrpin(srel);
 	}
 	else
 	{
diff --git a/src/include/storage/aio_types.h b/src/include/storage/aio_types.h
index 17b59aeed7..d07911673c 100644
--- a/src/include/storage/aio_types.h
+++ b/src/include/storage/aio_types.h
@@ -63,7 +63,8 @@ typedef union PgAioTargetData
 	struct
 	{
 		RelFileLocator rlocator;	/* physical relation identifier */
-		BlockNumber blockNum;	/* blknum relative to begin of reln */
+		BlockNumber blockNum;	/* logical blknum relative to begin of reln */
+		BlockNumber physBlockNum;	/* physical blknum for executing this IO */
 		BlockNumber nblocks;
 		ForkNumber	forkNum:8;	/* don't waste 4 byte for four values */
 		bool		is_temp:1;	/* proc can be inferred by owning AIO */
diff --git a/src/include/storage/map.h b/src/include/storage/map.h
index b4f6063f35..ccbc392835 100644
--- a/src/include/storage/map.h
+++ b/src/include/storage/map.h
@@ -77,6 +77,9 @@ typedef struct MapPage
 	uint32		pblknos[MAP_ENTRIES_PER_PAGE];
 } MapPage;

+#define MAP_PENDING_BITS_PER_WORD 64
+#define MAP_PENDING_BITMAP_WORDS \
+	((MAP_ENTRIES_PER_PAGE + MAP_PENDING_BITS_PER_WORD - 1) / MAP_PENDING_BITS_PER_WORD)

 /* Shared memory control structure */
 typedef struct MapSharedData
@@ -109,12 +112,20 @@ typedef struct MapBufferDesc
 	XLogRecPtr	page_lsn;	/* LSN of last modification */
 	int			id;			/* slot ID */
 	pg_atomic_uint32 state; /* state flags */
+	uint32		pending_count;	/* in-flight remaps protected by pending_bits */
+	uint64		pending_bits[MAP_PENDING_BITMAP_WORDS];
 	int			freeNext;	/* next buffer in free list */
 	int			wait_backend_pid;	/* backend PID of pin-count waiter */
 	LWLock		buffer_lock;		/* lock for buffer content access */
 	LWLock		io_in_progress_lock; /* lock for buffer I/O state */
 } MapBufferDesc;

+typedef struct MapInflightBarrier
+{
+	bool		valid;
+	int			slot_id;
+	int			entry_idx;
+} MapInflightBarrier;

 extern void MapBackendInit(void);
 extern const ShmemCallbacks MapShmemCallbacks;
@@ -128,10 +139,34 @@ extern BlockNumber MapTryLookupPblkRun(UmbraFileContext *map_ctx,
 									   ForkNumber forknum,
 									   BlockNumber lblkno,
 									   BlockNumber maxblocks,
-									   BlockNumber *start_pblkno);/* Buffer management */
+									   BlockNumber *start_pblkno);
+extern bool MapReserveFreshPblkno(UmbraFileContext *map_ctx,
+								  RelFileLocator rnode,
+								  ForkNumber forknum,
+								  BlockNumber lblkno,
+								  BlockNumber *new_pblkno);
+extern bool MapInflightLookupOwnedPblk(RelFileLocator rnode, ForkNumber forknum,
+									   BlockNumber lblkno, BlockNumber *pblkno);
+extern bool MapInflightTryClaimBarrier(UmbraFileContext *map_ctx,
+									   RelFileLocator rnode,
+									   ForkNumber forknum,
+									   BlockNumber lblkno,
+									   MapInflightBarrier *barrier);
+extern void MapInflightReleaseBarrier(MapInflightBarrier *barrier);
+extern void MapInflightRelease(RelFileLocator rnode, ForkNumber forknum,
+							   BlockNumber lblkno);
+/* Buffer management used by direct mapping publication helpers. */
 extern int	MapReadBuffer(UmbraFileContext *map_ctx, RelFileLocator rnode,
 						  ForkNumber forknum, BlockNumber map_blkno);

+/* Mapping publication helpers. */
+extern void MapGetNewPbkno(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						   ForkNumber forknum, BlockNumber lblkno,
+						   BlockNumber *new_pblkno, BlockNumber *old_pblkno);
+extern void MapSetMapping(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						  ForkNumber forknum, BlockNumber lblkno,
+						  BlockNumber new_pblkno, XLogRecPtr map_lsn);
+
 /* MAP superblock helpers */
 extern void MapSBlockInit(UmbraFileContext *map_ctx, RelFileLocator rnode,
 						  XLogRecPtr map_lsn);
diff --git a/src/include/storage/map_internal.h b/src/include/storage/map_internal.h
index 8a2ee89deb..acac29b018 100644
--- a/src/include/storage/map_internal.h
+++ b/src/include/storage/map_internal.h
@@ -21,8 +21,31 @@ extern bool MapStartBufferIO(MapBufferDesc *buf, uint32 required_bits);
 extern void MapTerminateBufferIO(MapBufferDesc *buf, bool clear_dirty,
 								 uint32 set_flag_bits);
 extern void MapFlushBuffer(int slot_id);
+extern void MapInflightCleanupOwned(void);
+extern void MapInflightBackendInit(void);
 extern void MapResetAllTruncatePreloads(void);
 extern BlockNumber MapForkPageIndexToMapBlkno(ForkNumber forknum,
 											  BlockNumber fork_page_idx);
 extern BlockNumber MapLblknoToMapBlkno(ForkNumber forknum, BlockNumber lblkno);
+extern bool MapReserveNextPblkno(UmbraFileContext *map_ctx, RelFileLocator rnode,
+								 ForkNumber forknum, BlockNumber lblkno,
+								 BlockNumber *new_pblkno, bool nowait);
+extern bool MapTryReserveFreshPblkno(UmbraFileContext *map_ctx,
+									 RelFileLocator rnode,
+									 ForkNumber forknum,
+									 BlockNumber lblkno,
+									 BlockNumber *new_pblkno,
+									 bool nowait);
+extern bool MapInflightTryClaim(UmbraFileContext *map_ctx,
+								RelFileLocator rnode,
+								ForkNumber forknum,
+								BlockNumber lblkno);
+extern void MapInflightFinishClaim(RelFileLocator rnode,
+								   ForkNumber forknum,
+								   BlockNumber lblkno,
+								   BlockNumber pblkno);
+extern bool MapInflightBitIsSet(RelFileLocator rnode,
+								ForkNumber forknum,
+								BlockNumber lblkno);
+
 #endif							/* MAP_INTERNAL_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 47dbf12643..b7f95ed5d3 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -14,6 +14,7 @@
 #ifndef SMGR_H
 #define SMGR_H

+#include "access/xlogdefs.h"
 #include "lib/ilist.h"
 #include "storage/aio_types.h"
 #include "storage/block.h"
@@ -21,11 +22,11 @@

 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
- * cached storage-manager handles for a relation.  An SMgrRelation is created
- * (if not already present) by smgropen(), and destroyed by smgrdestroy().
- * Note that neither of these operations imply I/O, they just create or destroy
- * a hashtable entry.  (But smgrdestroy() may release associated resources,
- * such as OS-level file descriptors.)
+ * cached file handles.  An SMgrRelation is created (if not already present)
+ * by smgropen(), and destroyed by smgrdestroy().  Note that neither of these
+ * operations imply I/O, they just create or destroy a hashtable entry.  (But
+ * smgrdestroy() may release associated resources, such as OS-level file
+ * descriptors.)
  *
  * An SMgrRelation may be "pinned", to prevent it from being destroyed while
  * it's in use.  We use this to prevent pointers in relcache to smgr from being
@@ -113,18 +114,34 @@ extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
 						  BlockNumber blocknum, BlockNumber nblocks);
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
 extern BlockNumber smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum);
+extern void smgrbumpcachednblocks(SMgrRelation reln, ForkNumber forknum,
+								  BlockNumber nblocks);
+extern bool smgrisinternalfork(ForkNumber forknum);
 extern void smgrcreaterelationmetadata(SMgrRelation reln);
 extern void smgrcopyrelationmetadata(SMgrRelation src, SMgrRelation dst,
 									 char relpersistence);
 extern void smgrsyncrelationmetadata(SMgrRelation reln);
 extern void smgrunlinkrelationmetadata(RelFileLocatorBackend rlocator,
 									   bool isRedo);
+extern void smgrsetmapstate(SMgrRelation reln, uint8 map_state);
 extern bool smgrcreatedballowswallog(void);
+extern void smgrinitnewrelation(SMgrRelation reln, bool needs_wal);
+extern void smgrredocreatefork(SMgrRelation reln, ForkNumber forknum,
+							   XLogRecPtr lsn);
 extern void smgrcheckpointdatabasetablespaces(Oid dbid, int ntablespaces,
 											  const Oid *tablespace_ids);
 extern void smgrinvalidatedatabasetablespaces(Oid dbid, int ntablespaces,
 											  const Oid *tablespace_ids);
 extern void smgrinvalidatedatabase(Oid dbid);
+extern void smgrregistershutdowncleanup(void);
+extern void smgrmarkskipwalpending(RelFileLocator rlocator);
+extern void smgrclearskipwalpending(RelFileLocator rlocator);
+extern bool smgrpreparependingsync(SMgrRelation reln);
+extern bool smgrneedsrecoveryfsmvacuum(SMgrRelation reln);
+extern void smgrpretruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
+							BlockNumber *old_nblocks,
+							BlockNumber *nblocks,
+							XLogRecPtr truncate_lsn);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 						 BlockNumber *old_nblocks,
 						 BlockNumber *nblocks);
@@ -150,7 +167,8 @@ smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 extern void pgaio_io_set_target_smgr(PgAioHandle *ioh,
 									 SMgrRelationData *smgr,
 									 ForkNumber forknum,
-									 BlockNumber blocknum,
+									 BlockNumber logical_blocknum,
+									 BlockNumber physical_blocknum,
 									 int nblocks,
 									 bool skip_fsync);

diff --git a/src/include/storage/um_defs.h b/src/include/storage/um_defs.h
index 3b567a397e..b7ad5f4490 100644
--- a/src/include/storage/um_defs.h
+++ b/src/include/storage/um_defs.h
@@ -1,14 +1,14 @@
 /*-------------------------------------------------------------------------
  *
  * um_defs.h
- *	  Umbra low-level fork and metadata path definitions.
+ *    Umbra low-level fork and metadata path definitions.
  *
- * This header contains storage-layout facts shared by Umbra submodules.
- *
- * src/include/storage/um_defs.h
+ * This header intentionally contains only storage-layout facts shared by
+ * Umbra submodules. Higher-level MAP policy stays in umbra.h/umbra.c.
  *
  *-------------------------------------------------------------------------
  */
+
 #ifndef UM_DEFS_H
 #define UM_DEFS_H

@@ -16,13 +16,15 @@

 #include "common/relpath.h"
 #include "storage/relfilelocator.h"
+#include "storage/smgr.h"

 /*
- * Umbra reserves an extra fork slot for relation-local metadata.  This lives
- * outside PostgreSQL's built-in fork numbering so ordinary smgr loops do not
- * try to process it implicitly.
+ * Umbra internal metadata fork numbering.
+ *
+ * The numeric value still matches the historical MAP slot, but the definition
+ * lives here so low-level file/map code does not depend on umbra.h.
  */
-#define UMBRA_METADATA_FORKNUM	((ForkNumber) (INIT_FORKNUM + 1))
+#define UMBRA_METADATA_FORKNUM	((int) INIT_FORKNUM + 1)
 #define UMBRA_FORK_SLOTS		(UMBRA_METADATA_FORKNUM + 1)

 static inline RelPathStr
diff --git a/src/include/storage/umbra.h b/src/include/storage/umbra.h
index b41fae75ea..0702f7b392 100644
--- a/src/include/storage/umbra.h
+++ b/src/include/storage/umbra.h
@@ -1,25 +1,61 @@
 /*-------------------------------------------------------------------------
  *
  * umbra.h
- *	  Umbra storage manager public interface declarations.
+ *    Umbra storage manager public interface declarations.
  *
  * This header declares the Umbra smgr callback surface used by smgr.c when
  * the build is configured with --with-umbra.
  *
- * src/include/storage/umbra.h
- *
  *-------------------------------------------------------------------------
  */
+
 #ifndef UMBRA_H
 #define UMBRA_H

 #include "storage/aio_types.h"
 #include "storage/block.h"
+#include "common/relpath.h"
 #include "storage/relfilelocator.h"
 #include "storage/smgr.h"
 #include "storage/sync.h"
 #include "storage/um_defs.h"

+/*
+ * Umbra MAP policy.
+ *
+ * This is the handle-local Umbra access state. Mapped forks must not silently
+ * fall back to direct physical addressing unless upper layers explicitly allow
+ * it.
+ */
+typedef enum UmbraMapPolicy
+{
+	UMBRA_MAP_POLICY_UNKNOWN = 0,
+	UMBRA_MAP_POLICY_BYPASS_MAP,
+	UMBRA_MAP_POLICY_SKIP_WAL_PENDING_MAP,
+	UMBRA_MAP_POLICY_REQUIRE_MAP,
+} UmbraMapPolicy;
+
+/*
+ * Umbra keeps MAIN/FSM/VM under mapping translation, but only MAIN uses the
+ * more involved page-WAL-owned first-born protocol. FSM/VM are auxiliary
+ * mapped forks with a more explicit producer set and a simpler traced-extend
+ * model.
+ */
+static inline bool
+UmbraForkUsesMapTranslation(ForkNumber forknum)
+{
+	return (forknum == MAIN_FORKNUM ||
+			forknum == FSM_FORKNUM ||
+			forknum == VISIBILITYMAP_FORKNUM);
+}
+
+static inline bool
+UmbraForkIsAuxiliaryMapped(ForkNumber forknum)
+{
+	return (forknum == FSM_FORKNUM ||
+			forknum == VISIBILITYMAP_FORKNUM);
+}
+
 extern bool UmMetadataExists(SMgrRelation reln);
 extern bool UmMetadataOpenOrCreate(SMgrRelation reln, bool isRedo, bool *created);
 extern BlockNumber UmMetadataNblocks(SMgrRelation reln);
@@ -31,20 +67,28 @@ extern void UmMetadataWriteSuperblock(RelFileLocatorBackend rlocator,
 extern void UmMetadataExtend(SMgrRelation reln, BlockNumber blkno,
 							 const void *buffer, bool skipFsync);
 extern void UmMetadataImmediateSync(SMgrRelation reln);
+extern void UmMetadataRegisterSync(SMgrRelation reln);
 extern void UmMetadataUnlink(RelFileLocatorBackend rlocator, bool isRedo);
-extern void UmInvalidateDatabase(Oid dbid);

+/* Umbra storage manager functionality (smgr callbacks). */
 extern void uminit(void);
+extern void umbeforeshmemexitcleanup(void);
 extern void umopen(SMgrRelation reln);
 extern void umclose(SMgrRelation reln, ForkNumber forknum);
 extern void umdestroy(SMgrRelation reln);
-extern bool umisinternalfork(ForkNumber forknum);
 extern bool umcreatedballowswallog(void);
+extern void uminitnewrelation(SMgrRelation reln, bool needs_wal);
+extern void umsetmapstate(SMgrRelation reln, uint8 map_state);
+extern void ummarkskipwalpending(SMgrRelation reln);
+extern void umclearskipwalpending(SMgrRelation reln);
+extern bool umisinternalfork(ForkNumber forknum);
 extern void umcreaterelationmetadata(SMgrRelation reln);
+extern void umredocreatefork(SMgrRelation reln, ForkNumber forknum,
+							 XLogRecPtr lsn);
 extern void umcheckpointdatabasetablespaces(Oid dbid, int ntablespaces,
 											const Oid *tablespace_ids);
 extern void uminvalidatedatabasetablespaces(Oid dbid, int ntablespaces,
-											const Oid *tablespace_ids);
+											 const Oid *tablespace_ids);
 extern void umcopyrelationmetadata(SMgrRelation src, SMgrRelation dst,
 								   char relpersistence);
 extern void umsyncrelationmetadata(SMgrRelation reln);
@@ -55,8 +99,17 @@ extern bool umexists(SMgrRelation reln, ForkNumber forknum);
 extern void umunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
 extern void umextend(SMgrRelation reln, ForkNumber forknum,
 					 BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern bool umapplyreservedrange(SMgrRelation reln, ForkNumber forknum,
+								 BlockNumber firstblock, BlockNumber nblocks,
+								 const BlockNumber *pblknos,
+								 XLogRecPtr lsn, bool skipFsync);
 extern void umzeroextend(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum, int nblocks, bool skipFsync);
+extern void UmApplyReservedRangeRemap(SMgrRelation reln, ForkNumber forknum,
+									  BlockNumber firstblock, BlockNumber nblocks,
+									  const BlockNumber *pblknos,
+									  XLogRecPtr lsn, bool skipFsync);
+extern void UmRebuildMapAndSuperblockForSkipWAL(SMgrRelation reln);
 extern bool umprefetch(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum, int nblocks);
 extern uint32 ummaxcombine(SMgrRelation reln, ForkNumber forknum,
@@ -64,22 +117,59 @@ extern uint32 ummaxcombine(SMgrRelation reln, ForkNumber forknum,
 extern void umreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 					void **buffers, BlockNumber nblocks);
 extern void umstartreadv(PgAioHandle *ioh,
-						 SMgrRelation reln, ForkNumber forknum,
-						 BlockNumber blocknum,
+						 SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 						 void **buffers, BlockNumber nblocks);
 extern void umwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 					 const void **buffers, BlockNumber nblocks, bool skipFsync);
 extern void umwriteback(SMgrRelation reln, ForkNumber forknum,
 						BlockNumber blocknum, BlockNumber nblocks);
-extern BlockNumber umnblocks(SMgrRelation reln, ForkNumber forknum);
+extern void umpretruncate(SMgrRelation reln, ForkNumber forknum,
+						  BlockNumber old_blocks, BlockNumber nblocks,
+						  XLogRecPtr truncate_lsn);
 extern void umtruncate(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber old_blocks, BlockNumber nblocks);
 extern void umimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void umregistersync(SMgrRelation reln, ForkNumber forknum);
-extern int	umfd(SMgrRelation reln, ForkNumber forknum,
-				 BlockNumber blocknum, uint32 *off);
+extern bool umpreparependingsync(SMgrRelation reln);
+extern bool umneedsrecoveryfsmvacuum(SMgrRelation reln);
+extern int umfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
 extern int umsyncfiletag(const FileTag *ftag, char *path);
 extern int umunlinkfiletag(const FileTag *ftag, char *path);
 extern bool umfiletagmatches(const FileTag *ftag, const FileTag *candidate);

+/*
+ * Runtime semantic helpers.
+ *
+ * These consume Umbra access semantics (bypass/require-map/skip-pending) and
+ * expose the runtime answers upper layers use directly.
+ */
+extern BlockNumber umphysicalblock(SMgrRelation reln, ForkNumber forknum,
+								  BlockNumber lblkno);
+extern BlockNumber umnblocks(SMgrRelation reln, ForkNumber forknum);
+extern BlockNumber umnblocks_cached(SMgrRelation reln, ForkNumber forknum);
+
+/*
+ * MAP fact / mutation helpers used by WAL and replay code.
+ *
+ * These expose mapping facts and mapping-state updates only. Runtime read-miss
+ * interpretation stays in umbra.c.
+ */
+extern void UmMapGetNewPbkno(SMgrRelation reln, ForkNumber forknum,
+							 BlockNumber lblkno, BlockNumber *new_pblkno,
+							 BlockNumber *old_pblkno);
+extern void UmMapReserveFreshPbkno(SMgrRelation reln, ForkNumber forknum,
+								   BlockNumber lblkno,
+								   BlockNumber *new_pblkno);
+extern bool UmMapAccessAvailable(SMgrRelation reln, ForkNumber forknum);
+extern bool UmWalOwnedRemapAvailable(SMgrRelation reln, ForkNumber forknum);
+extern bool UmWalOwnedFirstbornAvailable(SMgrRelation reln, ForkNumber forknum,
+										 BlockNumber lblkno);
+extern bool UmMapTryLookupPblkno(SMgrRelation reln, ForkNumber forknum,
+								 BlockNumber lblkno, BlockNumber *pblkno);
+extern bool UmMapIsLogicalUnmaterialized(SMgrRelation reln, ForkNumber forknum,
+										 BlockNumber lblkno);
+extern void UmMapSetMapping(SMgrRelation reln, ForkNumber forknum,
+							BlockNumber lblkno, BlockNumber new_pblkno,
+							XLogRecPtr map_lsn);
+
 #endif							/* UMBRA_H */
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 0cbdf133ca..0abe8ff1a1 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -63,6 +63,7 @@ tests += {
       't/052_checkpoint_segment_missing.pl',
       't/053_umbra_map_superblock_watermark.pl',
       't/054_umbra_map_fork_policy.pl',
+      't/061_umbra_fsm_vm_map_translation.pl',
       't/063_umbra_mainfork_head_unlink_checkpoint.pl',
     ],
   },
diff --git a/src/test/recovery/t/061_umbra_fsm_vm_map_translation.pl b/src/test/recovery/t/061_umbra_fsm_vm_map_translation.pl
new file mode 100644
index 0000000000..607afcb01e
--- /dev/null
+++ b/src/test/recovery/t/061_umbra_fsm_vm_map_translation.pl
@@ -0,0 +1,117 @@
+# Verify that the FSM and VM forks participate in UMBRA MAP translation.
+#
+# In UMBRA mode, check that logical block 0 of both FSM and VM has a valid
+# mapping entry in the relation's MAP fork (entry != 0xFFFFFFFF).
+#
+# In md mode, the MAP fork does not exist and the test is skipped.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+my $node = PostgreSQL::Test::Cluster->new('master');
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq{
+autovacuum = off
+});
+$node->start();
+
+$node->safe_psql(
+	'postgres', q{
+CREATE TABLE umb_fsm_vm_t(id int, payload text);
+INSERT INTO umb_fsm_vm_t
+SELECT g, repeat('x', 2000) FROM generate_series(1, 15000) g;
+VACUUM (FREEZE, ANALYZE) umb_fsm_vm_t;
+});
+
+my $fsm_size = $node->safe_psql(
+	'postgres', q{
+SELECT pg_relation_size('umb_fsm_vm_t', 'fsm');
+});
+my $vm_size = $node->safe_psql(
+	'postgres', q{
+SELECT pg_relation_size('umb_fsm_vm_t', 'vm');
+});
+
+cmp_ok($fsm_size, '>', 0, 'FSM fork has at least one page');
+cmp_ok($vm_size, '>', 0, 'VM fork has at least one page');
+
+# Under the current proportional MAP layout:
+#   block 0 = superblock
+#   block 1 = FSM map page 0
+#   block 2 = VM map page 0
+my $fsm_map_page = 1;
+my $vm_map_page = 2;
+
+my $fsm_map_entry = $node->safe_psql(
+	'postgres', qq{
+SELECT COALESCE(
+	encode(
+		pg_read_binary_file(
+			pg_relation_filepath('umb_fsm_vm_t') || '_map',
+			current_setting('block_size')::int * $fsm_map_page,
+			4,
+			true),
+		'hex'),
+	'');
+});
+
+my $vm_map_entry = $node->safe_psql(
+	'postgres', qq{
+SELECT COALESCE(
+	encode(
+		pg_read_binary_file(
+			pg_relation_filepath('umb_fsm_vm_t') || '_map',
+			current_setting('block_size')::int * $vm_map_page,
+			4,
+			true),
+		'hex'),
+	'');
+});
+
+isnt($fsm_map_entry, '', 'FSM map entry is readable');
+isnt($vm_map_entry, '', 'VM map entry is readable');
+isnt($fsm_map_entry, 'ffffffff', 'FSM block 0 has a valid map entry');
+isnt($vm_map_entry, 'ffffffff', 'VM block 0 has a valid map entry');
+
+$node->stop('immediate');
+$node->start();
+
+my $fsm_map_entry_after_restart = $node->safe_psql(
+	'postgres', qq{
+SELECT COALESCE(
+	encode(
+		pg_read_binary_file(
+			pg_relation_filepath('umb_fsm_vm_t') || '_map',
+			current_setting('block_size')::int * $fsm_map_page,
+			4,
+			true),
+		'hex'),
+	'');
+});
+
+my $vm_map_entry_after_restart = $node->safe_psql(
+	'postgres', qq{
+SELECT COALESCE(
+	encode(
+		pg_read_binary_file(
+			pg_relation_filepath('umb_fsm_vm_t') || '_map',
+			current_setting('block_size')::int * $vm_map_page,
+			4,
+			true),
+		'hex'),
+	'');
+});
+
+isnt($fsm_map_entry_after_restart, 'ffffffff',
+	'FSM block 0 map entry survives restart');
+isnt($vm_map_entry_after_restart, 'ffffffff',
+	'VM block 0 map entry survives restart');
+
+done_testing();
-- 
2.50.1 (Apple Git-155)
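
Reading aid for the pending bitmap added to MapBufferDesc in map.h above
(pending_count plus pending_bits[MAP_PENDING_BITMAP_WORDS]). What follows is
only a minimal single-threaded sketch of the sizing macro and of
claim/test/release on a per-entry bit. It is not the patch's implementation:
the DEMO_* names and the value 2040 for entries per page are made up for
illustration, and the real code presumably does this under the MAP buffer
locks, which the sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: illustrative value only; the real MAP_ENTRIES_PER_PAGE lives in map.h. */
#define DEMO_ENTRIES_PER_PAGE 2040
#define DEMO_BITS_PER_WORD 64
/* Ceiling division, same shape as MAP_PENDING_BITMAP_WORDS in the patch. */
#define DEMO_BITMAP_WORDS \
	((DEMO_ENTRIES_PER_PAGE + DEMO_BITS_PER_WORD - 1) / DEMO_BITS_PER_WORD)

/* Hypothetical stand-in for the two fields added to MapBufferDesc. */
typedef struct DemoPending
{
	uint32_t	pending_count;	/* number of bits currently set */
	uint64_t	pending_bits[DEMO_BITMAP_WORDS];
} DemoPending;

/* Is the bit for this map entry set? */
bool
demo_pending_test(const DemoPending *p, int entry_idx)
{
	assert(entry_idx >= 0 && entry_idx < DEMO_ENTRIES_PER_PAGE);
	return (p->pending_bits[entry_idx / DEMO_BITS_PER_WORD] >>
			(entry_idx % DEMO_BITS_PER_WORD)) & 1;
}

/* Claim an entry; returns false if somebody already holds it. */
bool
demo_pending_try_claim(DemoPending *p, int entry_idx)
{
	if (demo_pending_test(p, entry_idx))
		return false;
	p->pending_bits[entry_idx / DEMO_BITS_PER_WORD] |=
		UINT64_C(1) << (entry_idx % DEMO_BITS_PER_WORD);
	p->pending_count++;
	return true;
}

/* Release a previously claimed entry. */
void
demo_pending_release(DemoPending *p, int entry_idx)
{
	assert(demo_pending_test(p, entry_idx));
	p->pending_bits[entry_idx / DEMO_BITS_PER_WORD] &=
		~(UINT64_C(1) << (entry_idx % DEMO_BITS_PER_WORD));
	p->pending_count--;
}
```

Keeping a separate counter next to the bitmap lets a caller answer "is
anything in flight on this MAP page?" without scanning every word. Whether the
patch uses pending_count that way is a guess from the field comment.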

Mingwei Jia

i@nayishan.top

22 days ago

In reply to: Mingwei Jia (#4)

[RFC PATCH v2 RESEND 07/10] umbra: add patch 6 WAL records, mapped birth, and redo state machine

---
src/backend/access/rmgrdesc/Makefile | 5 +
src/backend/access/rmgrdesc/meson.build | 6 +
src/backend/access/rmgrdesc/umbradesc.c | 81 ++++
src/backend/access/rmgrdesc/xlogdesc.c | 1 +
src/backend/access/transam/Makefile | 5 +
src/backend/access/transam/meson.build | 6 +
src/backend/access/transam/rmgr.c | 3 +
src/backend/access/transam/umbra_xlog.c | 227 ++++++++++
src/backend/access/transam/xlogutils.c | 400 ++++++++++++++++--
src/backend/storage/map/mapsuper.c | 3 +
src/backend/storage/smgr/umbra.c | 50 ++-
src/bin/pg_waldump/rmgrdesc.c | 3 +
src/include/access/rmgrlist.h | 3 +
src/include/access/umbra_xlog.h | 58 +++
src/include/access/xloginsert.h | 4 +
src/include/access/xlogrecord.h | 16 +
src/include/access/xlogutils.h | 3 +
src/include/catalog/storage.h | 1 +
src/test/recovery/meson.build | 4 +
.../t/056_umbra_truncate_superblock.pl | 82 ++++
.../t/062_umbra_truncate_drop_crash_matrix.pl | 108 +++++
.../recovery/t/066_umbra_truncate_redo.pl | 64 +++
.../t/071_umbra_skip_wal_dense_map.pl | 65 +++
23 files changed, 1147 insertions(+), 51 deletions(-)
create mode 100644 src/backend/access/rmgrdesc/umbradesc.c
create mode 100644 src/backend/access/transam/umbra_xlog.c
create mode 100644 src/include/access/umbra_xlog.h
create mode 100644 src/test/recovery/t/056_umbra_truncate_superblock.pl
create mode 100644 src/test/recovery/t/062_umbra_truncate_drop_crash_matrix.pl
create mode 100644 src/test/recovery/t/066_umbra_truncate_redo.pl
create mode 100644 src/test/recovery/t/071_umbra_skip_wal_dense_map.pl
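
Reading aid for the XLOG_UMBRA_MAP_SET branch of umbra_redo() in
umbra_xlog.c in this patch. The sketch below is a toy in-memory model of the
ordering the redo comments describe: copy the old physical page first if it
exists, then switch the mapping and raise the next-free watermark, and for a
first mapping zero-fill the new page and raise the logical size. Every Toy*
name and the tiny page and fork sizes are hypothetical. The "bump" helpers
are assumed to be monotonic (they only ever raise the stored value), which is
inferred from their names rather than confirmed from the source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 16
#define TOY_MAX_BLOCKS 8
#define TOY_INVALID_BLOCK UINT32_MAX	/* stands in for InvalidBlockNumber */

/* Hypothetical model of one mapped fork plus its superblock counters. */
typedef struct ToyFork
{
	uint8_t		pages[TOY_MAX_BLOCKS][TOY_PAGE_SIZE];	/* physical blocks */
	uint32_t	phys_nblocks;		/* physical blocks materialized so far */
	uint32_t	map[TOY_MAX_BLOCKS];	/* lblk -> pblk */
	uint32_t	next_free_pblk;		/* superblock allocation watermark */
	uint32_t	logical_nblocks;	/* superblock logical size */
} ToyFork;

void
toy_fork_init(ToyFork *f)
{
	memset(f, 0, sizeof(*f));
	for (int i = 0; i < TOY_MAX_BLOCKS; i++)
		f->map[i] = TOY_INVALID_BLOCK;
}

/* Write a page at pblk (zeros when src is NULL) and grow the fork if needed. */
static void
toy_materialize(ToyFork *f, uint32_t pblk, const uint8_t *src)
{
	assert(pblk < TOY_MAX_BLOCKS);
	if (src != NULL)
		memmove(f->pages[pblk], src, TOY_PAGE_SIZE);
	else
		memset(f->pages[pblk], 0, TOY_PAGE_SIZE);
	if (f->phys_nblocks < pblk + 1)
		f->phys_nblocks = pblk + 1;
}

/* Replay one MAP_SET(lblk, old, new); redo never allocates on its own. */
void
toy_redo_map_set(ToyFork *f, uint32_t lblk, uint32_t old_pblk, uint32_t new_pblk)
{
	assert(lblk < TOY_MAX_BLOCKS && new_pblk < TOY_MAX_BLOCKS);

	/* Relocation: copy the old physical page first, if it still exists. */
	if (old_pblk != TOY_INVALID_BLOCK && old_pblk < f->phys_nblocks)
		toy_materialize(f, new_pblk, f->pages[old_pblk]);

	/* Switch the mapping and raise the allocation watermark. */
	f->map[lblk] = new_pblk;
	if (f->next_free_pblk < new_pblk + 1)
		f->next_free_pblk = new_pblk + 1;

	/* First mapping: make sure the page exists and raise the logical size. */
	if (old_pblk == TOY_INVALID_BLOCK)
	{
		toy_materialize(f, new_pblk, NULL);
		if (f->logical_nblocks < lblk + 1)
			f->logical_nblocks = lblk + 1;
	}
}
```

Because the chosen physical block travels in the record, replaying the same
record twice in this model ends in the same mapping and counters, which is
the determinism property the comment on log_umbra_map_set() is after.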

diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index cd95eec37f..4e9a52d8d3 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -32,4 +32,9 @@ OBJS = \
 	xactdesc.o \
 	xlogdesc.o

+ifeq ($(with_umbra), yes)
+OBJS += \
+	umbradesc.o
+endif
+
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/meson.build b/src/backend/access/rmgrdesc/meson.build
index d9000ccd9f..f70cdbb587 100644
--- a/src/backend/access/rmgrdesc/meson.build
+++ b/src/backend/access/rmgrdesc/meson.build
@@ -26,4 +26,10 @@ rmgr_desc_sources = files(
   'xlogdesc.c',
 )

+if get_option('umbra').enabled()
+  rmgr_desc_sources += files(
+    'umbradesc.c',
+  )
+endif
+
 backend_sources += rmgr_desc_sources
diff --git a/src/backend/access/rmgrdesc/umbradesc.c b/src/backend/access/rmgrdesc/umbradesc.c
new file mode 100644
index 0000000000..6bad4bb38e
--- /dev/null
+++ b/src/backend/access/rmgrdesc/umbradesc.c
@@ -0,0 +1,81 @@
+/*-------------------------------------------------------------------------
+ *
+ * umbradesc.c
+ *	  rmgr descriptor routines for Umbra MAP WAL records
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/umbra_xlog.h"
+#include "common/relpath.h"
+#include "storage/um_defs.h"
+
+static RelPathStr
+umbra_metadata_relpath(RelFileLocator rlocator)
+{
+	RelPathStr	base;
+	RelPathStr	path;
+
+	base = relpathperm(rlocator, MAIN_FORKNUM);
+	snprintf(path.str, sizeof(path.str), "%s_map", base.str);
+	return path;
+}
+
+static RelPathStr
+umbra_fork_relpath(RelFileLocator rlocator, ForkNumber forknum)
+{
+	if (forknum == UMBRA_METADATA_FORKNUM)
+		return umbra_metadata_relpath(rlocator);
+
+	return relpathperm(rlocator, forknum);
+}
+
+void
+umbra_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_UMBRA_MAP_SET)
+	{
+		xl_umbra_map_set *xlrec = (xl_umbra_map_set *) rec;
+		RelPathStr	path = umbra_fork_relpath(xlrec->rlocator, xlrec->forknum);
+
+		appendStringInfo(buf, "%s lblk %u old %u new %u",
+						 path.str, xlrec->lblkno, xlrec->old_pblkno,
+						 xlrec->new_pblkno);
+	}
+	else if (info == XLOG_UMBRA_SKIP_WAL_DENSE_MAP)
+	{
+		xl_umbra_skip_wal_dense_map *xlrec =
+			(xl_umbra_skip_wal_dense_map *) rec;
+		RelPathStr	path = umbra_metadata_relpath(xlrec->rlocator);
+
+		appendStringInfo(buf, "%s skip_wal_dense count %u",
+						 path.str, xlrec->count);
+		for (uint16 i = 0; i < xlrec->count; i++)
+			appendStringInfo(buf, " fork %d nblocks %u",
+							 xlrec->entries[i].forknum,
+							 xlrec->entries[i].nblocks);
+	}
+}
+
+const char *
+umbra_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_UMBRA_MAP_SET:
+			id = "MAP_SET";
+			break;
+		case XLOG_UMBRA_SKIP_WAL_DENSE_MAP:
+			id = "SKIP_WAL_DENSE_MAP";
+			break;
+	}
+
+	return id;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 2468a7d257..0fc4f48ca6 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -324,6 +324,7 @@ XLogRecGetBlockRefInfo(XLogReaderState *record, bool pretty,
 		if (detailed_format)
 		{
 			/* Get block references in detailed format. */
+			DecodedBkpBlock *blkref = XLogRecGetBlock(record, block_id);

 			if (pretty)
 				appendStringInfoChar(buf, '\t');
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index a32f473e0a..920625d345 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -39,6 +39,11 @@ OBJS = \
 	xlogutils.o \
 	xlogwait.o

+ifeq ($(with_umbra), yes)
+OBJS += \
+	umbra_xlog.o
+endif
+
 include $(top_srcdir)/src/backend/common.mk

 # ensure that version checks in xlog.c get recompiled when catversion.h changes
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index 06aadc7f31..57eaf44af8 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -27,6 +27,12 @@ backend_sources += files(
   'xlogwait.c',
 )

+if get_option('umbra').enabled()
+  backend_sources += files(
+    'umbra_xlog.c',
+  )
+endif
+
 # used by frontend programs to build a frontend xlogreader
 xlogreader_sources = files(
   'xlogreader.c',
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 4fda03a3cf..bb6beaa71d 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -30,6 +30,9 @@
 #include "access/multixact.h"
 #include "access/nbtxlog.h"
 #include "access/spgxlog.h"
+#ifdef USE_UMBRA
+#include "access/umbra_xlog.h"
+#endif
 #include "access/xact.h"
 #include "catalog/storage_xlog.h"
 #include "commands/dbcommands_xlog.h"
diff --git a/src/backend/access/transam/umbra_xlog.c b/src/backend/access/transam/umbra_xlog.c
new file mode 100644
index 0000000000..71c7ad7bb1
--- /dev/null
+++ b/src/backend/access/transam/umbra_xlog.c
@@ -0,0 +1,227 @@
+/*-------------------------------------------------------------------------
+ *
+ * umbra_xlog.c
+ *	  WAL support for Umbra MAP lifecycle records.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/umbra_xlog.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "storage/map.h"
+#include "storage/smgr.h"
+#include "storage/umbra.h"
+#include "storage/umfile.h"
+
+/*
+ * Log a mapping establishment/switch for one logical block.
+ *
+ * The chosen physical block number is recorded in WAL so redo never allocates
+ * locally; that keeps mapping deterministic in recovery.
+ */
+XLogRecPtr
+log_umbra_map_set(RelFileLocator rlocator, ForkNumber forknum,
+				  BlockNumber lblkno, BlockNumber old_pblkno,
+				  BlockNumber new_pblkno)
+{
+	xl_umbra_map_set xlrec;
+
+	xlrec.rlocator = rlocator;
+	xlrec.forknum = forknum;
+	xlrec.lblkno = lblkno;
+	xlrec.old_pblkno = old_pblkno;
+	xlrec.new_pblkno = new_pblkno;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+	return XLogInsert(RM_UMBRA_ID, XLOG_UMBRA_MAP_SET | XLR_SPECIAL_REL_UPDATE);
+}
+
+XLogRecPtr
+log_umbra_skip_wal_dense_map(RelFileLocator rlocator,
+							 uint16 count,
+							 const xl_umbra_skip_wal_dense_map_entry *entries)
+{
+	xl_umbra_skip_wal_dense_map xlrec;
+
+	Assert(count > 0);
+	Assert(entries != NULL);
+
+	xlrec.rlocator = rlocator;
+	xlrec.count = count;
+	xlrec.padding = 0;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec,
+					 offsetof(xl_umbra_skip_wal_dense_map, entries));
+	XLogRegisterData((char *) entries,
+					 sizeof(xl_umbra_skip_wal_dense_map_entry) * count);
+
+	return XLogInsert(RM_UMBRA_ID,
+					  XLOG_UMBRA_SKIP_WAL_DENSE_MAP | XLR_SPECIAL_REL_UPDATE);
+}
+
+void
+umbra_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	/* Backup blocks are not used in Umbra MAP records. */
+	Assert(!XLogRecHasAnyBlockRefs(record));
+
+	switch (info)
+	{
+		case XLOG_UMBRA_MAP_SET:
+			{
+				xl_umbra_map_set *xlrec = (xl_umbra_map_set *) XLogRecGetData(record);
+				SMgrRelation reln;
+				UmbraFileContext *ctx;
+				static const PGIOAlignedBlock zero_page = {{0}};
+				bool		materialized = false;
+
+				reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
+				ctx = umfile_ctx_acquire(reln->smgr_rlocator);
+
+				/*
+				 * During replay of drop/tablespace churn, the relation path can
+				 * already be gone. Treat missing MAIN+MAP as stale WAL and skip.
+				 */
+				if (!UmMetadataExists(reln))
+					break;
+
+				/*
+				 * MAP_SET(old,new) replays the physical copy first, then switches
+				 * the mapping. This keeps crash recovery independent of whether
+				 * background flushing persisted the new physical page before crash.
+				 */
+				if (xlrec->old_pblkno != InvalidBlockNumber)
+				{
+					BlockNumber nblocks;
+					PGIOAlignedBlock pagebuf;
+
+					nblocks = umfile_ctx_get_nblocks(ctx, xlrec->forknum,
+													 UMFILE_NBLOCKS_SPARSE);
+					if (xlrec->old_pblkno < nblocks)
+					{
+						umfile_ctx_read(ctx, xlrec->forknum, xlrec->old_pblkno,
+										pagebuf.data, BLCKSZ);
+						umfile_ctx_extend(ctx, xlrec->forknum, xlrec->new_pblkno,
+										  pagebuf.data);
+						umfile_ctx_register_dirty(ctx, xlrec->forknum,
+												  xlrec->new_pblkno,
+												  false, false);
+						materialized = true;
+					}
+					else
+					{
+						ereport(DEBUG1,
+								(errmsg_internal("skip UMBRA MAP_SET relocation replay for relation %u/%u/%u fork %d lblk %u: old pblk %u beyond nblocks %u",
+												 xlrec->rlocator.spcOid,
+												 xlrec->rlocator.dbOid,
+												 xlrec->rlocator.relNumber,
+												 xlrec->forknum,
+												 xlrec->lblkno,
+												 xlrec->old_pblkno,
+												 nblocks)));
+					}
+				}
+
+				MapSetMapping(ctx, xlrec->rlocator, xlrec->forknum,
+							  xlrec->lblkno, xlrec->new_pblkno,
+							  record->EndRecPtr);
+				MapSBlockBumpNextFreePhysBlock(ctx, xlrec->rlocator,
+											   xlrec->forknum,
+											   xlrec->new_pblkno + 1,
+											   record->EndRecPtr);
+
+				/*
+				 * MAP_SET with invalid old_pblkno means first mapping for this
+				 * logical block (extend/zeroextend path). Keep superblock
+				 * logical_nblocks in sync during redo as well.
+				 */
+				if (xlrec->old_pblkno == InvalidBlockNumber)
+				{
+					/*
+					 * Ensure the mapped physical page exists even if there is no
+					 * later WAL record that overwrites it (e.g. dummy pages used
+					 * to fill gaps for smgrextend semantics).  It's safe to write
+					 * zeros even if a later record will overwrite the page image.
+					 */
+					umfile_ctx_extend(ctx, xlrec->forknum, xlrec->new_pblkno,
+									  (const char *) zero_page.data);
+					umfile_ctx_register_dirty(ctx, xlrec->forknum,
+											  xlrec->new_pblkno,
+											  false, false);
+					materialized = true;
+
+					MapSBlockBumpLogicalNblocks(ctx, xlrec->rlocator,
+												xlrec->forknum,
+												xlrec->lblkno + 1,
+												record->EndRecPtr);
+				}
+
+				if (materialized)
+					MapSBlockBumpPhysicalNblocks(ctx, xlrec->rlocator,
+												 xlrec->forknum,
+												 xlrec->new_pblkno + 1,
+												 record->EndRecPtr);
+			}
+			break;
+
+		case XLOG_UMBRA_SKIP_WAL_DENSE_MAP:
+			{
+				xl_umbra_skip_wal_dense_map *xlrec;
+				xl_umbra_skip_wal_dense_map_entry *entries;
+				SMgrRelation reln;
+				UmbraFileContext *ctx;
+
+				xlrec = (xl_umbra_skip_wal_dense_map *) XLogRecGetData(record);
+				entries = xlrec->entries;
+				reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
+				ctx = umfile_ctx_acquire(reln->smgr_rlocator);
+
+				if (!UmMetadataExists(reln))
+					break;
+
+				MapInvalidateRelation(xlrec->rlocator);
+
+				for (uint16 i = 0; i < xlrec->count; i++)
+				{
+					ForkNumber	forknum = entries[i].forknum;
+					BlockNumber nblocks = entries[i].nblocks;
+
+					if (!UmbraForkUsesMapTranslation(forknum) ||
+						!BlockNumberIsValid(nblocks))
+						elog(PANIC,
+							 "invalid UMBRA skip-WAL dense-map record for relation %u/%u/%u fork %d nblocks %u",
+							 xlrec->rlocator.spcOid,
+							 xlrec->rlocator.dbOid,
+							 xlrec->rlocator.relNumber,
+							 forknum, nblocks);
+					Assert(nblocks > 0);
+
+					for (BlockNumber lblk = 0; lblk < nblocks; lblk++)
+						MapSetMapping(ctx, xlrec->rlocator, forknum,
+									  lblk, lblk, record->EndRecPtr);
+
+					MapSBlockBumpNextFreePhysBlock(ctx, xlrec->rlocator,
+												   forknum, nblocks,
+												   record->EndRecPtr);
+					MapSBlockBumpPhysicalNblocks(ctx, xlrec->rlocator,
+												 forknum, nblocks,
+												 record->EndRecPtr);
+					MapSBlockSetLogicalNblocks(ctx, xlrec->rlocator,
+											   forknum, nblocks,
+											   record->EndRecPtr);
+				}
+			}
+			break;
+
+		default:
+			elog(PANIC, "umbra_redo: unknown op code %u", info);
+	}
+}
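
The MAP_SET replay order in umbra_redo() above (materialize the physical page first, then switch the mapping, then bump the watermarks) can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: ToyFork, toy_fork_init() and toy_redo_map_set() are hypothetical names, the page size is shrunk to 16 bytes, and the toy compares old_pblk against its own physical_nblocks counter instead of asking the file layer for the fork length as the patch does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ        16	/* tiny page size, illustration only */
#define TOY_MAXBLOCKS     8
#define TOY_INVALID_BLOCK UINT32_MAX

/* Hypothetical stand-in for one fork plus its MAP/superblock state. */
typedef struct ToyFork
{
	uint8_t		pages[TOY_MAXBLOCKS][TOY_BLCKSZ];	/* physical pages */
	uint32_t	map[TOY_MAXBLOCKS];	/* lblk -> pblk */
	uint32_t	next_free_pblk;
	uint32_t	physical_nblocks;
	uint32_t	logical_nblocks;
} ToyFork;

static void
toy_fork_init(ToyFork *f)
{
	assert(f != NULL);
	memset(f, 0, sizeof(*f));
	for (uint32_t i = 0; i < TOY_MAXBLOCKS; i++)
		f->map[i] = TOY_INVALID_BLOCK;
}

static uint32_t
toy_max(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/* Replay one MAP_SET(lblk, old_pblk, new_pblk); returns -1 if out of range. */
static int
toy_redo_map_set(ToyFork *f, uint32_t lblk, uint32_t old_pblk, uint32_t new_pblk)
{
	int			materialized = 0;

	assert(f != NULL);
	if (lblk >= TOY_MAXBLOCKS || new_pblk >= TOY_MAXBLOCKS)
		return -1;

	/* Relocation: copy the old physical page before touching the mapping. */
	if (old_pblk != TOY_INVALID_BLOCK && old_pblk < f->physical_nblocks)
	{
		memmove(f->pages[new_pblk], f->pages[old_pblk], TOY_BLCKSZ);
		materialized = 1;
	}

	/* Switch the mapping and advance the allocation watermark. */
	f->map[lblk] = new_pblk;
	f->next_free_pblk = toy_max(f->next_free_pblk, new_pblk + 1);

	/* First mapping for this logical block: zero-fill and grow logical size. */
	if (old_pblk == TOY_INVALID_BLOCK)
	{
		memset(f->pages[new_pblk], 0, TOY_BLCKSZ);
		materialized = 1;
		f->logical_nblocks = toy_max(f->logical_nblocks, lblk + 1);
	}

	if (materialized)
		f->physical_nblocks = toy_max(f->physical_nblocks, new_pblk + 1);
	return 0;
}
```

Because the copy happens inside replay, the toy (like the patch) does not depend on whether the new physical page had reached disk before the crash.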
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5fbe39133b..f32aac5476 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,10 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
+#ifdef USE_UMBRA
+#include "storage/map.h"
+#include "storage/umbra.h"
+#endif
 #include "storage/smgr.h"
 #include "utils/hsearch.h"
 #include "utils/rel.h"
@@ -77,10 +81,45 @@ typedef struct xl_invalid_page

 static HTAB *invalid_page_tab = NULL;

+#ifdef USE_UMBRA
+typedef struct xl_missing_metadata_key
+{
+	RelFileLocator locator;		/* relation whose Umbra metadata is missing */
+} xl_missing_metadata_key;
+
+typedef struct xl_missing_metadata
+{
+	xl_missing_metadata_key key;	/* hash key ... must be first */
+} xl_missing_metadata;
+
+static HTAB *missing_metadata_tab = NULL;
+#endif
+
 static int	read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 									  int reqLen, XLogRecPtr targetRecPtr,
 									  char *cur_page, bool wait_for_wal);

+#ifndef USE_UMBRA
+static XLogRedoAction XLogReadBufferForRedoExtendedMd(XLogReaderState *record,
+													 uint8 block_id,
+													 ReadBufferMode mode,
+													 bool get_cleanup_lock,
+													 Buffer *buf);
+#endif
+#ifdef USE_UMBRA
+static XLogRedoAction XLogReadBufferForRedoExtendedUmbra(XLogReaderState *record,
+														uint8 block_id,
+														ReadBufferMode mode,
+														bool get_cleanup_lock,
+														Buffer *buf);
+static uint8 XLogUmbraMapStateForRedo(SMgrRelation smgr, ForkNumber forknum);
+static bool XLogUmbraEnsureMappedBlockForRedo(RelFileLocator rlocator,
+											  ForkNumber forknum,
+											  BlockNumber blkno);
+static bool XLogUmbraEnsureMetadataForRedo(RelFileLocator rlocator,
+										   ForkNumber forknum);
+#endif
+
 /* Report a reference to an invalid page */
 static void
 report_invalid_page(int elevel, RelFileLocator locator, ForkNumber forkno,
@@ -96,6 +135,16 @@ report_invalid_page(int elevel, RelFileLocator locator, ForkNumber forkno,
 			 blkno, path.str);
 }

+#ifdef USE_UMBRA
+static void
+report_missing_metadata(int elevel, RelFileLocator locator)
+{
+	RelPathStr	path = UmMetadataRelPathPerm(locator);
+
+	elog(elevel, "MAP metadata for relation %s is missing", path.str);
+}
+#endif
+
 /* Log a reference to an invalid page */
 static void
 log_invalid_page(RelFileLocator locator, ForkNumber forkno, BlockNumber blkno,
@@ -160,6 +209,39 @@ log_invalid_page(RelFileLocator locator, ForkNumber forkno, BlockNumber blkno,
 	}
 }

+#ifdef USE_UMBRA
+void
+XLogLogMissingRelationMetadata(RelFileLocator locator)
+{
+	xl_missing_metadata_key key;
+	xl_missing_metadata *hentry;
+	bool		found;
+
+	if (message_level_is_interesting(DEBUG1))
+		report_missing_metadata(DEBUG1, locator);
+
+	if (missing_metadata_tab == NULL)
+	{
+		HASHCTL		ctl;
+
+		ctl.keysize = sizeof(xl_missing_metadata_key);
+		ctl.entrysize = sizeof(xl_missing_metadata);
+
+		missing_metadata_tab = hash_create("XLOG missing-metadata table",
+										   32,
+										   &ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.locator = locator;
+	hentry = (xl_missing_metadata *)
+		hash_search(missing_metadata_tab, &key, HASH_ENTER, &found);
+
+	(void) hentry;
+	(void) found;
+}
+#endif
+
 /* Forget any invalid pages >= minblkno, because they've been dropped */
 static void
 forget_invalid_pages(RelFileLocator locator, ForkNumber forkno,
@@ -219,6 +301,48 @@ forget_invalid_pages_db(Oid dbid)
 	}
 }

+#ifdef USE_UMBRA
+static void
+forget_missing_metadata(RelFileLocator locator)
+{
+	xl_missing_metadata_key key;
+
+	if (missing_metadata_tab == NULL)
+		return;
+
+	key.locator = locator;
+	if (hash_search(missing_metadata_tab, &key, HASH_REMOVE, NULL) != NULL)
+		elog(DEBUG2, "MAP metadata for relation %s has been resolved",
+			 UmMetadataRelPathPerm(locator).str);
+}
+
+static void
+forget_missing_metadata_db(Oid dbid)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_metadata *hentry;
+
+	if (missing_metadata_tab == NULL)
+		return;
+
+	hash_seq_init(&status, missing_metadata_tab);
+
+	while ((hentry = (xl_missing_metadata *) hash_seq_search(&status)) != NULL)
+	{
+		if (hentry->key.locator.dbOid == dbid)
+		{
+			elog(DEBUG2, "MAP metadata for relation %s has been resolved",
+				 UmMetadataRelPathPerm(hentry->key.locator).str);
+
+			if (hash_search(missing_metadata_tab,
+							&hentry->key,
+							HASH_REMOVE, NULL) == NULL)
+				elog(ERROR, "hash table corrupted");
+		}
+	}
+}
+#endif
+
 /* Are there any unresolved references to invalid pages? */
 bool
 XLogHaveInvalidPages(void)
@@ -226,6 +350,11 @@ XLogHaveInvalidPages(void)
 	if (invalid_page_tab != NULL &&
 		hash_get_num_entries(invalid_page_tab) > 0)
 		return true;
+#ifdef USE_UMBRA
+	if (missing_metadata_tab != NULL &&
+		hash_get_num_entries(missing_metadata_tab) > 0)
+		return true;
+#endif
 	return false;
 }

@@ -237,21 +366,43 @@ XLogCheckInvalidPages(void)
 	xl_invalid_page *hentry;
 	bool		foundone = false;

+#ifdef USE_UMBRA
+	if (invalid_page_tab == NULL && missing_metadata_tab == NULL)
+#else
 	if (invalid_page_tab == NULL)
+#endif
 		return;					/* nothing to do */

-	hash_seq_init(&status, invalid_page_tab);
+	if (invalid_page_tab != NULL)
+	{
+		hash_seq_init(&status, invalid_page_tab);

-	/*
-	 * Our strategy is to emit WARNING messages for all remaining entries and
-	 * only PANIC after we've dumped all the available info.
-	 */
-	while ((hentry = (xl_invalid_page *) hash_seq_search(&status)) != NULL)
+		/*
+		 * Our strategy is to emit WARNING messages for all remaining entries and
+		 * only PANIC after we've dumped all the available info.
+		 */
+		while ((hentry = (xl_invalid_page *) hash_seq_search(&status)) != NULL)
+		{
+			report_invalid_page(WARNING, hentry->key.locator, hentry->key.forkno,
+								hentry->key.blkno, hentry->present);
+			foundone = true;
+		}
+	}
+
+#ifdef USE_UMBRA
+	if (missing_metadata_tab != NULL)
 	{
-		report_invalid_page(WARNING, hentry->key.locator, hentry->key.forkno,
-							hentry->key.blkno, hentry->present);
-		foundone = true;
+		HASH_SEQ_STATUS missing_status;
+		xl_missing_metadata *mentry;
+
+		hash_seq_init(&missing_status, missing_metadata_tab);
+		while ((mentry = (xl_missing_metadata *) hash_seq_search(&missing_status)) != NULL)
+		{
+			report_missing_metadata(WARNING, mentry->key.locator);
+			foundone = true;
+		}
 	}
+#endif

 	if (foundone)
 		elog(ignore_invalid_pages ? WARNING : PANIC,
@@ -259,8 +410,99 @@ XLogCheckInvalidPages(void)

 	hash_destroy(invalid_page_tab);
 	invalid_page_tab = NULL;
+
+#ifdef USE_UMBRA
+	if (missing_metadata_tab != NULL)
+	{
+		hash_destroy(missing_metadata_tab);
+		missing_metadata_tab = NULL;
+	}
+#endif
+}
+
+#ifdef USE_UMBRA
+/*
+ * Choose the handle-local Umbra MAP policy that redo sets on an smgr handle.
+ *
+ * Mapped permanent forks replay under REQUIRE_MAP even if a later restartpoint
+ * cleanup already removed the MAP fork from disk. The only runtime override
+ * is the durable skip-WAL-pending bit stored in the MAP superblock.
+ * INIT/internal/temp forks stay on BYPASS_MAP.
+ */
+static uint8
+XLogUmbraMapStateForRedo(SMgrRelation smgr, ForkNumber forknum)
+{
+	UmbraFileContext *ctx = umfile_ctx_acquire(smgr->smgr_rlocator);
+
+	if (!UmbraForkUsesMapTranslation(forknum) ||
+		forknum == INIT_FORKNUM ||
+		smgrisinternalfork(forknum) ||
+		RelFileLocatorBackendIsTemp(smgr->smgr_rlocator))
+		return UMBRA_MAP_POLICY_BYPASS_MAP;
+
+	if (UmMetadataExists(smgr) &&
+		MapSBlockIsSkipWalPending(ctx, smgr->smgr_rlocator.locator))
+		return UMBRA_MAP_POLICY_SKIP_WAL_PENDING_MAP;
+
+	return UMBRA_MAP_POLICY_REQUIRE_MAP;
+}
+
+/*
+ * For mapped forks, redo must not silently invent local mappings.
+ * A missing mapping is logged as an invalid-page or missing-metadata reference.
+ */
+static bool
+XLogUmbraEnsureMappedBlockForRedo(RelFileLocator rlocator, ForkNumber forknum,
+								  BlockNumber blkno)
+{
+	SMgrRelation smgr;
+	BlockNumber	pblkno;
+
+	if (forknum == INIT_FORKNUM || smgrisinternalfork(forknum))
+		return true;
+
+	smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
+
+	if (!UmMetadataExists(smgr))
+	{
+		XLogLogMissingRelationMetadata(rlocator);
+		return false;
+	}
+
+	if (!UmMapTryLookupPblkno(smgr, forknum, blkno, &pblkno))
+	{
+		if (UmMapIsLogicalUnmaterialized(smgr, forknum, blkno))
+			return true;
+		log_invalid_page(rlocator, forknum, blkno, false);
+		return false;
+	}
+
+	return true;
 }

+static bool
+XLogUmbraEnsureMetadataForRedo(RelFileLocator rlocator, ForkNumber forknum)
+{
+	SMgrRelation smgr;
+	uint8		map_state;
+
+	if (forknum == INIT_FORKNUM || smgrisinternalfork(forknum))
+		return true;
+
+	smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
+	map_state = XLogUmbraMapStateForRedo(smgr, forknum);
+	smgrsetmapstate(smgr, map_state);
+	if (map_state != UMBRA_MAP_POLICY_REQUIRE_MAP)
+		return true;
+
+	if (!UmMetadataExists(smgr))
+		smgrcreaterelationmetadata(smgr);
+
+	return true;
+}
+
+#endif
+

 /*
  * XLogReadBufferForRedo
@@ -341,6 +583,22 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 							  uint8 block_id,
 							  ReadBufferMode mode, bool get_cleanup_lock,
 							  Buffer *buf)
+{
+#ifdef USE_UMBRA
+	return XLogReadBufferForRedoExtendedUmbra(record, block_id, mode,
+											  get_cleanup_lock, buf);
+#else
+	return XLogReadBufferForRedoExtendedMd(record, block_id, mode,
+										   get_cleanup_lock, buf);
+#endif
+}
+
+#ifndef USE_UMBRA
+static XLogRedoAction
+XLogReadBufferForRedoExtendedMd(XLogReaderState *record,
+								uint8 block_id,
+								ReadBufferMode mode, bool get_cleanup_lock,
+								Buffer *buf)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	RelFileLocator rlocator;
@@ -354,15 +612,10 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	if (!XLogRecGetBlockTagExtended(record, block_id, &rlocator, &forknum, &blkno,
 									&prefetch_buffer))
 	{
-		/* Caller specified a bogus block_id */
 		elog(PANIC, "failed to locate backup block with ID %d in WAL record",
 			 block_id);
 	}

-	/*
-	 * Make sure that if the block is marked with WILL_INIT, the caller is
-	 * going to initialize it. And vice versa.
-	 */
 	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
 	willinit = (XLogRecGetBlock(record, block_id)->flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
@@ -370,7 +623,6 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	if (!willinit && zeromode)
 		elog(PANIC, "block to be initialized in redo routine must be marked with WILL_INIT flag in the WAL record");

-	/* If it has a full-page image and it should be restored, do it. */
 	if (XLogRecBlockImageApply(record, block_id))
 	{
 		Assert(XLogRecHasBlockImage(record, block_id));
@@ -383,49 +635,105 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 					(errcode(ERRCODE_INTERNAL_ERROR),
 					 errmsg_internal("%s", record->errormsg_buf)));

-		/*
-		 * The page may be uninitialized. If so, we can't set the LSN because
-		 * that would corrupt the page.
-		 */
 		if (!PageIsNew(page))
-		{
 			PageSetLSN(page, lsn);
-		}
 
 		MarkBufferDirty(*buf);
-
-		/*
-		 * At the end of crash recovery the init forks of unlogged relations
-		 * are copied, without going through shared buffers. So we need to
-		 * force the on-disk state of init forks to always be in sync with the
-		 * state in shared buffers.
-		 */
 		if (forknum == INIT_FORKNUM)
 			FlushOneBuffer(*buf);

 		return BLK_RESTORED;
 	}
-	else
+
+	*buf = XLogReadBufferExtended(rlocator, forknum, blkno, mode, prefetch_buffer);
+	if (BufferIsValid(*buf))
 	{
-		*buf = XLogReadBufferExtended(rlocator, forknum, blkno, mode, prefetch_buffer);
-		if (BufferIsValid(*buf))
+		if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
 		{
-			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
-			{
-				if (get_cleanup_lock)
-					LockBufferForCleanup(*buf);
-				else
-					LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
-			}
-			if (lsn <= PageGetLSN(BufferGetPage(*buf)))
-				return BLK_DONE;
+			if (get_cleanup_lock)
+				LockBufferForCleanup(*buf);
 			else
-				return BLK_NEEDS_REDO;
+				LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
 		}
-		else
-			return BLK_NOTFOUND;
+		if (lsn <= PageGetLSN(BufferGetPage(*buf)))
+			return BLK_DONE;
+		return BLK_NEEDS_REDO;
 	}
+
+	return BLK_NOTFOUND;
+}
+#endif
+
+#ifdef USE_UMBRA
+static XLogRedoAction
+XLogReadBufferForRedoExtendedUmbra(XLogReaderState *record,
+								   uint8 block_id,
+								   ReadBufferMode mode, bool get_cleanup_lock,
+								   Buffer *buf)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	RelFileLocator rlocator;
+	ForkNumber	forknum;
+	BlockNumber blkno;
+	Buffer		prefetch_buffer;
+	Page		page;
+	bool		zeromode;
+	bool		willinit;
+
+	if (!XLogRecGetBlockTagExtended(record, block_id, &rlocator, &forknum, &blkno,
+									&prefetch_buffer))
+	{
+		elog(PANIC, "failed to locate backup block with ID %d in WAL record",
+			 block_id);
+	}
+
+	*buf = InvalidBuffer;
+
+	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+	willinit = (XLogRecGetBlock(record, block_id)->flags & BKPBLOCK_WILL_INIT) != 0;
+	if (willinit && !zeromode)
+		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
+	if (!willinit && zeromode)
+		elog(PANIC, "block to be initialized in redo routine must be marked with WILL_INIT flag in the WAL record");
+
+	if (!XLogUmbraEnsureMetadataForRedo(rlocator, forknum))
+		return BLK_NOTFOUND;
+
+	if (XLogRecBlockImageApply(record, block_id))
+	{
+		Assert(XLogRecHasBlockImage(record, block_id));
+		*buf = XLogReadBufferExtended(rlocator, forknum, blkno,
+									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
+									  prefetch_buffer);
+		page = BufferGetPage(*buf);
+		if (!RestoreBlockImage(record, block_id, page))
+			ereport(ERROR,
+					(errcode(ERRCODE_INTERNAL_ERROR),
+					 errmsg_internal("%s", record->errormsg_buf)));
+
+		if (!PageIsNew(page))
+			PageSetLSN(page, lsn);
+
+		MarkBufferDirty(*buf);
+		if (forknum == INIT_FORKNUM)
+			FlushOneBuffer(*buf);
+
+		return BLK_RESTORED;
+	}
+
+	if (!XLogUmbraEnsureMappedBlockForRedo(rlocator, forknum, blkno))
+		return BLK_NOTFOUND;
+
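+	*buf = XLogReadBufferExtended(rlocator, forknum, blkno, mode, prefetch_buffer);
+	if (!BufferIsValid(*buf))
+		return BLK_NOTFOUND;
+	if (!zeromode && get_cleanup_lock)
+		LockBufferForCleanup(*buf);
+	else if (!zeromode)
+		LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
+	return (lsn <= PageGetLSN(BufferGetPage(*buf))) ? BLK_DONE : BLK_NEEDS_REDO;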
 }
+#endif

 /*
  * XLogReadBufferExtended
@@ -630,6 +938,9 @@ void
 XLogDropRelation(RelFileLocator rlocator, ForkNumber forknum)
 {
 	forget_invalid_pages(rlocator, forknum, 0);
+#ifdef USE_UMBRA
+	forget_missing_metadata(rlocator);
+#endif
 }
 
 /*
@@ -649,6 +960,9 @@ XLogDropDatabase(Oid dbid)
 	smgrdestroyall();

 	forget_invalid_pages_db(dbid);
+#ifdef USE_UMBRA
+	forget_missing_metadata_db(dbid);
+#endif
 }

 /*
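
The per-fork policy choice made by XLogUmbraMapStateForRedo() in the hunk above reduces to a small decision function. The sketch below restates that logic over plain booleans; ToyForkInfo, ToyMapPolicy and toy_map_policy_for_redo() are hypothetical names, and the boolean fields stand in for the patch's predicates (UmbraForkUsesMapTranslation(), UmMetadataExists(), MapSBlockIsSkipWalPending() and so on), which are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the three policies named in the patch. */
typedef enum ToyMapPolicy
{
	TOY_BYPASS_MAP,
	TOY_SKIP_WAL_PENDING_MAP,
	TOY_REQUIRE_MAP
} ToyMapPolicy;

/* Hypothetical snapshot of the facts the real function queries. */
typedef struct ToyForkInfo
{
	bool		uses_map_translation;
	bool		is_init_fork;
	bool		is_internal_fork;
	bool		is_temp;
	bool		metadata_exists;
	bool		skip_wal_pending;
} ToyForkInfo;

static ToyMapPolicy
toy_map_policy_for_redo(const ToyForkInfo *fi)
{
	assert(fi != NULL);

	/* Forks that never use lblk -> pblk translation bypass the MAP. */
	if (!fi->uses_map_translation || fi->is_init_fork ||
		fi->is_internal_fork || fi->is_temp)
		return TOY_BYPASS_MAP;

	/* The durable skip-WAL-pending bit is the only runtime override. */
	if (fi->metadata_exists && fi->skip_wal_pending)
		return TOY_SKIP_WAL_PENDING_MAP;

	/* Everything else must replay through the MAP, even if it is missing. */
	return TOY_REQUIRE_MAP;
}
```

Note that a mapped permanent fork with no metadata on disk still yields REQUIRE_MAP, which matches the comment above the real function.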
diff --git a/src/backend/storage/map/mapsuper.c b/src/backend/storage/map/mapsuper.c
index ad4a6f6bdb..3d8909f7a4 100644
--- a/src/backend/storage/map/mapsuper.c
+++ b/src/backend/storage/map/mapsuper.c
@@ -858,7 +858,10 @@ MapSuperPrepareEntryForUpdate(UmbraFileContext *map_ctx, RelFileLocator rnode,
 		if (status == MAP_SBLOCK_READ_MISSING)
 		{
 			if (InRecovery)
+			{
+				XLogLogMissingRelationMetadata(rnode);
 				return false;
+			}
 			elog(ERROR, "%s", missing_errmsg);
 		}

diff --git a/src/backend/storage/smgr/umbra.c b/src/backend/storage/smgr/umbra.c
index 2baf64defe..917dff0a64 100644
--- a/src/backend/storage/smgr/umbra.c
+++ b/src/backend/storage/smgr/umbra.c
@@ -28,6 +28,7 @@

 #include "postgres.h"

+#include "access/umbra_xlog.h"
 #include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
@@ -793,6 +794,12 @@ void
 UmRebuildMapAndSuperblockForSkipWAL(SMgrRelation reln)
 {
 	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	xl_umbra_skip_wal_dense_map_entry apply_entries[MAX_FORKNUM + 1];
+	xl_umbra_skip_wal_dense_map_entry wal_entries[MAX_FORKNUM + 1];
+	uint16		apply_count = 0;
+	uint16		wal_count = 0;
+	XLogRecPtr	map_lsn = InvalidXLogRecPtr;
+	bool		wal_insert_enabled;

 	/*
 	 * Rebuild assumes the relation stayed on direct lblk==pblk access during
@@ -812,27 +819,54 @@ UmRebuildMapAndSuperblockForSkipWAL(SMgrRelation reln)
 			continue;
 
 		if (!umfile_exists(ctx, forknum, UMFILE_EXISTS_DENSE))
-		{
-			MapSBlockSetLogicalNblocks(ctx, reln->smgr_rlocator.locator,
-									   forknum, 0, InvalidXLogRecPtr);
 			continue;
-		}
 		nblocks = umfile_nblocks(ctx, forknum, UMFILE_NBLOCKS_DENSE);
+		apply_entries[apply_count].forknum = forknum;
+		apply_entries[apply_count].nblocks = nblocks;
+		apply_count++;
+
+		/*
+		 * The redo anchor records dense [0, nblocks) mapping. Empty forks
+		 * don't need an anchor and may correspond to zero-length metadata left
+		 * by aborted storage operations.
+		 */
+		if (nblocks > 0)
+		{
+			wal_entries[wal_count].forknum = forknum;
+			wal_entries[wal_count].nblocks = nblocks;
+			wal_count++;
+		}
+	}
+
+	wal_insert_enabled =
+		XLogInsertAllowed() &&
+		!IsBootstrapProcessingMode() &&
+		!IsInitProcessingMode();
+
+	if (wal_count > 0 && wal_insert_enabled)
+		map_lsn = log_umbra_skip_wal_dense_map(reln->smgr_rlocator.locator,
+											   wal_count, wal_entries);
+
+	for (uint16 i = 0; i < apply_count; i++)
+	{
+		ForkNumber	forknum = apply_entries[i].forknum;
+		BlockNumber nblocks = apply_entries[i].nblocks;
+		XLogRecPtr	fork_lsn = nblocks > 0 ? map_lsn : InvalidXLogRecPtr;

 		for (BlockNumber lblk = 0; lblk < nblocks; lblk++)
 			MapSetMapping(ctx, reln->smgr_rlocator.locator, forknum,
-						  lblk, lblk, InvalidXLogRecPtr);
+						  lblk, lblk, fork_lsn);

 		if (nblocks > 0)
 		{
 			MapSBlockBumpNextFreePhysBlock(ctx, reln->smgr_rlocator.locator,
-										   forknum, nblocks, InvalidXLogRecPtr);
+										   forknum, nblocks, fork_lsn);
 			MapSBlockBumpPhysicalNblocks(ctx, reln->smgr_rlocator.locator,
-										 forknum, nblocks, InvalidXLogRecPtr);
+										 forknum, nblocks, fork_lsn);
 		}
 		MapSBlockSetLogicalNblocks(ctx, reln->smgr_rlocator.locator,
-								   forknum, nblocks, InvalidXLogRecPtr);
+								   forknum, nblocks, fork_lsn);
 	}
 }

diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 931ab8b979..21fc620afa 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -20,6 +20,9 @@
 #include "access/nbtxlog.h"
 #include "access/rmgr.h"
 #include "access/spgxlog.h"
+#ifdef USE_UMBRA
+#include "access/umbra_xlog.h"
+#endif
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/storage_xlog.h"
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index ae32ef16d6..3ca7d45ca4 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,4 +47,7 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode)
+#ifdef USE_UMBRA
+PG_RMGR(RM_UMBRA_ID, "Umbra", umbra_redo, umbra_desc, umbra_identify, NULL, NULL, NULL, NULL)
+#endif
 PG_RMGR(RM_XLOG2_ID, "XLOG2", xlog2_redo, xlog2_desc, xlog2_identify, NULL, NULL, NULL, xlog2_decode)
diff --git a/src/include/access/umbra_xlog.h b/src/include/access/umbra_xlog.h
new file mode 100644
index 0000000000..cb0c2bac57
--- /dev/null
+++ b/src/include/access/umbra_xlog.h
@@ -0,0 +1,58 @@
+/*-------------------------------------------------------------------------
+ *
+ * umbra_xlog.h
+ *	  WAL support for Umbra MAP metadata.
+ *
+ * Umbra logs these record types:
+ * - MAP_SET: establish/switch lblkno -> pblkno mapping
+ * - SKIP_WAL_DENSE_MAP: record non-empty skip-WAL dense lblk==pblk frontiers
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UMBRA_XLOG_H
+#define UMBRA_XLOG_H
+
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* XLOG gives us high 4 bits */
+#define XLOG_UMBRA_MAP_SET			0x10
+#define XLOG_UMBRA_SKIP_WAL_DENSE_MAP	0x60
+
+typedef struct xl_umbra_map_set
+{
+	RelFileLocator rlocator;
+	ForkNumber	forknum;
+	BlockNumber lblkno;
+	BlockNumber old_pblkno;
+	BlockNumber new_pblkno;
+} xl_umbra_map_set;
+
+typedef struct xl_umbra_skip_wal_dense_map_entry
+{
+	ForkNumber	forknum;
+	BlockNumber nblocks;
+} xl_umbra_skip_wal_dense_map_entry;
+
+typedef struct xl_umbra_skip_wal_dense_map
+{
+	RelFileLocator rlocator;
+	uint16		count;
+	uint16		padding;
+	xl_umbra_skip_wal_dense_map_entry entries[FLEXIBLE_ARRAY_MEMBER];
+} xl_umbra_skip_wal_dense_map;
+
+extern XLogRecPtr log_umbra_map_set(RelFileLocator rlocator, ForkNumber forknum,
+									BlockNumber lblkno, BlockNumber old_pblkno,
+									BlockNumber new_pblkno);
+extern XLogRecPtr log_umbra_skip_wal_dense_map(RelFileLocator rlocator,
+											   uint16 count,
+											   const xl_umbra_skip_wal_dense_map_entry *entries);
+
+extern void umbra_redo(XLogReaderState *record);
+extern void umbra_desc(StringInfo buf, XLogReaderState *record);
+extern const char *umbra_identify(uint8 info);
+
+#endif							/* UMBRA_XLOG_H */
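
The SKIP_WAL_DENSE_MAP record declared above is a fixed header followed by a variable number of entries, and log_umbra_skip_wal_dense_map() registers the two parts separately using offsetof() on the flexible array member. The sketch below shows the same sizing and packing arithmetic on stand-in types. ToyEntry, ToyDenseMap and both functions are hypothetical; in particular the three uint32_t fields only approximate RelFileLocator, and a flat byte buffer replaces the XLogRegisterData() calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the record structs in umbra_xlog.h. */
typedef struct ToyEntry
{
	int32_t		forknum;
	uint32_t	nblocks;
} ToyEntry;

typedef struct ToyDenseMap
{
	uint32_t	spc;
	uint32_t	db;
	uint32_t	rel;
	uint16_t	count;
	uint16_t	padding;
	ToyEntry	entries[];		/* flexible array member */
} ToyDenseMap;

/* Payload size: header up to the array, plus count entries. */
static size_t
toy_dense_map_size(uint16_t count)
{
	return offsetof(ToyDenseMap, entries) + sizeof(ToyEntry) * count;
}

/* Pack header and entries back to back; returns bytes written, 0 if too small. */
static size_t
toy_dense_map_pack(unsigned char *dst, size_t dstlen,
				   const ToyDenseMap *hdr, const ToyEntry *entries)
{
	size_t		hdrlen = offsetof(ToyDenseMap, entries);
	size_t		total;

	assert(dst != NULL && hdr != NULL && entries != NULL);
	total = toy_dense_map_size(hdr->count);
	if (total > dstlen)
		return 0;

	memcpy(dst, hdr, hdrlen);
	memcpy(dst + hdrlen, entries, sizeof(ToyEntry) * hdr->count);
	return total;
}
```

Using offsetof() rather than sizeof() for the header keeps any trailing struct padding out of the record, which is why the real insert path does the same.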
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index 91dfbd5627..073d6e2ee7 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -33,12 +33,16 @@
 #define REGBUF_NO_IMAGE		0x02	/* don't take a full-page image */
 #define REGBUF_WILL_INIT	(0x04 | 0x02)	/* page will be re-initialized at
 											 * replay (implies NO_IMAGE) */
+#define REGBUF_WILL_INIT_BIRTH \
+	(REGBUF_WILL_INIT | REGBUF_LOGICAL_BIRTH)
 #define REGBUF_STANDARD		0x08	/* page follows "standard" page layout,
 									 * (data between pd_lower and pd_upper
 									 * will be skipped) */
 #define REGBUF_KEEP_DATA	0x10	/* include data even if a full-page image
 									 * is taken */
 #define REGBUF_NO_CHANGE	0x20	/* intentionally register clean buffer */
+#define REGBUF_LOGICAL_BIRTH	0x40	/* this record publishes a logical page
+									 * birth/rebirth mapping */

 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index e8999d3fe9..80764f9a26 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -90,6 +90,22 @@ typedef struct XLogRecord
  */
 #define XLR_CHECK_CONSISTENCY	0x02

+/*
+ * Legacy Umbra-only record flags formerly used for compact remap encodings.
+ *
+ * New WAL records always use the full remap header. Reader-side code keeps
+ * these bits only to reject unsupported old-format records explicitly.
+ */
+#ifdef USE_UMBRA
+#define XLR_UMBRA_REMAP_FORMAT_MASK		0x0C
+#define XLR_UMBRA_COMPACT_BIRTH_REMAP	0x04
+#define XLR_UMBRA_ORDINARY_SLIM_REMAP	0x08
+#else
+#define XLR_UMBRA_REMAP_FORMAT_MASK		0x00
+#define XLR_UMBRA_COMPACT_BIRTH_REMAP	0x00
+#define XLR_UMBRA_ORDINARY_SLIM_REMAP	0x00
+#endif
+
 /*
  * Header info for block data appended to an XLOG record.
  *
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index b97387c6d4..a59a3da69f 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -61,6 +61,9 @@ extern PGDLLIMPORT HotStandbyState standbyState;

 extern bool XLogHaveInvalidPages(void);
+#ifdef USE_UMBRA
+extern void XLogLogMissingRelationMetadata(RelFileLocator locator);
+#endif
 extern void XLogCheckInvalidPages(void);

 extern void XLogDropRelation(RelFileLocator rlocator, ForkNumber forknum);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 70f619a6d6..3666e1d702 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -32,6 +32,7 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
 extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 								ForkNumber forkNum, char relpersistence);
 extern bool RelFileLocatorSkippingWAL(RelFileLocator rlocator);
+extern bool RelFileLocatorWasTruncated(RelFileLocator rlocator);
 extern Size EstimatePendingSyncsSpace(void);
 extern void SerializePendingSyncs(Size maxSize, char *startAddress);
 extern void RestorePendingSyncs(char *startAddress);
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 0abe8ff1a1..a527f446f2 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -63,8 +63,12 @@ tests += {
       't/052_checkpoint_segment_missing.pl',
       't/053_umbra_map_superblock_watermark.pl',
       't/054_umbra_map_fork_policy.pl',
+      't/056_umbra_truncate_superblock.pl',
       't/061_umbra_fsm_vm_map_translation.pl',
+      't/062_umbra_truncate_drop_crash_matrix.pl',
       't/063_umbra_mainfork_head_unlink_checkpoint.pl',
+      't/066_umbra_truncate_redo.pl',
+      't/071_umbra_skip_wal_dense_map.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/056_umbra_truncate_superblock.pl b/src/test/recovery/t/056_umbra_truncate_superblock.pl
new file mode 100644
index 0000000000..db8d34c7a0
--- /dev/null
+++ b/src/test/recovery/t/056_umbra_truncate_superblock.pl
@@ -0,0 +1,82 @@
+# Verify TRUNCATE updates MAP superblock logical_nblocks and survives restart.
+#
+# In md mode, skip this test.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+sub u32le_from_hex
+{
+	my ($hex, $offset) = @_;
+	my $chunk = substr($hex, $offset * 2, 8);
+	my @b = ($chunk =~ /../g);
+
+	return hex($b[0]) +
+	  (hex($b[1]) << 8) +
+	  (hex($b[2]) << 16) +
+	  (hex($b[3]) << 24);
+}
+
+my $node = PostgreSQL::Test::Cluster->new('master');
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq{
+autovacuum = off
+});
+$node->start();
+
+$node->safe_psql(
+	'postgres', q{
+CREATE TABLE umb_truncate_t(a int, b text);
+INSERT INTO umb_truncate_t
+SELECT g, repeat('t', 400) FROM generate_series(1, 20000) g;
+CHECKPOINT;
+});
+
+my $map_super_hex = $node->safe_psql(
+	'postgres',
+	q{SELECT encode(pg_read_binary_file(pg_relation_filepath('umb_truncate_t') || '_map', 0, 64, true), 'hex');}
+);
+
+my $logical_before = u32le_from_hex($map_super_hex, 40);
+cmp_ok($logical_before, '>', 0, 'logical_nblocks_main is non-zero before TRUNCATE');
+
+$node->safe_psql(
+	'postgres', q{
+TRUNCATE umb_truncate_t;
+CHECKPOINT;
+});
+
+my $logical_size_after = $node->safe_psql(
+	'postgres',
+	q{SELECT pg_relation_size('umb_truncate_t') / current_setting('block_size')::int;}
+);
+
+$map_super_hex = $node->safe_psql(
+	'postgres',
+	q{SELECT encode(pg_read_binary_file(pg_relation_filepath('umb_truncate_t') || '_map', 0, 64, true), 'hex');}
+);
+my $logical_after = u32le_from_hex($map_super_hex, 40);
+
+is($logical_size_after, '0', 'relation size is zero blocks after TRUNCATE');
+is($logical_after, 0, 'superblock logical_nblocks_main is zero after TRUNCATE');
+
+$node->stop('immediate');
+$node->start();
+
+$map_super_hex = $node->safe_psql(
+	'postgres',
+	q{SELECT encode(pg_read_binary_file(pg_relation_filepath('umb_truncate_t') || '_map', 0, 64, true), 'hex');}
+);
+my $logical_after_restart = u32le_from_hex($map_super_hex, 40);
+
+is($logical_after_restart, 0,
+	'superblock logical_nblocks_main remains zero after restart');
+
+done_testing();
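
The test above reads the first 64 bytes of the MAP fork as hex and decodes a little-endian 32-bit field at byte offset 40 with u32le_from_hex(). For readers more used to the C side, the same decoding can be sketched as below. toy_hex_nibble() and toy_u32le_from_hex() are hypothetical helpers, and the offset 40 is taken from the test itself, not from a superblock layout shown in this part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int
toy_hex_nibble(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Decode the 4 bytes starting at byte offset "offset" of a hex dump as a
 * little-endian uint32. Returns 0 on success, -1 on short or invalid input.
 */
static int
toy_u32le_from_hex(const char *hex, size_t offset, uint32_t *out)
{
	uint32_t	value = 0;
	size_t		len;

	assert(hex != NULL && out != NULL);
	len = strlen(hex);
	if (offset * 2 + 8 > len)
		return -1;

	for (int i = 0; i < 4; i++)
	{
		int			hi = toy_hex_nibble(hex[offset * 2 + 2 * i]);
		int			lo = toy_hex_nibble(hex[offset * 2 + 2 * i + 1]);

		if (hi < 0 || lo < 0)
			return -1;
		/* Byte i is the i-th least significant byte. */
		value |= (uint32_t) ((hi << 4) | lo) << (8 * i);
	}

	*out = value;
	return 0;
}
```

Unlike the Perl helper, this version reports short input explicitly instead of decoding undefined bytes.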
diff --git a/src/test/recovery/t/062_umbra_truncate_drop_crash_matrix.pl b/src/test/recovery/t/062_umbra_truncate_drop_crash_matrix.pl
new file mode 100644
index 0000000000..18dd441d5e
--- /dev/null
+++ b/src/test/recovery/t/062_umbra_truncate_drop_crash_matrix.pl
@@ -0,0 +1,108 @@
+# Verify UMBRA truncate/drop behavior across crash restart.
+#
+# Matrix intent:
+# - TRUNCATE result survives crash restart (logical size and superblock logical_nblocks)
+# - DROP result survives crash restart
+# - dropped relation MAP fork disappears after a post-restart checkpoint
+#
+# In md mode, skip this test.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+sub u32le_from_hex
+{
+	my ($hex, $offset) = @_;
+	my $chunk = substr($hex, $offset * 2, 8);
+	my @b = ($chunk =~ /../g);
+
+	return hex($b[0]) +
+	  (hex($b[1]) << 8) +
+	  (hex($b[2]) << 16) +
+	  (hex($b[3]) << 24);
+}
+
+my $node = PostgreSQL::Test::Cluster->new('master');
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq{
+autovacuum = off
+});
+$node->start();
+
+$node->safe_psql(
+	'postgres', q{
+CREATE TABLE umb_mx_trunc_t(id int, payload text);
+INSERT INTO umb_mx_trunc_t
+SELECT g, repeat('a', 500) FROM generate_series(1, 18000) g;
+CREATE TABLE umb_mx_drop_t(id int, payload text);
+INSERT INTO umb_mx_drop_t
+SELECT g, repeat('b', 500) FROM generate_series(1, 18000) g;
+SELECT COALESCE(encode(pg_read_binary_file(pg_relation_filepath('umb_mx_trunc_t') || '_map', 0, 1, true), 'hex'), '') <> '';
+});
+
+cmp_ok(
+	$node->safe_psql(
+		'postgres',
+		q{SELECT pg_relation_size('umb_mx_trunc_t') / current_setting('block_size')::int;}),
+	'>',
+	0,
+	'truncate relation logical size is non-zero before TRUNCATE');
+
+my $drop_map_path = $node->safe_psql(
+	'postgres',
+	q{SELECT pg_relation_filepath('umb_mx_drop_t') || '_map';}
+);
+
+$node->safe_psql(
+	'postgres', q{
+TRUNCATE umb_mx_trunc_t;
+DROP TABLE umb_mx_drop_t;
+});
+
+$node->stop('immediate');
+$node->start();
+
+is($node->safe_psql('postgres', q{SELECT count(*) FROM umb_mx_trunc_t;}), '0',
+	'TRUNCATE result survives crash restart');
+is($node->safe_psql(
+		'postgres',
+		q{SELECT pg_relation_size('umb_mx_trunc_t') / current_setting('block_size')::int;}),
+	'0',
+	'truncated relation logical size is zero blocks after restart');
+
+my $trunc_map_hex_after = $node->safe_psql(
+	'postgres',
+	q{SELECT encode(pg_read_binary_file(pg_relation_filepath('umb_mx_trunc_t') || '_map', 0, 64, true), 'hex');}
+);
+my $trunc_logical_after = u32le_from_hex($trunc_map_hex_after, 40);
+is($trunc_logical_after, 0,
+	'superblock logical_nblocks_main remains zero after crash restart');
+
+is($node->safe_psql(
+		'postgres',
+		q{SELECT count(*) FROM pg_class WHERE relname = 'umb_mx_drop_t';}),
+	'0',
+	'DROP result survives crash restart');
+
+$node->safe_psql('postgres', q{CHECKPOINT;});
+ok($node->poll_query_until('postgres',
+		"SELECT COALESCE(encode(pg_read_binary_file('$drop_map_path', 0, 1, true), 'hex'), '') = '';",
+		't'),
+	'dropped relation MAP fork disappears after post-restart checkpoint');
+
+$node->safe_psql(
+	'postgres', q{
+INSERT INTO umb_mx_trunc_t
+SELECT g, repeat('c', 300) FROM generate_series(1, 1000) g;
+});
+is($node->safe_psql('postgres', q{SELECT count(*) FROM umb_mx_trunc_t;}), '1000',
+	'truncated relation remains writable after crash restart');
+
+done_testing();
diff --git a/src/test/recovery/t/066_umbra_truncate_redo.pl b/src/test/recovery/t/066_umbra_truncate_redo.pl
new file mode 100644
index 0000000000..5cef03be87
--- /dev/null
+++ b/src/test/recovery/t/066_umbra_truncate_redo.pl
@@ -0,0 +1,64 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+my $node = PostgreSQL::Test::Cluster->new('umbra_truncate');
+
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq[
+wal_level = 'replica'
+autovacuum = off
+]);
+$node->start();
+
+$node->safe_psql(
+	'postgres', q[
+CREATE TABLE umbra_trunc(i int);
+INSERT INTO umbra_trunc
+SELECT generate_series(1, 1000);
+CHECKPOINT;
+TRUNCATE umbra_trunc;
+INSERT INTO umbra_trunc
+SELECT generate_series(1, 10);
+UPDATE umbra_trunc
+SET i = i + 100;
+]);
+
+$node->stop('immediate');
+ok($node->start(), 'restart after truncate crash');
+
+is($node->safe_psql('postgres',
+		'SELECT count(*), sum(i), min(i), max(i) FROM umbra_trunc'),
+	'10|1055|101|110',
+	'truncate redo preserved only post-truncate rows');
+
+# Exercise normal mapped writes after crash recovery.  The table should no
+# longer behave as if its logical size were 0, and follow-up restart should
+# keep both the recovered rows and the new rows.
+$node->safe_psql(
+	'postgres', q[
+UPDATE umbra_trunc
+SET i = i + 1000
+WHERE i <= 105;
+INSERT INTO umbra_trunc VALUES (9999);
+CHECKPOINT;
+]);
+
+$node->stop('immediate');
+ok($node->start(), 'restart after post-recovery writes');
+
+is($node->safe_psql('postgres',
+		'SELECT count(*), sum(i), min(i), max(i) FROM umbra_trunc'),
+	'11|16054|106|9999',
+	'post-recovery mapped writes survived second restart');
+
+done_testing();
diff --git a/src/test/recovery/t/071_umbra_skip_wal_dense_map.pl b/src/test/recovery/t/071_umbra_skip_wal_dense_map.pl
new file mode 100644
index 0000000000..a7ca06dd5b
--- /dev/null
+++ b/src/test/recovery/t/071_umbra_skip_wal_dense_map.pl
@@ -0,0 +1,65 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+my $node = PostgreSQL::Test::Cluster->new('umbra_skip_wal_dense_map');
+
+$node->init;
+$node->append_conf(
+	'postgresql.conf', qq[
+wal_level = 'minimal'
+autovacuum = off
+shared_buffers = '256MB'
+max_wal_size = '4GB'
+min_wal_size = '1GB'
+checkpoint_timeout = '1h'
+]);
+$node->start();
+
+my $start_lsn =
+  $node->safe_psql('postgres', q[SELECT pg_current_wal_lsn();]);
+
+$node->safe_psql('postgres', q[
+CREATE TABLE umbra_skipwal_dense AS
+SELECT g::bigint AS id, repeat('x', 200) AS pad
+FROM generate_series(1, 50000) AS g;
+]);
+
+my $count =
+  $node->safe_psql('postgres', q[SELECT count(*) FROM umbra_skipwal_dense;]);
+is($count, '50000', 'skip-WAL-created relation is readable before restart');
+
+my $end_lsn =
+  $node->safe_psql('postgres', q[SELECT pg_current_wal_lsn();]);
+
+my ($dump_stdout, $dump_stderr) = run_command(
+	[
+		'pg_waldump', '-p', $node->data_dir . '/pg_wal',
+		'--start',   $start_lsn,
+		'--end',     $end_lsn
+	]);
+is($dump_stderr, '', 'pg_waldump raw dump completed without stderr');
+
+my @dense_lines =
+  grep { /desc: SKIP_WAL_DENSE_MAP/ }
+  split /\n/, $dump_stdout;
+ok(@dense_lines > 0,
+   'raw WAL dump contains skip-WAL dense MAP records');
+
+my @main_dense_lines =
+  grep { /fork 0 nblocks ([1-9][0-9]*)/ }
+  @dense_lines;
+ok(@main_dense_lines > 0,
+   'skip-WAL dense MAP record carries concrete MAIN fork nblocks');
+
+$node->stop();
+
+done_testing();
-- 
2.50.1 (Apple Git-155)
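
A note for readers following the TAP tests in this patch: they check logical_nblocks_main by hex-dumping the first 64 bytes of the _map fork and decoding a little-endian 32-bit value at byte offset 40 (the Perl helper u32le_from_hex). The following is only a minimal C sketch of that same decoding, not code from the patch. The function names and the 0 / -1 return convention are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert one hex digit to its value, or return -1 if it is not a hex digit. */
static int
hex_nibble(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Decode a little-endian uint32 that starts at byte offset "offset" of the
 * binary data represented by the hex string "hex" (two hex digits per byte).
 *
 * Returns 0 on success, or -1 if the string is too short or contains a
 * non-hex character.  (Hypothetical helper; the patch only has the Perl
 * version.)
 */
static int
u32le_from_hex(const char *hex, size_t offset, uint32_t *out)
{
	size_t		len;
	uint32_t	value = 0;

	assert(hex != NULL);
	assert(out != NULL);

	len = strlen(hex);
	if (len < offset * 2 + 8)
		return -1;

	for (int i = 0; i < 4; i++)
	{
		int			hi = hex_nibble(hex[offset * 2 + i * 2]);
		int			lo = hex_nibble(hex[offset * 2 + i * 2 + 1]);

		if (hi < 0 || lo < 0)
			return -1;

		/* byte i is the i-th least significant byte */
		value |= (uint32_t) ((hi << 4) | lo) << (8 * i);
	}

	*out = value;
	return 0;
}
```

With this sketch, the hex string "78563412" decoded at offset 0 yields 0x12345678, and the tests' superblock check corresponds to calling it with offset 40 on the 64-byte dump.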

#10

Mingwei Jia

i@nayishan.top

22 days ago

In reply to: Mingwei Jia (#4)

[RFC PATCH v2 RESEND 08/10] umbra: add patch 7 checkpoint-boundary FPW replacement and block-reference remap

---
src/backend/access/brin/brin.c | 2 +-
src/backend/access/brin/brin_pageops.c | 4 +-
src/backend/access/brin/brin_revmap.c | 2 +-
src/backend/access/gin/gindatapage.c | 2 +-
src/backend/access/gin/ginfast.c | 8 +-
src/backend/access/gin/ginutil.c | 2 +-
src/backend/access/gist/gistxlog.c | 2 +-
src/backend/access/hash/hashovfl.c | 4 +-
src/backend/access/hash/hashpage.c | 16 +-
src/backend/access/heap/heapam.c | 6 +-
src/backend/access/heap/heapam_handler.c | 10 +-
src/backend/access/nbtree/nbtinsert.c | 8 +-
src/backend/access/nbtree/nbtpage.c | 14 +-
src/backend/access/rmgrdesc/umbradesc.c | 24 +
src/backend/access/rmgrdesc/xlogdesc.c | 24 +
src/backend/access/spgist/spgdoinsert.c | 14 +-
src/backend/access/transam/umbra_xlog.c | 96 +++
src/backend/access/transam/xloginsert.c | 744 +++++++++++++++++-
src/backend/access/transam/xlogreader.c | 40 +
src/backend/access/transam/xlogutils.c | 182 ++++-
src/backend/backup/basebackup.c | 22 +-
src/backend/commands/sequence.c | 6 +-
src/backend/commands/tablecmds.c | 11 +-
src/backend/storage/map/map.c | 11 +
src/backend/storage/map/mapsuper.c | 99 ++-
src/backend/storage/smgr/bulk_write.c | 53 +-
src/backend/storage/smgr/smgr.c | 5 -
src/backend/storage/smgr/umbra.c | 301 ++++++-
src/backend/utils/adt/dbsize.c | 14 +-
src/bin/pg_waldump/.gitignore | 1 +
src/bin/pg_waldump/Makefile | 9 +
src/include/access/umbra_xlog.h | 39 +
src/include/access/xlogreader.h | 11 +
src/include/access/xlogrecord.h | 33 +
src/test/recovery/meson.build | 8 +
.../t/057_umbra_remap_crash_consistency.pl | 74 ++
.../t/058_umbra_2pc_remap_recovery.pl | 90 +++
src/test/recovery/t/067_umbra_remap_redo.pl | 90 +++
...68_umbra_old_baseline_checkpoint_window.pl | 85 ++
.../t/069_umbra_range_remap_zeroextend.pl | 101 +++
.../t/070_umbra_hash_birth_block_remap.pl | 66 ++
.../t/072_umbra_ordinary_slim_block_remap.pl | 69 ++
.../recovery/t/074_umbra_torn_page_remap.pl | 261 ++++++
43 files changed, 2541 insertions(+), 122 deletions(-)
create mode 100644 src/test/recovery/t/057_umbra_remap_crash_consistency.pl
create mode 100644 src/test/recovery/t/058_umbra_2pc_remap_recovery.pl
create mode 100644 src/test/recovery/t/067_umbra_remap_redo.pl
create mode 100644 src/test/recovery/t/068_umbra_old_baseline_checkpoint_window.pl
create mode 100644 src/test/recovery/t/069_umbra_range_remap_zeroextend.pl
create mode 100644 src/test/recovery/t/070_umbra_hash_birth_block_remap.pl
create mode 100644 src/test/recovery/t/072_umbra_ordinary_slim_block_remap.pl
create mode 100644 src/test/recovery/t/074_umbra_torn_page_remap.pl
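
Note on the compact range record in this patch: the redo case for XLOG_UMBRA_RANGE_REMAP_COMPACT rebuilds a dense array of physical block numbers from first_pblkno and count before passing it to UmApplyReservedRangeRemap(). A standalone sketch of just that expansion step follows. It is not part of the patch: CompactRange and expand_compact_range are hypothetical names, malloc stands in for palloc, and the BlockNumber typedef is only meant to mirror PostgreSQL's 32-bit block number.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t BlockNumber;	/* mirrors PostgreSQL's 32-bit block number */

/* Hypothetical stand-in for the fields of the compact record used here. */
typedef struct CompactRange
{
	BlockNumber first_lblkno;	/* first logical block of the range */
	BlockNumber first_pblkno;	/* physical block mapped to first_lblkno */
	uint16_t	count;			/* number of consecutive blocks */
} CompactRange;

/*
 * Expand a compact range into an array where entry i holds the physical
 * block for logical block first_lblkno + i, i.e. first_pblkno + i.
 *
 * Returns a malloc'd array of range->count entries that the caller must
 * free, or NULL if count is zero or allocation fails.
 */
static BlockNumber *
expand_compact_range(const CompactRange *range)
{
	BlockNumber *pblknos;

	assert(range != NULL);

	if (range->count == 0)
		return NULL;

	pblknos = malloc(sizeof(BlockNumber) * range->count);
	if (pblknos == NULL)
		return NULL;

	for (unsigned int i = 0; i < range->count; i++)
		pblknos[i] = range->first_pblkno + i;

	return pblknos;
}
```

The point of the compact form is that a physically contiguous range needs only three integers in WAL, while the non-compact XLOG_UMBRA_RANGE_REMAP record carries one entry per block.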

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index bdb30752e0..9c484f789e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1148,7 +1148,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)

 		XLogBeginInsert();
 		XLogRegisterData(&xlrec, SizeOfBrinCreateIdx);
-		XLogRegisterBuffer(0, meta, REGBUF_WILL_INIT | REGBUF_STANDARD);
+		XLogRegisterBuffer(0, meta, REGBUF_WILL_INIT_BIRTH | REGBUF_STANDARD);

 		recptr = XLogInsert(RM_BRIN_ID, XLOG_BRIN_CREATE_INDEX);

diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 7da97bec43..5acbd7d358 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -283,7 +283,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 			/* new page */
 			XLogRegisterData(&xlrec, SizeOfBrinUpdate);

-			XLogRegisterBuffer(0, newbuf, REGBUF_STANDARD | (extended ? REGBUF_WILL_INIT : 0));
+			XLogRegisterBuffer(0, newbuf, REGBUF_STANDARD | (extended ? REGBUF_WILL_INIT_BIRTH : 0));
 			XLogRegisterBufData(0, newtup, newsz);

 			/* revmap page */
@@ -435,7 +435,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 		XLogBeginInsert();
 		XLogRegisterData(&xlrec, SizeOfBrinInsert);

-		XLogRegisterBuffer(0, *buffer, REGBUF_STANDARD | (extended ? REGBUF_WILL_INIT : 0));
+		XLogRegisterBuffer(0, *buffer, REGBUF_STANDARD | (extended ? REGBUF_WILL_INIT_BIRTH : 0));
 		XLogRegisterBufData(0, tup, itemsz);

 		XLogRegisterBuffer(1, revmapbuf, 0);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 233355cb2d..951da8f435 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -630,7 +630,7 @@ revmap_physical_extend(BrinRevmap *revmap)
 		XLogRegisterData(&xlrec, SizeOfBrinRevmapExtend);
 		XLogRegisterBuffer(0, revmap->rm_metaBuf, REGBUF_STANDARD);

-		XLogRegisterBuffer(1, buf, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(1, buf, REGBUF_WILL_INIT_BIRTH);

 		recptr = XLogInsert(RM_BRIN_ID, XLOG_BRIN_REVMAP_EXTEND);
 		PageSetLSN(metapage, recptr);
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index c5d7db2807..e4e6e5ff3a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -1848,7 +1848,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,

 		XLogRegisterData(GinDataLeafPageGetPostingList(page),
 						 rootsize);
-		XLogRegisterBuffer(0, buffer, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(0, buffer, REGBUF_WILL_INIT_BIRTH);

 		recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_CREATE_PTREE);
 		PageSetLSN(page, recptr);
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index f50848eb65..c1ee6cc4ab 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -124,7 +124,7 @@ writeListPage(Relation index, Buffer buffer,
 		XLogBeginInsert();
 		XLogRegisterData(&data, sizeof(ginxlogInsertListPage));

-		XLogRegisterBuffer(0, buffer, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(0, buffer, REGBUF_WILL_INIT_BIRTH);
 		XLogRegisterBufData(0, workspace.data, size);

 		recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_INSERT_LISTPAGE);
@@ -430,7 +430,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)

 		memcpy(&data.metadata, metadata, sizeof(GinMetaPageData));

-		XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT | REGBUF_STANDARD);
+		XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT_BIRTH | REGBUF_STANDARD);
 		XLogRegisterData(&data, sizeof(ginxlogUpdateMeta));

 		recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_UPDATE_META_PAGE);
@@ -640,9 +640,9 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,

 			XLogBeginInsert();
 			XLogRegisterBuffer(0, metabuffer,
-							   REGBUF_WILL_INIT | REGBUF_STANDARD);
+							   REGBUF_WILL_INIT_BIRTH | REGBUF_STANDARD);
 			for (i = 0; i < data.ndeleted; i++)
-				XLogRegisterBuffer(i + 1, buffers[i], REGBUF_WILL_INIT);
+				XLogRegisterBuffer(i + 1, buffers[i], REGBUF_WILL_INIT_BIRTH);

 			memcpy(&data.metadata, metadata, sizeof(GinMetaPageData));

diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index d3351fbe8a..1498f210e0 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -641,7 +641,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)

 		XLogBeginInsert();
 		XLogRegisterData(&data, sizeof(ginxlogUpdateMeta));
-		XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT | REGBUF_STANDARD);
+		XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT_BIRTH | REGBUF_STANDARD);

 		recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_UPDATE_META_PAGE);
 		PageSetLSN(metapage, recptr);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index ae538dc81c..65af1fcfb0 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -528,7 +528,7 @@ gistXLogSplit(bool page_is_leaf,
 	i = 1;
 	for (ptr = dist; ptr; ptr = ptr->next)
 	{
-		XLogRegisterBuffer(i, ptr->buffer, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(i, ptr->buffer, REGBUF_WILL_INIT_BIRTH);
 		XLogRegisterBufData(i, &(ptr->block.num), sizeof(int));
 		XLogRegisterBufData(i, ptr->list, ptr->lenlist);
 		i++;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index dbc57ef958..a23a0c501e 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -390,7 +390,7 @@ found:
 		XLogBeginInsert();
 		XLogRegisterData(&xlrec, SizeOfHashAddOvflPage);

-		XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT_BIRTH);
 		XLogRegisterBufData(0, &pageopaque->hasho_bucket, sizeof(Bucket));

 		XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
@@ -402,7 +402,7 @@ found:
 		}

 		if (BufferIsValid(newmapbuf))
-			XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT);
+			XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT_BIRTH);

 		XLogRegisterBuffer(4, metabuf, REGBUF_STANDARD);
 		XLogRegisterBufData(4, &metap->hashm_firstfree, sizeof(uint32));
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 8099b0d021..3e78ddfd0f 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -199,6 +199,7 @@ _hash_getnewbuf(Relation rel, BlockNumber blkno, ForkNumber forkNum)
 {
 	BlockNumber nblocks = RelationGetNumberOfBlocksInFork(rel, forkNum);
 	Buffer		buf;
+	bool		extend_path;

 	if (blkno == P_NEW)
 		elog(ERROR, "hash AM does not use P_NEW");
@@ -209,6 +210,7 @@ _hash_getnewbuf(Relation rel, BlockNumber blkno, ForkNumber forkNum)
 	/* smgr insists we explicitly extend the relation */
 	if (blkno == nblocks)
 	{
+		extend_path = true;
 		buf = ExtendBufferedRel(BMR_REL(rel), forkNum, NULL,
 								EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
 		if (BufferGetBlockNumber(buf) != blkno)
@@ -217,6 +219,7 @@ _hash_getnewbuf(Relation rel, BlockNumber blkno, ForkNumber forkNum)
 	}
 	else
 	{
+		extend_path = false;
 		buf = ReadBufferExtended(rel, forkNum, blkno, RBM_ZERO_AND_LOCK,
 								 NULL);
 	}
@@ -395,7 +398,8 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)

 		XLogBeginInsert();
 		XLogRegisterData(&xlrec, SizeOfHashInitMetaPage);
-		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+		XLogRegisterBuffer(0, metabuf,
+						   REGBUF_WILL_INIT_BIRTH | REGBUF_STANDARD);

 		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE);

@@ -427,9 +431,9 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 		_hash_initbuf(buf, metap->hashm_maxbucket, i, LH_BUCKET_PAGE, false);
 		MarkBufferDirty(buf);

-		if (use_wal)
-			log_newpage(&rel->rd_locator,
-						forkNum,
+			if (use_wal)
+				log_newpage(&rel->rd_locator,
+							forkNum,
 						blkno,
 						BufferGetPage(buf),
 						true);
@@ -469,7 +473,7 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)

 		XLogBeginInsert();
 		XLogRegisterData(&xlrec, SizeOfHashInitBitmapPage);
-		XLogRegisterBuffer(0, bitmapbuf, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(0, bitmapbuf, REGBUF_WILL_INIT_BIRTH);

 		/*
 		 * This is safe only because nobody else can be modifying the index at
@@ -910,7 +914,7 @@ restart_expand:
 		XLogBeginInsert();

 		XLogRegisterBuffer(0, buf_oblkno, REGBUF_STANDARD);
-		XLogRegisterBuffer(1, buf_nblkno, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(1, buf_nblkno, REGBUF_WILL_INIT_BIRTH);
 		XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);

 		if (metap_update_masks)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index abfd8e8970..dbf4826c0d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2111,7 +2111,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 			PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
 		{
 			info |= XLOG_HEAP_INIT_PAGE;
-			bufflags |= REGBUF_WILL_INIT;
+			bufflags |= REGBUF_WILL_INIT_BIRTH;
 		}

 		xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
@@ -2561,7 +2561,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 			if (init)
 			{
 				info |= XLOG_HEAP_INIT_PAGE;
-				bufflags |= REGBUF_WILL_INIT;
+				bufflags |= REGBUF_WILL_INIT_BIRTH;
 			}

 			/*
@@ -8897,7 +8897,7 @@ log_heap_update(Relation reln, Buffer oldbuf,

 	bufflags = REGBUF_STANDARD;
 	if (init)
-		bufflags |= REGBUF_WILL_INIT;
+		bufflags |= REGBUF_WILL_INIT_BIRTH;
 	if (need_tuple_data)
 		bufflags |= REGBUF_KEEP_DATA;

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 20d3b46e06..7c45b50aca 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -569,7 +569,8 @@ heapam_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
 	{
 		if (smgrexists(RelationGetSmgr(rel), forkNum))
 		{
-			smgrcreate(dstrel, forkNum, false);
+			if (!smgrisinternalfork(forkNum))
+				smgrcreate(dstrel, forkNum, false);

 			/*
 			 * WAL log creation if the relation is persistent, or this is the
@@ -579,11 +580,14 @@ heapam_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
 				(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
 				 forkNum == INIT_FORKNUM))
 				log_smgrcreate(newrlocator, forkNum);
-			RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
-								rel->rd_rel->relpersistence);
+			if (!smgrisinternalfork(forkNum))
+				RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
+									rel->rd_rel->relpersistence);
 		}
 	}

+	smgrcopyrelationmetadata(RelationGetSmgr(rel), dstrel,
+							 rel->rd_rel->relpersistence);

 	/* drop old relation, and close new one */
 	RelationDropStorage(rel);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index c8af97dd23..e754de9679 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1377,7 +1377,7 @@ _bt_insertonpg(Relation rel,
 					xlmeta.allequalimage = metad->btm_allequalimage;

 					XLogRegisterBuffer(2, metabuf,
-									   REGBUF_WILL_INIT | REGBUF_STANDARD);
+									   REGBUF_WILL_INIT_BIRTH | REGBUF_STANDARD);
 					XLogRegisterBufData(2, &xlmeta,
 										sizeof(xl_btree_metadata));
 				}
@@ -2011,7 +2011,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
 		XLogRegisterData(&xlrec, SizeOfBtreeSplit);

 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-		XLogRegisterBuffer(1, rbuf, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(1, rbuf, REGBUF_WILL_INIT_BIRTH);
 		/* Log original right sibling, since we've changed its prev-pointer */
 		if (!isrightmost)
 			XLogRegisterBuffer(2, sbuf, REGBUF_STANDARD);
@@ -2612,9 +2612,9 @@ _bt_newlevel(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf)
 		XLogBeginInsert();
 		XLogRegisterData(&xlrec, SizeOfBtreeNewroot);

-		XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT_BIRTH);
 		XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
-		XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+		XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT_BIRTH | REGBUF_STANDARD);

 		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
 		md.version = metad->btm_version;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 0547038616..cb6092da52 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -289,7 +289,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 		xl_btree_metadata md;

 		XLogBeginInsert();
-		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT_BIRTH | REGBUF_STANDARD);

 		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
 		md.version = metad->btm_version;
@@ -479,8 +479,8 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
 			xl_btree_metadata md;

 			XLogBeginInsert();
-			XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
-			XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+			XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT_BIRTH);
+			XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT_BIRTH | REGBUF_STANDARD);

 			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
 			md.version = metad->btm_version;
@@ -2294,7 +2294,7 @@ _bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
 			xlrec.topparent = InvalidBlockNumber;

 		XLogBeginInsert();
-		XLogRegisterBuffer(0, leafbuf, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(0, leafbuf, REGBUF_WILL_INIT_BIRTH);
 		XLogRegisterBuffer(1, subtreeparent, REGBUF_STANDARD);

 		page = BufferGetPage(leafbuf);
@@ -2713,12 +2713,12 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,

 		XLogBeginInsert();

-		XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT_BIRTH);
 		if (BufferIsValid(lbuf))
 			XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
 		XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
 		if (target != leafblkno)
-			XLogRegisterBuffer(3, leafbuf, REGBUF_WILL_INIT);
+			XLogRegisterBuffer(3, leafbuf, REGBUF_WILL_INIT_BIRTH);

 		/* information stored on the target/to-be-unlinked block */
 		xlrec.leftsib = leftsib;
@@ -2735,7 +2735,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,

 		if (BufferIsValid(metabuf))
 		{
-			XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+			XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT_BIRTH | REGBUF_STANDARD);

 			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
 			xlmeta.version = metad->btm_version;
diff --git a/src/backend/access/rmgrdesc/umbradesc.c b/src/backend/access/rmgrdesc/umbradesc.c
index 6bad4bb38e..a6b3e6e55e 100644
--- a/src/backend/access/rmgrdesc/umbradesc.c
+++ b/src/backend/access/rmgrdesc/umbradesc.c
@@ -47,6 +47,24 @@ umbra_desc(StringInfo buf, XLogReaderState *record)
 						 path.str, xlrec->lblkno, xlrec->old_pblkno,
 						 xlrec->new_pblkno);
 	}
+	else if (info == XLOG_UMBRA_RANGE_REMAP)
+	{
+		xl_umbra_range_remap *xlrec = (xl_umbra_range_remap *) rec;
+		RelPathStr	path = umbra_fork_relpath(xlrec->rlocator, xlrec->forknum);
+
+		appendStringInfo(buf, "%s count %u end_lblk %u",
+						 path.str, xlrec->count, xlrec->end_lblkno);
+	}
+	else if (info == XLOG_UMBRA_RANGE_REMAP_COMPACT)
+	{
+		xl_umbra_range_remap_compact *xlrec =
+			(xl_umbra_range_remap_compact *) rec;
+		RelPathStr	path = umbra_fork_relpath(xlrec->rlocator, xlrec->forknum);
+
+		appendStringInfo(buf, "%s compact first_lblk %u first_pblk %u count %u",
+						 path.str, xlrec->first_lblkno, xlrec->first_pblkno,
+						 xlrec->count);
+	}
 	else if (info == XLOG_UMBRA_SKIP_WAL_DENSE_MAP)
 	{
 		xl_umbra_skip_wal_dense_map *xlrec =
@@ -72,6 +90,12 @@ umbra_identify(uint8 info)
 		case XLOG_UMBRA_MAP_SET:
 			id = "MAP_SET";
 			break;
+		case XLOG_UMBRA_RANGE_REMAP:
+			id = "RANGE_REMAP";
+			break;
+		case XLOG_UMBRA_RANGE_REMAP_COMPACT:
+			id = "RANGE_REMAP_COMPACT";
+			break;
 		case XLOG_UMBRA_SKIP_WAL_DENSE_MAP:
 			id = "SKIP_WAL_DENSE_MAP";
 			break;
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 0fc4f48ca6..2c2e5f06c4 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -382,6 +382,30 @@ XLogRecGetBlockRefInfo(XLogReaderState *record, bool pretty,
 				}
 			}

+#ifdef USE_UMBRA
+			if (blkref->has_remap)
+			{
+				uint8		remap_format =
+					XLogRecGetInfo(record) & XLR_UMBRA_REMAP_FORMAT_MASK;
+
+				if (remap_format != 0)
+				{
+					appendStringInfo(buf,
+									 "; remap: unsupported format bits 0x%02X",
+									 remap_format);
+				}
+				else
+				{
+					appendStringInfo(buf,
+									 "; remap: old_pblk %u new_pblk %u logical_nblocks %u next_free_pblk %u",
+									 blkref->old_pblkno,
+									 blkref->new_pblkno,
+									 blkref->logical_nblocks,
+									 blkref->next_free_pblkno);
+				}
+			}
+#endif
+
 			if (pretty)
 				appendStringInfoChar(buf, '\n');
 		}
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 7c7371c69e..f8ede933e1 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -299,7 +299,7 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,

 		flags = REGBUF_STANDARD;
 		if (xlrec.newPage)
-			flags |= REGBUF_WILL_INIT;
+			flags |= REGBUF_WILL_INIT_BIRTH;
 		XLogRegisterBuffer(0, current->buffer, flags);
 		if (xlrec.offnumParent != InvalidOffsetNumber)
 			XLogRegisterBuffer(1, parent->buffer, REGBUF_STANDARD);
@@ -536,7 +536,7 @@ moveLeafs(Relation index, SpGistState *state,
 		XLogRegisterData(leafdata, leafptr - leafdata);

 		XLogRegisterBuffer(0, current->buffer, REGBUF_STANDARD);
-		XLogRegisterBuffer(1, nbuf, REGBUF_STANDARD | (xlrec.newPage ? REGBUF_WILL_INIT : 0));
+		XLogRegisterBuffer(1, nbuf, REGBUF_STANDARD | (xlrec.newPage ? REGBUF_WILL_INIT_BIRTH : 0));
 		XLogRegisterBuffer(2, parent->buffer, REGBUF_STANDARD);

 		recptr = XLogInsert(RM_SPGIST_ID, XLOG_SPGIST_MOVE_LEAFS);
@@ -1377,7 +1377,7 @@ doPickSplit(Relation index, SpGistState *state,
 		{
 			flags = REGBUF_STANDARD;
 			if (xlrec.initSrc)
-				flags |= REGBUF_WILL_INIT;
+				flags |= REGBUF_WILL_INIT_BIRTH;
 			XLogRegisterBuffer(0, saveCurrent.buffer, flags);
 		}

@@ -1386,14 +1386,14 @@ doPickSplit(Relation index, SpGistState *state,
 		{
 			flags = REGBUF_STANDARD;
 			if (xlrec.initDest)
-				flags |= REGBUF_WILL_INIT;
+				flags |= REGBUF_WILL_INIT_BIRTH;
 			XLogRegisterBuffer(1, newLeafBuffer, flags);
 		}

 		/* Inner page */
 		flags = REGBUF_STANDARD;
 		if (xlrec.initInner)
-			flags |= REGBUF_WILL_INIT;
+			flags |= REGBUF_WILL_INIT_BIRTH;
 		XLogRegisterBuffer(2, current->buffer, flags);

 		/* Parent page, if different from inner page */
@@ -1675,7 +1675,7 @@ spgAddNodeAction(Relation index, SpGistState *state,
 			/* new page */
 			flags = REGBUF_STANDARD;
 			if (xlrec.newPage)
-				flags |= REGBUF_WILL_INIT;
+				flags |= REGBUF_WILL_INIT_BIRTH;
 			XLogRegisterBuffer(1, current->buffer, flags);
 			/* parent page (if different from orig and new) */
 			if (xlrec.parentBlk == 2)
@@ -1874,7 +1874,7 @@ spgSplitNodeAction(Relation index, SpGistState *state,

 			flags = REGBUF_STANDARD;
 			if (xlrec.newPage)
-				flags |= REGBUF_WILL_INIT;
+				flags |= REGBUF_WILL_INIT_BIRTH;
 			XLogRegisterBuffer(1, newBuffer, flags);
 		}

diff --git a/src/backend/access/transam/umbra_xlog.c b/src/backend/access/transam/umbra_xlog.c
index 71c7ad7bb1..186eca102e 100644
--- a/src/backend/access/transam/umbra_xlog.c
+++ b/src/backend/access/transam/umbra_xlog.c
@@ -41,6 +41,54 @@ log_umbra_map_set(RelFileLocator rlocator, ForkNumber forknum,
 	return XLogInsert(RM_UMBRA_ID, XLOG_UMBRA_MAP_SET | XLR_SPECIAL_REL_UPDATE);
 }

+XLogRecPtr
+log_umbra_range_remap(RelFileLocator rlocator, ForkNumber forknum,
+					  uint16 count,
+					  const xl_umbra_range_remap_entry *entries)
+{
+	xl_umbra_range_remap xlrec;
+
+	Assert(count > 0);
+	Assert(entries != NULL);
+
+	xlrec.rlocator = rlocator;
+	xlrec.forknum = forknum;
+	xlrec.count = count;
+	xlrec.padding = 0;
+	xlrec.end_lblkno = entries[count - 1].lblkno;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, offsetof(xl_umbra_range_remap, entries));
+	XLogRegisterData((char *) entries,
+					 sizeof(xl_umbra_range_remap_entry) * count);
+
+	return XLogInsert(RM_UMBRA_ID, XLOG_UMBRA_RANGE_REMAP | XLR_SPECIAL_REL_UPDATE);
+}
+
+XLogRecPtr
+log_umbra_range_remap_compact(RelFileLocator rlocator, ForkNumber forknum,
+							  BlockNumber first_lblkno,
+							  BlockNumber first_pblkno,
+							  uint16 count)
+{
+	xl_umbra_range_remap_compact xlrec;
+
+	Assert(count > 0);
+
+	xlrec.rlocator = rlocator;
+	xlrec.forknum = forknum;
+	xlrec.count = count;
+	xlrec.padding = 0;
+	xlrec.first_lblkno = first_lblkno;
+	xlrec.first_pblkno = first_pblkno;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+	return XLogInsert(RM_UMBRA_ID,
+					  XLOG_UMBRA_RANGE_REMAP_COMPACT | XLR_SPECIAL_REL_UPDATE);
+}
+
 XLogRecPtr
 log_umbra_skip_wal_dense_map(RelFileLocator rlocator,
 							 uint16 count,
@@ -172,6 +220,54 @@ umbra_redo(XLogReaderState *record)
 			}
 			break;

+		case XLOG_UMBRA_RANGE_REMAP:
+			{
+				xl_umbra_range_remap *xlrec;
+				xl_umbra_range_remap_entry *entries;
+				SMgrRelation reln;
+				BlockNumber *pblknos;
+
+				xlrec = (xl_umbra_range_remap *) XLogRecGetData(record);
+				entries = xlrec->entries;
+				reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
+
+				if (!UmMetadataExists(reln))
+					break;
+
+				pblknos = palloc(sizeof(BlockNumber) * xlrec->count);
+				for (int i = 0; i < xlrec->count; i++)
+					pblknos[i] = entries[i].new_pblkno;
+
+				UmApplyReservedRangeRemap(reln, xlrec->forknum,
+										  entries[0].lblkno, xlrec->count,
+										  pblknos, record->EndRecPtr, true);
+				pfree(pblknos);
+			}
+			break;
+
+		case XLOG_UMBRA_RANGE_REMAP_COMPACT:
+			{
+				xl_umbra_range_remap_compact *xlrec;
+				SMgrRelation reln;
+				BlockNumber *pblknos;
+
+				xlrec = (xl_umbra_range_remap_compact *) XLogRecGetData(record);
+				reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
+
+				if (!UmMetadataExists(reln))
+					break;
+
+				pblknos = palloc(sizeof(BlockNumber) * xlrec->count);
+				for (int i = 0; i < xlrec->count; i++)
+					pblknos[i] = xlrec->first_pblkno + i;
+
+				UmApplyReservedRangeRemap(reln, xlrec->forknum,
+										  xlrec->first_lblkno, xlrec->count,
+										  pblknos, record->EndRecPtr, true);
+				pfree(pblknos);
+			}
+			break;
+
 		case XLOG_UMBRA_SKIP_WAL_DENSE_MAP:
 			{
 				xl_umbra_skip_wal_dense_map *xlrec;
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index f2e10b82b7..85baf69b2b 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -39,6 +39,12 @@
 #include "replication/origin.h"
 #include "storage/bufmgr.h"
 #include "storage/proc.h"
+#include "storage/smgr.h"
+#ifdef USE_UMBRA
+#include "storage/map.h"
+#include "storage/umbra.h"
+#include "storage/umfile.h"
+#endif
 #include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
 #include "utils/rel.h"
@@ -85,6 +91,18 @@ typedef struct
 	XLogRecData bkp_rdatas[2];	/* temporary rdatas used to hold references to
 								 * backup block data in XLogRecordAssemble() */

+#ifdef USE_UMBRA
+	bool		has_remap;		/* true if remap metadata is prepared */
+	bool		remap_in_record; /* true if current assembled record includes remap */
+	bool		remap_committed; /* true if mapping switch was committed after insert */
+	bool		wal_owns_firstborn; /* this WAL block owns first-born mapping */
+	SMgrRelation remap_reln;	/* cached relation handle for remap commit */
+	BlockNumber old_pblkno;		/* remap: old physical block number */
+	BlockNumber new_pblkno;		/* remap: new physical block number */
+	BlockNumber remap_logical_nblocks; /* assembled logical frontier payload */
+	BlockNumber remap_next_free_pblkno; /* assembled allocator frontier payload */
+#endif
+
 	/* buffer to store a compressed version of backup block image */
 	char		compressed_page[COMPRESS_BUFSIZE];
 } registered_buffer;
@@ -137,14 +155,159 @@ static bool begininsert_called = false;
 /* Memory context to hold the registered buffer and data references. */
 static MemoryContext xloginsert_cxt;

-static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
-									   XLogRecPtr RedoRecPtr, bool doPageWrites,
-									   XLogRecPtr *fpw_lsn, int *num_fpi,
-									   uint64 *fpi_bytes,
-									   bool *topxid_included);
+#ifndef USE_UMBRA
+static XLogRecData *XLogRecordAssembleMd(RmgrId rmid, uint8 info,
+										 XLogRecPtr RedoRecPtr, bool doPageWrites,
+										 XLogRecPtr *fpw_lsn, int *num_fpi,
+										 uint64 *fpi_bytes,
+										 bool *topxid_included);
+#endif
+#ifdef USE_UMBRA
+static XLogRecData *XLogRecordAssembleUmbra(RmgrId rmid, uint8 info,
+											XLogRecPtr RedoRecPtr, bool doPageWrites,
+											XLogRecPtr *fpw_lsn, int *num_fpi,
+											uint64 *fpi_bytes,
+											bool *topxid_included);
+static void XLogCommitBlockRemapsUmbra(XLogRecPtr record_endptr);
+static void XLogAbortBlockRemapsUmbra(void);
+static void XLogFillBlockRemapFrontierUmbra(registered_buffer *regbuf,
+											XLogRecordBlockRemapHeader *rbmh);
+#else
+static void XLogCommitBlockRemapsMd(XLogRecPtr record_endptr);
+#endif
 static bool XLogCompressBackupBlock(const PageData *page, uint16 hole_offset,
 									uint16 hole_length, void *dest, uint16 *dlen);

+typedef struct xlog_storage_mgr
+{
+	XLogRecData *(*xlog_record_assemble) (RmgrId rmid, uint8 info,
+										  XLogRecPtr RedoRecPtr,
+										  bool doPageWrites,
+										  XLogRecPtr *fpw_lsn, int *num_fpi,
+										  uint64 *fpi_bytes,
+										  bool *topxid_included);
+	void	(*xlog_insert_finish) (XLogRecPtr record_endptr);
+} xlog_storage_mgr;
+
+static const xlog_storage_mgr xlog_storage_mgr_f = {
+#ifdef USE_UMBRA
+	.xlog_record_assemble = XLogRecordAssembleUmbra,
+	.xlog_insert_finish = XLogCommitBlockRemapsUmbra,
+#else
+	.xlog_record_assemble = XLogRecordAssembleMd,
+	.xlog_insert_finish = XLogCommitBlockRemapsMd,
+#endif
+};
+
+#ifdef USE_UMBRA
+static void
+XLogFillBlockRemapFrontierUmbra(registered_buffer *regbuf,
+								XLogRecordBlockRemapHeader *rbmh)
+{
+	SMgrRelation	reln = regbuf->remap_reln;
+	UmbraFileContext *ctx;
+	BlockNumber		logical_nblocks = InvalidBlockNumber;
+	BlockNumber		next_free_pblk = InvalidBlockNumber;
+
+	Assert(rbmh != NULL);
+
+	if (reln == NULL)
+		reln = smgropen(regbuf->rlocator, INVALID_PROC_NUMBER);
+
+	ctx = umfile_ctx_acquire(reln->smgr_rlocator);
+
+	if (MapSBlockTryGetLogicalNblocks(ctx, regbuf->rlocator,
+									  regbuf->forkno, &logical_nblocks))
+		rbmh->logical_nblocks = Max(logical_nblocks, regbuf->block + 1);
+	else
+		rbmh->logical_nblocks = regbuf->block + 1;
+
+	if (MapSBlockTryGetNextFreePhysBlock(ctx, regbuf->rlocator,
+										 regbuf->forkno, &next_free_pblk))
+		rbmh->next_free_pblkno = Max(next_free_pblk,
+									 regbuf->new_pblkno + 1);
+	else
+		rbmh->next_free_pblkno = regbuf->new_pblkno + 1;
+
+	regbuf->remap_logical_nblocks = rbmh->logical_nblocks;
+	regbuf->remap_next_free_pblkno = rbmh->next_free_pblkno;
+}
+
+static void
+XLogCommitBlockRemapsUmbra(XLogRecPtr record_endptr)
+{
+	int			block_id;
+
+	for (block_id = 0; block_id < max_registered_block_id; block_id++)
+	{
+		registered_buffer *regbuf = &registered_buffers[block_id];
+		UmbraFileContext *ctx;
+
+		if (!regbuf->in_use || !regbuf->has_remap || !regbuf->remap_in_record)
+			continue;
+		if (regbuf->remap_reln == NULL)
+			continue;
+
+		ctx = umfile_ctx_acquire(regbuf->remap_reln->smgr_rlocator);
+
+		UmMapSetMapping(regbuf->remap_reln, regbuf->forkno, regbuf->block,
+						regbuf->new_pblkno, record_endptr);
+		if (regbuf->old_pblkno == InvalidBlockNumber)
+		{
+			MapSBlockBumpLogicalNblocks(ctx,
+										regbuf->rlocator,
+										regbuf->forkno,
+										regbuf->remap_logical_nblocks != InvalidBlockNumber ?
+										regbuf->remap_logical_nblocks :
+										regbuf->block + 1,
+										record_endptr);
+			smgrbumpcachednblocks(regbuf->remap_reln,
+								  regbuf->forkno,
+								  regbuf->block + 1);
+		}
+
+		MapSBlockBumpNextFreePhysBlock(ctx,
+									   regbuf->rlocator,
+									   regbuf->forkno,
+									   regbuf->remap_next_free_pblkno != InvalidBlockNumber ?
+									   regbuf->remap_next_free_pblkno :
+									   regbuf->new_pblkno + 1,
+									   record_endptr);
+		MapInflightRelease(regbuf->rlocator, regbuf->forkno,
+						   regbuf->block);
+		regbuf->wal_owns_firstborn = false;
+		regbuf->remap_committed = true;
+	}
+}
+
+static void
+XLogAbortBlockRemapsUmbra(void)
+{
+	int			block_id;
+
+	for (block_id = 0; block_id < max_registered_block_id; block_id++)
+	{
+		registered_buffer *regbuf = &registered_buffers[block_id];
+
+		if (!regbuf->in_use || !regbuf->has_remap)
+			continue;
+		if (regbuf->remap_committed)
+			continue;
+
+		MapInflightRelease(regbuf->rlocator, regbuf->forkno,
+						   regbuf->block);
+		regbuf->wal_owns_firstborn = false;
+		regbuf->remap_in_record = false;
+	}
+}
+#else
+static void
+XLogCommitBlockRemapsMd(XLogRecPtr record_endptr)
+{
+	(void) record_endptr;
+}
+#endif
+
 /*
  * Begin constructing a WAL record. This must be called before the
  * XLogRegister* functions and XLogInsert().
@@ -227,8 +390,25 @@ XLogResetInsertion(void)
 {
 	int			i;

+#ifdef USE_UMBRA
+	XLogAbortBlockRemapsUmbra();
+#endif
+
 	for (i = 0; i < max_registered_block_id; i++)
+	{
 		registered_buffers[i].in_use = false;
+#ifdef USE_UMBRA
+		registered_buffers[i].has_remap = false;
+		registered_buffers[i].remap_in_record = false;
+		registered_buffers[i].remap_committed = false;
+		registered_buffers[i].wal_owns_firstborn = false;
+		registered_buffers[i].remap_reln = NULL;
+		registered_buffers[i].old_pblkno = InvalidBlockNumber;
+		registered_buffers[i].new_pblkno = InvalidBlockNumber;
+		registered_buffers[i].remap_logical_nblocks = InvalidBlockNumber;
+		registered_buffers[i].remap_next_free_pblkno = InvalidBlockNumber;
+#endif
+	}

 	num_rdatas = 0;
 	max_registered_block_id = 0;
@@ -283,6 +463,17 @@ XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags)
 	regbuf->flags = flags;
 	regbuf->rdata_tail = (XLogRecData *) &regbuf->rdata_head;
 	regbuf->rdata_len = 0;
+#ifdef USE_UMBRA
+	regbuf->has_remap = false;
+	regbuf->remap_in_record = false;
+	regbuf->remap_committed = false;
+	regbuf->wal_owns_firstborn = false;
+	regbuf->remap_reln = NULL;
+	regbuf->old_pblkno = InvalidBlockNumber;
+	regbuf->new_pblkno = InvalidBlockNumber;
+	regbuf->remap_logical_nblocks = InvalidBlockNumber;
+	regbuf->remap_next_free_pblkno = InvalidBlockNumber;
+#endif

 	/*
 	 * Check that this page hasn't already been registered with some other
@@ -336,6 +527,17 @@ XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator, ForkNumber forknum,
 	regbuf->flags = flags;
 	regbuf->rdata_tail = (XLogRecData *) &regbuf->rdata_head;
 	regbuf->rdata_len = 0;
+#ifdef USE_UMBRA
+	regbuf->has_remap = false;
+	regbuf->remap_in_record = false;
+	regbuf->remap_committed = false;
+	regbuf->wal_owns_firstborn = false;
+	regbuf->remap_reln = NULL;
+	regbuf->old_pblkno = InvalidBlockNumber;
+	regbuf->new_pblkno = InvalidBlockNumber;
+	regbuf->remap_logical_nblocks = InvalidBlockNumber;
+	regbuf->remap_next_free_pblkno = InvalidBlockNumber;
+#endif

 	/*
 	 * Check that this page hasn't already been registered with some other
@@ -509,30 +711,37 @@ XLogInsert(RmgrId rmid, uint8 info)
 		return EndPos;
 	}

-	do
+	PG_TRY();
 	{
-		XLogRecPtr	RedoRecPtr;
-		bool		doPageWrites;
-		bool		topxid_included = false;
-		XLogRecPtr	fpw_lsn;
-		XLogRecData *rdt;
-		int			num_fpi = 0;
-		uint64		fpi_bytes = 0;
-
-		/*
-		 * Get values needed to decide whether to do full-page writes. Since
-		 * we don't yet have an insertion lock, these could change under us,
-		 * but XLogInsertRecord will recheck them once it has a lock.
-		 */
-		GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);
-
-		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
-								 &fpw_lsn, &num_fpi, &fpi_bytes,
-								 &topxid_included);
-
-		EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags, num_fpi,
-								  fpi_bytes, topxid_included);
-	} while (!XLogRecPtrIsValid(EndPos));
+		do
+		{
+			XLogRecPtr	RedoRecPtr;
+			bool		doPageWrites;
+			bool		topxid_included = false;
+			XLogRecPtr	fpw_lsn;
+			XLogRecData *rdt;
+			int			num_fpi = 0;
+			uint64		fpi_bytes = 0;
+
+			GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);
+
+			rdt = xlog_storage_mgr_f.xlog_record_assemble(rmid, info, RedoRecPtr,
+														  doPageWrites, &fpw_lsn,
+														  &num_fpi, &fpi_bytes,
+														  &topxid_included);
+
+			EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags, num_fpi,
+									  fpi_bytes, topxid_included);
+		} while (!XLogRecPtrIsValid(EndPos));
+
+		xlog_storage_mgr_f.xlog_insert_finish(EndPos);
+	}
+	PG_CATCH();
+	{
+		XLogResetInsertion();
+		PG_RE_THROW();
+	}
+	PG_END_TRY();

 	XLogResetInsertion();

@@ -617,11 +826,12 @@ XLogGetFakeLSN(Relation rel)
  * *topxid_included is set if the topmost transaction ID is logged with the
  * current subtransaction.
  */
+#ifndef USE_UMBRA
 static XLogRecData *
-XLogRecordAssemble(RmgrId rmid, uint8 info,
-				   XLogRecPtr RedoRecPtr, bool doPageWrites,
-				   XLogRecPtr *fpw_lsn, int *num_fpi, uint64 *fpi_bytes,
-				   bool *topxid_included)
+XLogRecordAssembleMd(RmgrId rmid, uint8 info,
+					 XLogRecPtr RedoRecPtr, bool doPageWrites,
+					 XLogRecPtr *fpw_lsn, int *num_fpi, uint64 *fpi_bytes,
+					 bool *topxid_included)
 {
 	XLogRecData *rdt;
 	uint64		total_len = 0;
@@ -1009,6 +1219,470 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,

 	return &hdr_rdt;
 }
+#endif
+
+#ifdef USE_UMBRA
+static XLogRecData *
+XLogRecordAssembleUmbra(RmgrId rmid, uint8 info,
+						XLogRecPtr RedoRecPtr, bool doPageWrites,
+						XLogRecPtr *fpw_lsn, int *num_fpi,
+						uint64 *fpi_bytes,
+						bool *topxid_included)
+{
+	XLogRecData *rdt;
+	uint64		total_len = 0;
+	int			block_id;
+	pg_crc32c	rdata_crc;
+	bool		needs_backup_by_block[XLR_MAX_BLOCK_ID + 1] = {0};
+	bool		needs_data_by_block[XLR_MAX_BLOCK_ID + 1] = {0};
+	bool		include_image_by_block[XLR_MAX_BLOCK_ID + 1] = {0};
+	bool		include_remap_by_block[XLR_MAX_BLOCK_ID + 1] = {0};
+	registered_buffer *prev_regbuf = NULL;
+	XLogRecData *rdt_datas_last;
+	XLogRecord *rechdr;
+	char	   *scratch = hdr_scratch;
+
+	rechdr = (XLogRecord *) scratch;
+	scratch += SizeOfXLogRecord;
+
+	hdr_rdt.next = NULL;
+	rdt_datas_last = &hdr_rdt;
+	hdr_rdt.data = hdr_scratch;
+
+	if (wal_consistency_checking[rmid])
+		info |= XLR_CHECK_CONSISTENCY;
+
+	*fpw_lsn = InvalidXLogRecPtr;
+	for (block_id = 0; block_id < max_registered_block_id; block_id++)
+	{
+		registered_buffer *regbuf = &registered_buffers[block_id];
+		bool		needs_backup;
+		bool		needs_remap;
+		XLogRecordBlockRemapHeader rbmh;
+		bool		include_image;
+		bool		include_remap = false;
+		bool		wal_owned_remap_available = false;
+		SMgrRelation reln = NULL;
+
+		if (!regbuf->in_use)
+			continue;
+
+		regbuf->remap_in_record = false;
+		regbuf->wal_owns_firstborn = false;
+
+		if (regbuf->flags & REGBUF_FORCE_IMAGE)
+		{
+			needs_backup = true;
+			needs_remap = false;
+		}
+		else if (regbuf->flags & REGBUF_NO_IMAGE)
+		{
+			needs_backup = false;
+			needs_remap = false;
+		}
+		else if (!doPageWrites)
+		{
+			needs_backup = false;
+			needs_remap = false;
+		}
+		else
+		{
+			XLogRecPtr	page_lsn = PageGetLSN(regbuf->page);
+
+			if (rmid != RM_XLOG_ID || info != XLOG_FPI_FOR_HINT)
+			{
+				reln = smgropen(regbuf->rlocator, INVALID_PROC_NUMBER);
+				regbuf->remap_reln = reln;
+				wal_owned_remap_available =
+					UmWalOwnedRemapAvailable(reln, regbuf->forkno);
+			}
+
+			if (!wal_owned_remap_available ||
+				(rmid == RM_XLOG_ID && info == XLOG_FPI_FOR_HINT))
+			{
+				needs_backup = (page_lsn <= RedoRecPtr);
+				needs_remap = false;
+			}
+			else
+			{
+				needs_backup = false;
+				needs_remap = (page_lsn <= RedoRecPtr);
+			}
+			if (!needs_backup && !needs_remap)
+			{
+				if (!XLogRecPtrIsValid(*fpw_lsn) || page_lsn < *fpw_lsn)
+					*fpw_lsn = page_lsn;
+			}
+		}
+
+		Assert(!(needs_backup && needs_remap));
+
+		if (!regbuf->has_remap &&
+			(regbuf->flags & REGBUF_LOGICAL_BIRTH) != 0)
+		{
+			bool		got_mapping;
+
+			if (reln == NULL)
+			{
+				reln = smgropen(regbuf->rlocator, INVALID_PROC_NUMBER);
+				regbuf->remap_reln = reln;
+			}
+			if (!UmWalOwnedFirstbornAvailable(reln, regbuf->forkno,
+											 regbuf->block))
+				goto remap_birth_done;
+
+			/*
+			 * For WAL-owned first-born, the producer must have already created
+			 * any needed birth claim through smgrextend()/smgrzeroextend().
+			 *
+			 * We first probe committed MAP state. If the mapping is still
+			 * private to this backend, UmMapReserveFreshPbkno() must reuse the
+			 * owner-local in-flight claim instead of reserving a second pblk.
+			 *
+			 * Keep this order aligned with bulk_write.c: claim birth first,
+			 * then let page WAL own the final commit of that same claim.
+			 */
+			got_mapping = UmMapTryLookupPblkno(reln, regbuf->forkno,
+											  regbuf->block,
+											  &regbuf->new_pblkno);
+			if (!got_mapping)
+			{
+				UmMapReserveFreshPbkno(reln, regbuf->forkno,
+									   regbuf->block,
+									   &regbuf->new_pblkno);
+				regbuf->old_pblkno = InvalidBlockNumber;
+				regbuf->has_remap = true;
+				regbuf->remap_committed = false;
+			}
+	remap_birth_done:
+			;
+		}
+
+		include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0;
+		if (regbuf->has_remap)
+			include_remap = true;
+		else if (needs_remap)
+		{
+			if (reln == NULL)
+			{
+				reln = smgropen(regbuf->rlocator, INVALID_PROC_NUMBER);
+				regbuf->remap_reln = reln;
+			}
+			UmMapGetNewPbkno(reln, regbuf->forkno, regbuf->block,
+							 &regbuf->new_pblkno,
+							 &regbuf->old_pblkno);
+
+			regbuf->has_remap = true;
+			regbuf->remap_committed = false;
+			include_remap = true;
+
+			if (include_remap &&
+				regbuf->new_pblkno == regbuf->old_pblkno)
+				elog(PANIC,
+					 "remap decision produced unchanged pblk for %u/%u/%u fork %d block %u",
+					 regbuf->rlocator.spcOid,
+					 regbuf->rlocator.dbOid,
+					 regbuf->rlocator.relNumber,
+					 regbuf->forkno,
+					 regbuf->block);
+		}
+
+		if (include_remap)
+		{
+			rbmh.old_pblkno = regbuf->old_pblkno;
+			rbmh.new_pblkno = regbuf->new_pblkno;
+			XLogFillBlockRemapFrontierUmbra(regbuf, &rbmh);
+
+			if ((regbuf->flags & REGBUF_FORCE_IMAGE) == 0 &&
+				(info & XLR_CHECK_CONSISTENCY) == 0)
+				include_image = false;
+		}
+
+		if (regbuf->rdata_len == 0)
+			needs_data_by_block[block_id] = false;
+		else if ((regbuf->flags & REGBUF_KEEP_DATA) != 0)
+			needs_data_by_block[block_id] = true;
+		else
+			needs_data_by_block[block_id] = (!needs_backup || include_remap);
+
+		needs_backup_by_block[block_id] = needs_backup;
+		include_image_by_block[block_id] = include_image;
+		include_remap_by_block[block_id] = include_remap;
+	}
+
+	for (block_id = 0; block_id < max_registered_block_id; block_id++)
+	{
+		registered_buffer *regbuf = &registered_buffers[block_id];
+		bool		needs_backup;
+		bool		needs_data;
+		XLogRecordBlockHeader bkpb;
+		XLogRecordBlockRemapHeader rbmh;
+		XLogRecordBlockImageHeader bimg;
+		XLogRecordBlockCompressHeader cbimg = {0};
+		bool		samerel;
+		bool		is_compressed = false;
+		bool		include_image;
+		bool		include_remap;
+
+		if (!regbuf->in_use)
+			continue;
+
+		needs_backup = needs_backup_by_block[block_id];
+		needs_data = needs_data_by_block[block_id];
+		include_image = include_image_by_block[block_id];
+		include_remap = include_remap_by_block[block_id];
+
+		regbuf->remap_in_record = false;
+		regbuf->wal_owns_firstborn = false;
+
+		bkpb.id = block_id;
+		bkpb.fork_flags = regbuf->forkno;
+		bkpb.data_length = 0;
+
+		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
+			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
+
+		if (include_remap)
+		{
+			bkpb.fork_flags |= BKPBLOCK_HAS_REMAP;
+			regbuf->remap_in_record = true;
+		}
+
+		if (include_image)
+		{
+			const PageData *page = regbuf->page;
+			uint16		compressed_len = 0;
+
+			if (regbuf->flags & REGBUF_STANDARD)
+			{
+				uint16		lower = ((PageHeader) page)->pd_lower;
+				uint16		upper = ((PageHeader) page)->pd_upper;
+
+				if (lower >= SizeOfPageHeaderData &&
+					upper > lower &&
+					upper <= BLCKSZ)
+				{
+					bimg.hole_offset = lower;
+					cbimg.hole_length = upper - lower;
+				}
+				else
+				{
+					bimg.hole_offset = 0;
+					cbimg.hole_length = 0;
+				}
+			}
+			else
+			{
+				bimg.hole_offset = 0;
+				cbimg.hole_length = 0;
+			}
+
+			if (wal_compression != WAL_COMPRESSION_NONE)
+			{
+				is_compressed =
+					XLogCompressBackupBlock(page, bimg.hole_offset,
+											cbimg.hole_length,
+											regbuf->compressed_page,
+											&compressed_len);
+			}
+
+			bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
+			*num_fpi += 1;
+
+			rdt_datas_last->next = &regbuf->bkp_rdatas[0];
+			rdt_datas_last = rdt_datas_last->next;
+
+			bimg.bimg_info = (cbimg.hole_length == 0) ? 0 : BKPIMAGE_HAS_HOLE;
+
+			if (needs_backup)
+				bimg.bimg_info |= BKPIMAGE_APPLY;
+
+			if (is_compressed)
+			{
+				bimg.length = compressed_len;
+
+				switch ((WalCompression) wal_compression)
+				{
+					case WAL_COMPRESSION_PGLZ:
+						bimg.bimg_info |= BKPIMAGE_COMPRESS_PGLZ;
+						break;
+
+					case WAL_COMPRESSION_LZ4:
+#ifdef USE_LZ4
+						bimg.bimg_info |= BKPIMAGE_COMPRESS_LZ4;
+#else
+						elog(ERROR, "LZ4 is not supported by this build");
+#endif
+						break;
+
+					case WAL_COMPRESSION_ZSTD:
+#ifdef USE_ZSTD
+						bimg.bimg_info |= BKPIMAGE_COMPRESS_ZSTD;
+#else
+						elog(ERROR, "zstd is not supported by this build");
+#endif
+						break;
+
+					case WAL_COMPRESSION_NONE:
+						Assert(false);
+						break;
+				}
+
+				rdt_datas_last->data = regbuf->compressed_page;
+				rdt_datas_last->len = compressed_len;
+			}
+			else
+			{
+				bimg.length = BLCKSZ - cbimg.hole_length;
+
+				if (cbimg.hole_length == 0)
+				{
+					rdt_datas_last->data = page;
+					rdt_datas_last->len = BLCKSZ;
+				}
+				else
+				{
+					rdt_datas_last->data = page;
+					rdt_datas_last->len = bimg.hole_offset;
+
+					rdt_datas_last->next = &regbuf->bkp_rdatas[1];
+					rdt_datas_last = rdt_datas_last->next;
+
+					rdt_datas_last->data =
+						page + (bimg.hole_offset + cbimg.hole_length);
+					rdt_datas_last->len =
+						BLCKSZ - (bimg.hole_offset + cbimg.hole_length);
+				}
+			}
+
+			total_len += bimg.length;
+			*fpi_bytes += bimg.length;
+		}
+
+		if (needs_data)
+		{
+			Assert(regbuf->rdata_len <= UINT16_MAX);
+
+			bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
+			bkpb.data_length = (uint16) regbuf->rdata_len;
+			total_len += regbuf->rdata_len;
+
+			rdt_datas_last->next = regbuf->rdata_head;
+			rdt_datas_last = regbuf->rdata_tail;
+		}
+
+		if (prev_regbuf && RelFileLocatorEquals(regbuf->rlocator, prev_regbuf->rlocator))
+		{
+			samerel = true;
+			bkpb.fork_flags |= BKPBLOCK_SAME_REL;
+		}
+		else
+			samerel = false;
+		prev_regbuf = regbuf;
+
+		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
+		scratch += SizeOfXLogRecordBlockHeader;
+		if (include_remap)
+		{
+			rbmh.old_pblkno = regbuf->old_pblkno;
+			rbmh.new_pblkno = regbuf->new_pblkno;
+			rbmh.logical_nblocks = regbuf->remap_logical_nblocks;
+			rbmh.next_free_pblkno = regbuf->remap_next_free_pblkno;
+			memcpy(scratch, &rbmh, SizeOfXLogRecordBlockRemapHeader);
+			scratch += SizeOfXLogRecordBlockRemapHeader;
+		}
+		if (include_image)
+		{
+			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
+			scratch += SizeOfXLogRecordBlockImageHeader;
+			if (cbimg.hole_length != 0 && is_compressed)
+			{
+				memcpy(scratch, &cbimg,
+					   SizeOfXLogRecordBlockCompressHeader);
+				scratch += SizeOfXLogRecordBlockCompressHeader;
+			}
+		}
+		if (!samerel)
+		{
+			memcpy(scratch, &regbuf->rlocator, sizeof(RelFileLocator));
+			scratch += sizeof(RelFileLocator);
+		}
+		memcpy(scratch, &regbuf->block, sizeof(BlockNumber));
+		scratch += sizeof(BlockNumber);
+	}
+
+	if ((curinsert_flags & XLOG_INCLUDE_ORIGIN) &&
+		replorigin_xact_state.origin != InvalidReplOriginId)
+	{
+		*(scratch++) = (char) XLR_BLOCK_ID_ORIGIN;
+		memcpy(scratch, &replorigin_xact_state.origin, sizeof(replorigin_xact_state.origin));
+		scratch += sizeof(replorigin_xact_state.origin);
+	}
+
+	if (IsSubxactTopXidLogPending())
+	{
+		TransactionId xid = GetTopTransactionIdIfAny();
+
+		*topxid_included = true;
+
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
+	if (mainrdata_len > 0)
+	{
+		if (mainrdata_len > 255)
+		{
+			uint32 mainrdata_len_4b;
+
+			if (mainrdata_len > PG_UINT32_MAX)
+				ereport(ERROR,
+						(errmsg_internal("too much WAL data"),
+						 errdetail_internal("Main data length is %" PRIu64 " bytes for a maximum of %u bytes.",
+											mainrdata_len,
+											PG_UINT32_MAX)));
+
+			mainrdata_len_4b = (uint32) mainrdata_len;
+			*(scratch++) = (char) XLR_BLOCK_ID_DATA_LONG;
+			memcpy(scratch, &mainrdata_len_4b, sizeof(uint32));
+			scratch += sizeof(uint32);
+		}
+		else
+		{
+			*(scratch++) = (char) XLR_BLOCK_ID_DATA_SHORT;
+			*(scratch++) = (uint8) mainrdata_len;
+		}
+		rdt_datas_last->next = mainrdata_head;
+		rdt_datas_last = mainrdata_last;
+		total_len += mainrdata_len;
+	}
+	rdt_datas_last->next = NULL;
+
+	hdr_rdt.len = (scratch - hdr_scratch);
+	total_len += hdr_rdt.len;
+
+	INIT_CRC32C(rdata_crc);
+	COMP_CRC32C(rdata_crc, hdr_scratch + SizeOfXLogRecord, hdr_rdt.len - SizeOfXLogRecord);
+	for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
+		COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+
+	if (total_len > XLogRecordMaxSize)
+		ereport(ERROR,
+				(errmsg_internal("oversized WAL record"),
+				 errdetail_internal("WAL record would be %" PRIu64 " bytes (of maximum %u bytes); rmid %u flags %u.",
+									total_len, XLogRecordMaxSize, rmid, info)));
+
+	rechdr->xl_xid = GetCurrentTransactionIdIfAny();
+	rechdr->xl_tot_len = (uint32) total_len;
+	rechdr->xl_info = info;
+	rechdr->xl_rmid = rmid;
+	rechdr->xl_prev = InvalidXLogRecPtr;
+	rechdr->xl_crc = rdata_crc;
+
+	return &hdr_rdt;
+}
+#endif

 /*
  * Create a compressed version of a backup block image.
@@ -1194,7 +1868,7 @@ log_newpage(RelFileLocator *rlocator, ForkNumber forknum, BlockNumber blkno,
 	int			flags;
 	XLogRecPtr	recptr;

-	flags = REGBUF_FORCE_IMAGE;
+	flags = REGBUF_FORCE_IMAGE | REGBUF_LOGICAL_BIRTH;
 	if (page_std)
 		flags |= REGBUF_STANDARD;

@@ -1228,7 +1902,7 @@ log_newpages(RelFileLocator *rlocator, ForkNumber forknum, int num_pages,
 	int			i;
 	int			j;

-	flags = REGBUF_FORCE_IMAGE;
+	flags = REGBUF_FORCE_IMAGE | REGBUF_LOGICAL_BIRTH;
 	if (page_std)
 		flags |= REGBUF_STANDARD;

@@ -1322,7 +1996,7 @@ log_newpage_range(Relation rel, ForkNumber forknum,
 	int			flags;
 	BlockNumber blkno;

-	flags = REGBUF_FORCE_IMAGE;
+	flags = REGBUF_FORCE_IMAGE | REGBUF_LOGICAL_BIRTH;
 	if (page_std)
 		flags |= REGBUF_STANDARD;

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 8849610db0..ae9c2c7802 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1795,10 +1795,20 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk = &decoded->blocks[block_id];
 			blk->in_use = true;
 			blk->apply_image = false;
+#ifdef USE_UMBRA
+			blk->has_remap = false;
+			blk->old_pblkno = InvalidBlockNumber;
+			blk->new_pblkno = InvalidBlockNumber;
+			blk->logical_nblocks = InvalidBlockNumber;
+			blk->next_free_pblkno = InvalidBlockNumber;
+#endif

 			COPY_HEADER_FIELD(&fork_flags, sizeof(uint8));
 			blk->forknum = fork_flags & BKPBLOCK_FORK_MASK;
 			blk->flags = fork_flags;
+#ifdef USE_UMBRA
+			blk->has_remap = ((fork_flags & BKPBLOCK_HAS_REMAP) != 0);
+#endif
 			blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);

@@ -1823,6 +1833,36 @@ DecodeXLogRecord(XLogReaderState *state,
 			}
 			datatotal += blk->data_len;

+#ifdef USE_UMBRA
+			if (blk->has_remap)
+			{
+				uint8		remap_format =
+					decoded->header.xl_info & XLR_UMBRA_REMAP_FORMAT_MASK;
+
+				if (remap_format != 0)
+				{
+					report_invalid_record(state,
+										  "unsupported remap format bits 0x%02X at %X/%X",
+										  remap_format,
+										  LSN_FORMAT_ARGS(state->ReadRecPtr));
+					goto err;
+				}
+
+				COPY_HEADER_FIELD(&blk->old_pblkno, sizeof(BlockNumber));
+				COPY_HEADER_FIELD(&blk->new_pblkno, sizeof(BlockNumber));
+				COPY_HEADER_FIELD(&blk->logical_nblocks, sizeof(BlockNumber));
+				COPY_HEADER_FIELD(&blk->next_free_pblkno, sizeof(BlockNumber));
+			}
+#else
+			if (fork_flags & BKPBLOCK_HAS_REMAP)
+			{
+				report_invalid_record(state,
+									  "BKPBLOCK_HAS_REMAP is not allowed in this storage mode at %X/%X",
+									  LSN_FORMAT_ARGS(state->ReadRecPtr));
+				goto err;
+			}
+#endif
+
 			if (blk->has_image)
 			{
 				COPY_HEADER_FIELD(&blk->bimg_len, sizeof(uint16));
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index f32aac5476..3090a2eb47 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -118,6 +118,10 @@ static bool XLogUmbraEnsureMappedBlockForRedo(RelFileLocator rlocator,
 											  BlockNumber blkno);
 static bool XLogUmbraEnsureMetadataForRedo(RelFileLocator rlocator,
 										   ForkNumber forknum);
+static void XLogUmbraLockRedoBuffer(Buffer buf, ReadBufferMode mode,
+									bool get_cleanup_lock);
+static inline bool XLogBlockRemapRedoImageEnabled(void);
+static inline bool XLogBlockRemapRedoNoImageEnabled(void);
 #endif

 /* Report a reference to an invalid page */
@@ -421,6 +425,25 @@ XLogCheckInvalidPages(void)
 }

 #ifdef USE_UMBRA
+static inline bool
+XLogBlockRemapRedoImageEnabled(void)
+{
+	/*
+	 * Phase 1: remap-with-image redo is always enabled. This path is
+	 * deterministic and does not require an old->new baseline switch.
+	 */
+	return true;
+}
+
+static inline bool
+XLogBlockRemapRedoNoImageEnabled(void)
+{
+	/*
+	 * No-image remap redo uses a deterministic old->new baseline handoff.
+	 */
+	return true;
+}
+
 /*
  * Redo is an owner point for handle-local Umbra MAP state.
  *
@@ -501,6 +524,17 @@ XLogUmbraEnsureMetadataForRedo(RelFileLocator rlocator, ForkNumber forknum)
 	return true;
 }

+static void
+XLogUmbraLockRedoBuffer(Buffer buf, ReadBufferMode mode, bool get_cleanup_lock)
+{
+	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
+		return;
+
+	if (get_cleanup_lock)
+		LockBufferForCleanup(buf);
+	else
+		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+}
 #endif

@@ -679,6 +713,10 @@ XLogReadBufferForRedoExtendedUmbra(XLogReaderState *record,
 	Page		page;
 	bool		zeromode;
 	bool		willinit;
+	bool		has_remap;
+	bool		has_image;
+	DecodedBkpBlock *blk;
+	SMgrRelation remap_smgr = NULL;

 	if (!XLogRecGetBlockTagExtended(record, block_id, &rlocator, &forknum, &blkno,
 									&prefetch_buffer))
@@ -696,11 +734,92 @@ XLogReadBufferForRedoExtendedUmbra(XLogReaderState *record,
 	if (!willinit && zeromode)
 		elog(PANIC, "block to be initialized in redo routine must be marked with WILL_INIT flag in the WAL record");

+	blk = XLogRecGetBlock(record, block_id);
+	has_remap = XLogRecBlockHasRemap(record, block_id);
+	has_image = XLogRecBlockImageApply(record, block_id);
+
+	if (!has_remap)
+	{
+		if (!XLogUmbraEnsureMetadataForRedo(rlocator, forknum))
+			return BLK_NOTFOUND;
+
+		if (has_image)
+		{
+			Assert(XLogRecHasBlockImage(record, block_id));
+			*buf = XLogReadBufferExtended(rlocator, forknum, blkno,
+										  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
+										  prefetch_buffer);
+			page = BufferGetPage(*buf);
+			if (!RestoreBlockImage(record, block_id, page))
+				ereport(ERROR,
+						(errcode(ERRCODE_INTERNAL_ERROR),
+						 errmsg_internal("%s", record->errormsg_buf)));
+
+			if (!PageIsNew(page))
+				PageSetLSN(page, lsn);
+
+			MarkBufferDirty(*buf);
+			if (forknum == INIT_FORKNUM)
+				FlushOneBuffer(*buf);
+
+			return BLK_RESTORED;
+		}
+
+		*buf = XLogReadBufferExtended(rlocator, forknum, blkno, mode,
+									  prefetch_buffer);
+		if (BufferIsValid(*buf))
+		{
+			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
+			{
+				if (get_cleanup_lock)
+					LockBufferForCleanup(*buf);
+				else
+					LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
+			}
+			if (lsn <= PageGetLSN(BufferGetPage(*buf)))
+				return BLK_DONE;
+			return BLK_NEEDS_REDO;
+		}
+		return BLK_NOTFOUND;
+	}
+
 	if (!XLogUmbraEnsureMetadataForRedo(rlocator, forknum))
 		return BLK_NOTFOUND;
+	remap_smgr = smgropen(rlocator, INVALID_PROC_NUMBER);

-	if (XLogRecBlockImageApply(record, block_id))
+	if (has_image)
 	{
+		UmbraFileContext *ctx = umfile_ctx_acquire(remap_smgr->smgr_rlocator);
+		BlockNumber redo_logical_nblocks;
+		BlockNumber redo_next_free_pblkno;
+
+		if (!XLogBlockRemapRedoImageEnabled())
+			elog(PANIC,
+				 "encountered remap-with-image WAL record before phase-1 redo is enabled for %u/%u/%u fork %d block %u",
+				 rlocator.spcOid, rlocator.dbOid, rlocator.relNumber,
+				 forknum, blkno);
+
+		redo_logical_nblocks =
+			(blk->logical_nblocks != InvalidBlockNumber) ?
+			blk->logical_nblocks : blkno + 1;
+		redo_next_free_pblkno =
+			(blk->next_free_pblkno != InvalidBlockNumber) ?
+			blk->next_free_pblkno : blk->new_pblkno + 1;
+
+		UmMapSetMapping(remap_smgr, forknum, blkno, blk->new_pblkno, lsn);
+		MapSBlockBumpNextFreePhysBlock(ctx, rlocator,
+									   forknum, redo_next_free_pblkno,
+									   lsn);
+		if (blk->old_pblkno == InvalidBlockNumber)
+		{
+			MapSBlockBumpLogicalNblocks(ctx, rlocator,
+										forknum, redo_logical_nblocks,
+										lsn);
+			smgrbumpcachednblocks(remap_smgr, forknum, blkno + 1);
+		}
+		if (!XLogUmbraEnsureMappedBlockForRedo(rlocator, forknum, blkno))
+			return BLK_NOTFOUND;
+
 		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rlocator, forknum, blkno,
 									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
@@ -721,6 +840,56 @@ XLogReadBufferForRedoExtendedUmbra(XLogReaderState *record,
 		return BLK_RESTORED;
 	}

+	if (!XLogBlockRemapRedoNoImageEnabled())
+		elog(PANIC,
+			 "encountered remap-without-image WAL record before phase-2 redo is enabled for %u/%u/%u fork %d block %u",
+			 rlocator.spcOid, rlocator.dbOid, rlocator.relNumber,
+			 forknum, blkno);
+
+	if (zeromode)
+	{
+		UmbraFileContext *ctx = umfile_ctx_acquire(remap_smgr->smgr_rlocator);
+		BlockNumber redo_logical_nblocks;
+		BlockNumber redo_next_free_pblkno;
+
+		Assert(willinit);
+
+		redo_logical_nblocks =
+			(blk->logical_nblocks != InvalidBlockNumber) ?
+			blk->logical_nblocks : blkno + 1;
+		redo_next_free_pblkno =
+			(blk->next_free_pblkno != InvalidBlockNumber) ?
+			blk->next_free_pblkno : blk->new_pblkno + 1;
+
+		UmMapSetMapping(remap_smgr, forknum, blkno, blk->new_pblkno, lsn);
+		MapSBlockBumpNextFreePhysBlock(ctx, rlocator,
+									   forknum, redo_next_free_pblkno,
+									   lsn);
+		if (blk->old_pblkno == InvalidBlockNumber)
+		{
+			MapSBlockBumpLogicalNblocks(ctx, rlocator,
+										forknum, redo_logical_nblocks,
+										lsn);
+			smgrbumpcachednblocks(remap_smgr, forknum, blkno + 1);
+		}
+		if (!XLogUmbraEnsureMappedBlockForRedo(rlocator, forknum, blkno))
+			return BLK_NOTFOUND;
+
+		*buf = XLogReadBufferExtended(rlocator, forknum, blkno, mode,
+									  prefetch_buffer);
+		if (BufferIsValid(*buf))
+			return BLK_NEEDS_REDO;
+		return BLK_NOTFOUND;
+	}
+
+	Assert(!willinit);
+	if (blk->old_pblkno == InvalidBlockNumber)
+		elog(PANIC,
+			 "remap-without-image record has invalid old pblk for %u/%u/%u fork %d block %u",
+			 rlocator.spcOid, rlocator.dbOid, rlocator.relNumber,
+			 forknum, blkno);
+
+	UmMapSetMapping(remap_smgr, forknum, blkno, blk->old_pblkno, lsn);
 	if (!XLogUmbraEnsureMappedBlockForRedo(rlocator, forknum, blkno))
 		return BLK_NOTFOUND;

@@ -729,6 +898,17 @@ XLogReadBufferForRedoExtendedUmbra(XLogReaderState *record,
 	if (!BufferIsValid(*buf))
 		return BLK_NOTFOUND;

+	XLogUmbraLockRedoBuffer(*buf, mode, get_cleanup_lock);
+	MarkBufferDirty(*buf);
+	FlushOneBuffer(*buf);
+
+	UmMapSetMapping(remap_smgr, forknum, blkno, blk->new_pblkno, lsn);
+	if (blk->next_free_pblkno != InvalidBlockNumber)
+		MapSBlockBumpNextFreePhysBlock(umfile_ctx_acquire(remap_smgr->smgr_rlocator),
+									   rlocator,
+									   forknum, blk->next_free_pblkno,
+									   lsn);
+
 	if (lsn <= PageGetLSN(BufferGetPage(*buf)))
 		return BLK_DONE;
 	return BLK_NEEDS_REDO;
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 9c79dadaac..956039aa15 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -94,6 +94,7 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
 static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 					 struct stat *statbuf, bool missing_ok,
 					 Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+					 ForkNumber relForkNum,
 					 unsigned segno,
 					 backup_manifest_info *manifest,
 					 unsigned num_incremental_blocks,
@@ -364,7 +365,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 									XLOG_CONTROL_FILE)));
 				sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
 						 false, InvalidOid, InvalidOid,
-						 InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
+						 InvalidRelFileNumber, InvalidForkNumber, 0, &manifest,
+						 0, NULL, 0);
 			}
 			else
 			{
@@ -630,7 +632,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 						 errmsg("could not stat file \"%s\": %m", pathbuf)));

 			sendFile(sink, pathbuf, pathbuf, &statbuf, false,
-					 InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+					 InvalidOid, InvalidOid, InvalidRelFileNumber,
+					 InvalidForkNumber, 0,
 					 &manifest, 0, NULL, 0);

 			/* unconditionally mark file as archived */
@@ -1526,7 +1529,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
 			if (!sizeonly)
 				sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
 								true, dboid, spcoid,
-								relfilenumber, segno, manifest,
+								relfilenumber, relForkNum, segno, manifest,
 								num_blocks_required,
 								method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
 								truncation_block_length);
@@ -1575,7 +1578,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
 static bool
 sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 		 struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
-		 RelFileNumber relfilenumber, unsigned segno,
+		 RelFileNumber relfilenumber, ForkNumber relForkNum, unsigned segno,
 		 backup_manifest_info *manifest, unsigned num_incremental_blocks,
 		 BlockNumber *incremental_blocks, unsigned truncation_block_length)
 {
@@ -1617,7 +1620,16 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 	 * or disabled as that might change, thus we check at each point where we
 	 * could be validating a checksum.
 	 */
-	if (!noverify_checksums && RelFileNumberIsValid(relfilenumber))
+	if (!noverify_checksums && RelFileNumberIsValid(relfilenumber)
+#ifdef USE_UMBRA
+		/*
+		 * Umbra mapped forks are copied in physical block order during base
+		 * backup, but page checksums stay keyed by logical block number.
+		 * INIT forks remain direct-mapped and can still be verified here.
+		 */
+		&& relForkNum == INIT_FORKNUM
+#endif
+		)
 		verify_checksum = true;

 	/*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 551667650b..e5b2583b47 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -403,7 +403,7 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum)
 		XLogRecPtr	recptr;

 		XLogBeginInsert();
-		XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT_BIRTH);

 		xlrec.locator = rel->rd_locator;

@@ -832,7 +832,7 @@ nextval_internal(Oid relid, bool check_permissions)
 		 * sequence values if we crash.
 		 */
 		XLogBeginInsert();
-		XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT_BIRTH);

 		/* set values that will be saved in xlog */
 		seq->last_value = next;
@@ -1024,7 +1024,7 @@ SetSequence(Oid relid, int64 next, bool iscalled)
 		Page		page = BufferGetPage(buf);

 		XLogBeginInsert();
-		XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT_BIRTH);

 		xlrec.locator = seqrel->rd_locator;
 		XLogRegisterData(&xlrec, sizeof(xl_seq_rec));
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index eec09ba1de..d11cddb082 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -17422,7 +17422,8 @@ index_copy_data(Relation rel, RelFileLocator newrlocator)
 	{
 		if (smgrexists(RelationGetSmgr(rel), forkNum))
 		{
-			smgrcreate(dstrel, forkNum, false);
+			if (!smgrisinternalfork(forkNum))
+				smgrcreate(dstrel, forkNum, false);

 			/*
 			 * WAL log creation if the relation is persistent, or this is the
@@ -17432,11 +17433,15 @@ index_copy_data(Relation rel, RelFileLocator newrlocator)
 				(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
 				 forkNum == INIT_FORKNUM))
 				log_smgrcreate(&newrlocator, forkNum);
-			RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
-								rel->rd_rel->relpersistence);
+			if (!smgrisinternalfork(forkNum))
+				RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
+									rel->rd_rel->relpersistence);
 		}
 	}

+	smgrcopyrelationmetadata(RelationGetSmgr(rel), dstrel,
+							 rel->rd_rel->relpersistence);
+
 	/* drop old relation, and close new one */
 	RelationDropStorage(rel);
 	smgrclose(dstrel);
diff --git a/src/backend/storage/map/map.c b/src/backend/storage/map/map.c
index bd839a3e9f..0dad150b2b 100644
--- a/src/backend/storage/map/map.c
+++ b/src/backend/storage/map/map.c
@@ -1382,6 +1382,17 @@ void MapGetNewPbkno(UmbraFileContext *map_ctx, RelFileLocator rnode, ForkNumber
 	Assert(new_pblkno != NULL);
 	Assert(old_pblkno != NULL);

+	/*
+	 * During recovery, MAIN fork physical choices must come from WAL records.
+	 * FSM/VM are hint forks and replay can touch them without dedicated remap
+	 * metadata (e.g. free space updates), so we allow local allocation for
+	 * them.
+	 */
+	if (InRecovery && forknum == MAIN_FORKNUM)
+		elog(PANIC,
+			 "MapGetNewPbkno called during recovery for rel %u/%u/%u fork %d blk %u",
+			 rnode.spcOid, rnode.dbOid, rnode.relNumber, forknum, lblkno);
+
 	for (;;)
 	{
 		if (!MapTryLookup(map_ctx, rnode, forknum, lblkno, &cur_pblkno))
diff --git a/src/backend/storage/map/mapsuper.c b/src/backend/storage/map/mapsuper.c
index 3d8909f7a4..07ac7b39c6 100644
--- a/src/backend/storage/map/mapsuper.c
+++ b/src/backend/storage/map/mapsuper.c
@@ -654,6 +654,18 @@ MapSBlockRead(UmbraFileContext *map_ctx, RelFileLocator rnode, MapSuperblock *su
 				entry->page_lsn = MapSuperblockGetLastUpdatedLSN(&disk_super);
 				entry->flags = MAPSUPER_FLAG_VALID;
 				MapSuperResetReservedNextFrees(entry);
+				Assert(MapNormalizeForkBlockCount(MAIN_FORKNUM,
+												  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																					MAIN_FORKNUM)) <=
+					   MapSuperGetReservedNextFree(entry, MAIN_FORKNUM));
+				Assert(MapNormalizeForkBlockCount(FSM_FORKNUM,
+												  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																					FSM_FORKNUM)) <=
+					   MapSuperGetReservedNextFree(entry, FSM_FORKNUM));
+				Assert(MapNormalizeForkBlockCount(VISIBILITYMAP_FORKNUM,
+												  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																					VISIBILITYMAP_FORKNUM)) <=
+					   MapSuperGetReservedNextFree(entry, VISIBILITYMAP_FORKNUM));
 			}
 			else
 			{
@@ -661,6 +673,18 @@ MapSBlockRead(UmbraFileContext *map_ctx, RelFileLocator rnode, MapSuperblock *su
 				entry->page_lsn = InvalidXLogRecPtr;
 				entry->flags = MAPSUPER_FLAG_VALID | MAPSUPER_FLAG_CORRUPT;
 				MapSuperResetReservedNextFrees(entry);
+				Assert(MapNormalizeForkBlockCount(MAIN_FORKNUM,
+												  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																					MAIN_FORKNUM)) <=
+					   MapSuperGetReservedNextFree(entry, MAIN_FORKNUM));
+				Assert(MapNormalizeForkBlockCount(FSM_FORKNUM,
+												  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																					FSM_FORKNUM)) <=
+					   MapSuperGetReservedNextFree(entry, FSM_FORKNUM));
+				Assert(MapNormalizeForkBlockCount(VISIBILITYMAP_FORKNUM,
+												  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																					VISIBILITYMAP_FORKNUM)) <=
+					   MapSuperGetReservedNextFree(entry, VISIBILITYMAP_FORKNUM));
 			}
 		}
 		else if (entry->flags & MAPSUPER_FLAG_CORRUPT)
@@ -684,6 +708,18 @@ MapSBlockRead(UmbraFileContext *map_ctx, RelFileLocator rnode, MapSuperblock *su
 				entry->page_lsn = MapSuperblockGetLastUpdatedLSN(&disk_super);
 				entry->flags = MAPSUPER_FLAG_VALID;
 				MapSuperResetReservedNextFrees(entry);
+				Assert(MapNormalizeForkBlockCount(MAIN_FORKNUM,
+												  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																					MAIN_FORKNUM)) <=
+					   MapSuperGetReservedNextFree(entry, MAIN_FORKNUM));
+				Assert(MapNormalizeForkBlockCount(FSM_FORKNUM,
+												  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																					FSM_FORKNUM)) <=
+					   MapSuperGetReservedNextFree(entry, FSM_FORKNUM));
+				Assert(MapNormalizeForkBlockCount(VISIBILITYMAP_FORKNUM,
+												  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																					VISIBILITYMAP_FORKNUM)) <=
+					   MapSuperGetReservedNextFree(entry, VISIBILITYMAP_FORKNUM));
 			}
 			else
 			{
@@ -691,6 +727,18 @@ MapSBlockRead(UmbraFileContext *map_ctx, RelFileLocator rnode, MapSuperblock *su
 				entry->page_lsn = InvalidXLogRecPtr;
 				entry->flags = MAPSUPER_FLAG_VALID | MAPSUPER_FLAG_CORRUPT;
 				MapSuperResetReservedNextFrees(entry);
+				Assert(MapNormalizeForkBlockCount(MAIN_FORKNUM,
+												  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																					MAIN_FORKNUM)) <=
+					   MapSuperGetReservedNextFree(entry, MAIN_FORKNUM));
+				Assert(MapNormalizeForkBlockCount(FSM_FORKNUM,
+												  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																					FSM_FORKNUM)) <=
+					   MapSuperGetReservedNextFree(entry, FSM_FORKNUM));
+				Assert(MapNormalizeForkBlockCount(VISIBILITYMAP_FORKNUM,
+												  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																					VISIBILITYMAP_FORKNUM)) <=
+					   MapSuperGetReservedNextFree(entry, VISIBILITYMAP_FORKNUM));
 			}
 		}
 		else if (entry->flags & MAPSUPER_FLAG_CORRUPT)
@@ -705,6 +753,18 @@ MapSBlockRead(UmbraFileContext *map_ctx, RelFileLocator rnode, MapSuperblock *su
 		 * should consume that runtime state directly. Disk identity/CRC
 		 * validation belongs to the slow path that populates shared state.
 		 */
+		Assert(MapNormalizeForkBlockCount(MAIN_FORKNUM,
+										  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																			MAIN_FORKNUM)) <=
+			   MapSuperGetReservedNextFree(entry, MAIN_FORKNUM));
+		Assert(MapNormalizeForkBlockCount(FSM_FORKNUM,
+										  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																			FSM_FORKNUM)) <=
+			   MapSuperGetReservedNextFree(entry, FSM_FORKNUM));
+		Assert(MapNormalizeForkBlockCount(VISIBILITYMAP_FORKNUM,
+										  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																			VISIBILITYMAP_FORKNUM)) <=
+			   MapSuperGetReservedNextFree(entry, VISIBILITYMAP_FORKNUM));
 		*super = entry->super;
 		status = (entry->flags & MAPSUPER_FLAG_CORRUPT) ?
 			MAP_SBLOCK_READ_CORRUPT : MAP_SBLOCK_READ_OK;
@@ -835,9 +895,6 @@ MapSuperSetExtendingTarget(MapSuperEntry *entry, ForkNumber forknum,
 	}
 }

-
-
-
 static bool
 MapSuperPrepareEntryForUpdate(UmbraFileContext *map_ctx, RelFileLocator rnode,
 							  XLogRecPtr map_lsn, const char *missing_errmsg,
@@ -955,6 +1012,10 @@ MapSBlockUpdateLogicalNblocks(UmbraFileContext *map_ctx, RelFileLocator rnode,
 		entry->flags |= MAPSUPER_FLAG_DIRTY;
 	}

+	Assert(MapNormalizeForkBlockCount(forknum,
+									  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																		forknum)) <=
+		   MapSuperGetReservedNextFree(entry, forknum));
 	LWLockRelease(&entry->lock);
 }

@@ -1197,6 +1258,19 @@ MapSBlockInit(UmbraFileContext *map_ctx, RelFileLocator rnode, XLogRecPtr map_ls
 		map_lsn : GetXLogWriteRecPtr();
 	MapSuperblockSetLastUpdatedLSN(&entry->super, entry->page_lsn);
 	entry->flags = MAPSUPER_FLAG_VALID | MAPSUPER_FLAG_DIRTY;
+	MapSuperResetReservedNextFrees(entry);
+	Assert(MapNormalizeForkBlockCount(MAIN_FORKNUM,
+									  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																		MAIN_FORKNUM)) <=
+		   MapSuperGetReservedNextFree(entry, MAIN_FORKNUM));
+	Assert(MapNormalizeForkBlockCount(FSM_FORKNUM,
+									  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																		FSM_FORKNUM)) <=
+		   MapSuperGetReservedNextFree(entry, FSM_FORKNUM));
+	Assert(MapNormalizeForkBlockCount(VISIBILITYMAP_FORKNUM,
+									  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																		VISIBILITYMAP_FORKNUM)) <=
+		   MapSuperGetReservedNextFree(entry, VISIBILITYMAP_FORKNUM));

 	/*
 	 * Persist superblock immediately so later backends in bootstrap/initdb can
@@ -1246,16 +1320,30 @@ MapSBlockEnsureLoaded(UmbraFileContext *map_ctx, RelFileLocator rnode)
 				entry->super = disk_super;
 				entry->page_lsn = MapSuperblockGetLastUpdatedLSN(&disk_super);
 				entry->flags = MAPSUPER_FLAG_VALID;
+				MapSuperResetReservedNextFrees(entry);
 			}
 			else
 			{
 				MapSuperblockInit(&entry->super, 0);
 				entry->page_lsn = InvalidXLogRecPtr;
 				entry->flags = MAPSUPER_FLAG_VALID | MAPSUPER_FLAG_CORRUPT;
+				MapSuperResetReservedNextFrees(entry);
 			}
 		}
 	}

+	Assert(MapNormalizeForkBlockCount(MAIN_FORKNUM,
+									  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																		MAIN_FORKNUM)) <=
+		   MapSuperGetReservedNextFree(entry, MAIN_FORKNUM));
+	Assert(MapNormalizeForkBlockCount(FSM_FORKNUM,
+									  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																		FSM_FORKNUM)) <=
+		   MapSuperGetReservedNextFree(entry, FSM_FORKNUM));
+	Assert(MapNormalizeForkBlockCount(VISIBILITYMAP_FORKNUM,
+									  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																		VISIBILITYMAP_FORKNUM)) <=
+		   MapSuperGetReservedNextFree(entry, VISIBILITYMAP_FORKNUM));
 	LWLockRelease(&entry->lock);
 	return true;
 }
@@ -1378,8 +1466,6 @@ MapSBlockTryGetNextFreePhysBlock(UmbraFileContext *map_ctx, RelFileLocator rnode
 	return true;
 }

-
-
 void
 MapSBlockBumpLogicalNblocks(UmbraFileContext *map_ctx, RelFileLocator rnode,
 							ForkNumber forknum, BlockNumber nblocks,
@@ -1505,6 +1591,9 @@ MapSuperTableShmemInit(void)
 		entry->next_free =
 			(i == MapSuperCapacity - 1) ? MAPSUPER_FREENEXT_END : (i + 1);
 		entry->in_use = false;
+		entry->reserved_next_free_main = 0;
+		entry->reserved_next_free_fsm = 0;
+		entry->reserved_next_free_vm = 0;
 		entry->extending_target_main = InvalidBlockNumber;
 		entry->extending_target_fsm = InvalidBlockNumber;
 		entry->extending_target_vm = InvalidBlockNumber;
diff --git a/src/backend/storage/smgr/bulk_write.c b/src/backend/storage/smgr/bulk_write.c
index f3c24082a6..b1e8dfa9d8 100644
--- a/src/backend/storage/smgr/bulk_write.c
+++ b/src/backend/storage/smgr/bulk_write.c
@@ -250,6 +250,51 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
 	if (npending > 1)
 		qsort(pending_writes, npending, sizeof(PendingWrite), buffer_cmp);

+	/*
+	 * For Umbra mapped forks, WAL for new pages cannot be the first place that
+	 * introduces logical coverage for a whole bulk-written run.
+	 *
+	 * log_newpages() writes one WAL record for a batch of pages, but
+	 * wal-owned firstborn is only guaranteed for pages that are already
+	 * contiguous with the current logical EOF. If we WAL-log a whole pending
+	 * run before the relation is physically/logically extended, later
+	 * smgrextend() calls can see a stale logical EOF and mistakenly zeroextend
+	 * over earlier data pages in the same batch.
+	 *
+	 * Extend first so later pages in the run see a stable logical frontier
+	 * while the batch is being prepared.  Umbra may still leave holes in the
+	 * physical frontier for WAL-owned tail pages; that is acceptable as long
+	 * as page WAL publishes the final mapping before the real page images are
+	 * written below.
+	 *
+	 * The important ownership rule is:
+	 *   1. smgrextend() claims first-born ownership and reserves the pblk
+	 *      backend-locally
+	 *   2. the later page WAL record reuses that same in-flight claim and
+	 *      commits the mapping
+	 *
+	 * Do not reorder this to "log_newpages() first, smgrextend() later".
+	 * xloginsert's birth path is allowed to own commit only after the birth
+	 * claim already exists; otherwise a future refactor can reintroduce
+	 * duplicate pblk reservation or stale-EOF zeroextend bugs.
+	 */
+#ifdef USE_UMBRA
+	if (npending > 0)
+	{
+		BlockNumber maxblk;
+
+		maxblk = pending_writes[npending - 1].blkno;
+		while (bulkstate->relsize <= maxblk)
+		{
+			smgrextend(bulkstate->smgr, bulkstate->forknum,
+					   bulkstate->relsize,
+					   &zero_buffer,
+					   true);
+			bulkstate->relsize++;
+		}
+	}
+#endif
+
 	if (bulkstate->use_wal)
 	{
 		BlockNumber blknos[MAX_PENDING_WRITES];
@@ -261,9 +306,9 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
 			blknos[i] = pending_writes[i].blkno;
 			pages[i] = pending_writes[i].buf->data;

-			/*
-			 * If any of the pages use !page_std, we log them all as such.
-			 * That's a bit wasteful, but in practice, a mix of standard and
+				/*
+				 * If any of the pages use !page_std, we log them all as such.
+				 * That's a bit wasteful, but in practice, a mix of standard and
 			 * non-standard page layout is rare.  None of the built-in AMs do
 			 * that.
 			 */
@@ -279,7 +324,7 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
 		BlockNumber blkno = pending_writes[i].blkno;
 		Page		page = pending_writes[i].buf->data;

-		PageSetChecksum(page, blkno);
+			PageSetChecksum(page, blkno);

 		if (blkno >= bulkstate->relsize)
 		{
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 1e3e0b08f8..8ba29edc56 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -694,11 +694,6 @@ smgrinvalidatedatabase(Oid dbid)
 	smgrinvalidatedatabasetablespaces(dbid, 0, NULL);
 }

-
-
-
-
-
 void
 smgrmarkskipwalpending(RelFileLocator rlocator)
 {
diff --git a/src/backend/storage/smgr/umbra.c b/src/backend/storage/smgr/umbra.c
index 917dff0a64..f382d56c34 100644
--- a/src/backend/storage/smgr/umbra.c
+++ b/src/backend/storage/smgr/umbra.c
@@ -43,7 +43,6 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/map.h"
-#include "storage/md.h"
 #include "storage/smgr.h"
 #include "storage/umbra.h"
 #include "storage/umfile.h"
@@ -221,6 +220,7 @@ static void um_reserve_fresh_pblkno_for_access(SMgrRelation reln,
 											   BlockNumber lblkno,
 											   BlockNumber *new_pblkno);
 static bool um_fork_uses_map_translation(ForkNumber forknum);
+static bool um_fork_uses_wal_owned_firstborn(ForkNumber forknum);
 static bool um_mapped_exists_from_super(SMgrRelation reln, ForkNumber forknum);
 static UmbraMapPolicy um_open_map_state(SMgrRelation reln);
 static bool um_state_uses_map(UmbraMapPolicy state);
@@ -393,6 +393,8 @@ UmApplyReservedRangeRemap(SMgrRelation reln, ForkNumber forknum,

 	if (max_pblkno != InvalidBlockNumber)
 	{
+		MapSBlockBumpNextFreePhysBlock(ctx, reln->smgr_rlocator.locator,
+									   forknum, max_pblkno + 1, lsn);
 		MapSBlockBumpPhysicalNblocks(ctx, reln->smgr_rlocator.locator,
 									 forknum, max_pblkno + 1, lsn);
 		for (BlockNumber i = 0; i < nblocks; i++)
@@ -629,6 +631,13 @@ um_fork_uses_map_translation(ForkNumber forknum)
  * concentrated in a few helper paths, so we keep them on explicit mapping
  * publication rather than tying first-born ownership to arbitrary page WAL.
  */
+static bool
+um_fork_uses_wal_owned_firstborn(ForkNumber forknum)
+{
+	return UmbraForkUsesMapTranslation(forknum) &&
+		!UmbraForkIsAuxiliaryMapped(forknum);
+}
+
 static bool
 um_mapped_exists_from_super(SMgrRelation reln, ForkNumber forknum)
 {
@@ -1131,6 +1140,184 @@ um_resolve_mapped_read_run(SMgrRelation reln, ForkNumber forknum,
 	return 0;
 }

+static void
+um_materialize_pblk_zero_runs(UmbraFileContext *ctx, ForkNumber forknum,
+							  const BlockNumber *pblknos, BlockNumber nblocks,
+							  bool skipFsync)
+{
+	BlockNumber	run_start_pblk = InvalidBlockNumber;
+	BlockNumber	run_blocks = 0;
+
+	Assert(ctx != NULL);
+	Assert(pblknos != NULL);
+
+	for (BlockNumber i = 0; i < nblocks; i++)
+	{
+		BlockNumber pblk = pblknos[i];
+
+		if (run_blocks == 0)
+		{
+			run_start_pblk = pblk;
+			run_blocks = 1;
+		}
+		else if (pblk == run_start_pblk + run_blocks)
+		{
+			run_blocks++;
+		}
+		else
+		{
+			umfile_zeroextend(ctx, forknum, run_start_pblk,
+							  (int) run_blocks, skipFsync);
+			run_start_pblk = pblk;
+			run_blocks = 1;
+		}
+	}
+
+	if (run_blocks > 0)
+		umfile_zeroextend(ctx, forknum, run_start_pblk,
+						  (int) run_blocks, skipFsync);
+}
+
+static bool
+um_pblk_run_is_contiguous(const BlockNumber *pblknos, BlockNumber nblocks)
+{
+	Assert(pblknos != NULL);
+	Assert(nblocks > 0);
+
+	for (BlockNumber i = 1; i < nblocks; i++)
+	{
+		if (pblknos[i] != pblknos[0] + i)
+			return false;
+	}
+
+	return true;
+}
+
+static bool
+um_try_pure_firstborn_range_remap_zeroextend(SMgrRelation reln, ForkNumber forknum,
+											 const UmbraAccessState *access,
+											 BlockNumber blocknum,
+											 BlockNumber nblocks,
+											 bool skipFsync)
+{
+	UmbraFileContext *ctx = um_ctx_acquire(reln);
+	BlockNumber *pblknos;
+	xl_umbra_range_remap_entry *entries;
+	bool		wal_insert_enabled;
+	bool		applied = false;
+
+	Assert(access != NULL);
+	Assert(access->map_available);
+	Assert(nblocks > 0);
+
+	/*
+	 * Try to collapse an EOF zeroextend range into one or more RANGE_REMAP
+	 * records. RANGE_REMAP carries only new pblk ownership, so this helper is
+	 * deliberately all-or-nothing and only accepts pure first-born ranges.
+	 * Recovery consumes authoritative remap WAL instead of synthesizing new
+	 * range ownership locally.
+	 */
+	if (InRecovery || nblocks < 2)
+		return false;
+
+	pblknos = palloc(sizeof(BlockNumber) * nblocks);
+	entries = palloc(sizeof(xl_umbra_range_remap_entry) * nblocks);
+
+	for (BlockNumber i = 0; i < nblocks; i++)
+	{
+		BlockNumber lblk = blocknum + i;
+		BlockNumber pblk;
+
+		if (MapTryLookup(ctx, reln->smgr_rlocator.locator, forknum, lblk, &pblk) ||
+			MapInflightLookupOwnedPblk(reln->smgr_rlocator.locator,
+									   forknum, lblk, &pblk))
+		{
+			/*
+			 * Normal EOF extension should not get here.  Treat existing or
+			 * in-flight ownership as a compatibility fallback condition, not as
+			 * a mixed-range batching opportunity.
+			 */
+			pfree(entries);
+			pfree(pblknos);
+			return false;
+		}
+	}
+
+	for (BlockNumber i = 0; i < nblocks; i++)
+	{
+		BlockNumber lblk = blocknum + i;
+
+		if (!MapReserveFreshPblkno(ctx, reln->smgr_rlocator.locator,
+								   forknum, lblk, &pblknos[i]))
+		{
+			for (BlockNumber j = 0; j < i; j++)
+				MapInflightRelease(reln->smgr_rlocator.locator,
+								   forknum, blocknum + j);
+			pfree(entries);
+			pfree(pblknos);
+			return false;
+		}
+		entries[i].lblkno = lblk;
+		entries[i].new_pblkno = pblknos[i];
+	}
+
+	wal_insert_enabled =
+		XLogInsertAllowed() &&
+		!IsBootstrapProcessingMode() &&
+		!IsInitProcessingMode();
+
+	PG_TRY();
+	{
+		BlockNumber done = 0;
+
+		while (done < nblocks)
+		{
+			BlockNumber chunk_blocks = Min(nblocks - done,
+										   (BlockNumber) UINT16_MAX);
+			XLogRecPtr	map_lsn = InvalidXLogRecPtr;
+
+			if (wal_insert_enabled)
+			{
+				if (um_pblk_run_is_contiguous(pblknos + done, chunk_blocks))
+					map_lsn = log_umbra_range_remap_compact(
+						reln->smgr_rlocator.locator, forknum,
+						blocknum + done, pblknos[done], (uint16) chunk_blocks);
+				else
+					map_lsn = log_umbra_range_remap(
+						reln->smgr_rlocator.locator, forknum,
+						(uint16) chunk_blocks, entries + done);
+			}
+
+			um_materialize_pblk_zero_runs(ctx, forknum, pblknos + done,
+										  chunk_blocks, skipFsync);
+			UmApplyReservedRangeRemap(reln, forknum, blocknum + done,
+									  chunk_blocks, pblknos + done,
+									  map_lsn, skipFsync);
+			done += chunk_blocks;
+		}
+
+		applied = true;
+	}
+	PG_CATCH();
+	{
+		if (!applied)
+		{
+			for (BlockNumber i = 0; i < nblocks; i++)
+				MapInflightRelease(reln->smgr_rlocator.locator,
+								   forknum, blocknum + i);
+		}
+
+		pfree(entries);
+		pfree(pblknos);
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	pfree(entries);
+	pfree(pblknos);
+	return true;
+}
+
 static UmbraMappedBirthResult
 um_publish_mapped_birth(SMgrRelation reln, ForkNumber forknum,
 						const UmbraAccessState *access,
@@ -1139,26 +1326,83 @@ um_publish_mapped_birth(SMgrRelation reln, ForkNumber forknum,
 	UmbraFileContext *ctx = um_ctx_acquire(reln);
 	UmbraMappedBirthResult result;
 	BlockNumber old_pblkno;
+	XLogRecPtr	map_lsn;
+	bool		wal_insert_enabled;
+	bool		wal_owns_firstborn;
+	bool		emit_map_set;

 	Assert(access->map_available);
-	(void) allow_wal_owned_firstborn;

 	result.mapping_published = false;

+	if (InRecovery && um_fork_uses_wal_owned_firstborn(forknum))
+		elog(PANIC,
+			 "missing WAL mapping during recovery for relation %u/%u/%u fork %d blk %u",
+			 reln->smgr_rlocator.locator.spcOid,
+			 reln->smgr_rlocator.locator.dbOid,
+			 reln->smgr_rlocator.locator.relNumber,
+			 forknum, lblkno);
+
 	MapGetNewPbkno(ctx, reln->smgr_rlocator.locator, forknum, lblkno,
 				   &result.pblkno, &old_pblkno);
 	Assert(old_pblkno == InvalidBlockNumber);

-	MapSetMapping(ctx, reln->smgr_rlocator.locator, forknum, lblkno,
-				  result.pblkno, InvalidXLogRecPtr);
-	result.mapping_published = true;
+	wal_owns_firstborn = false;
+
+	if (InRecovery)
+	{
+		map_lsn = GetXLogReplayRecPtr(NULL);
+		MapSetMapping(ctx, reln->smgr_rlocator.locator, forknum, lblkno,
+					  result.pblkno, map_lsn);
+		result.mapping_published = true;
+	}
+	else
+	{
+		/*
+		 * Birth ownership needs crash-recovery WAL even at wal_level=minimal.
+		 * XLogIsNeeded() is too weak here because it suppresses WAL that is
+		 * still required to recover eager MAP_SET publication after a crash.
+		 */
+		wal_insert_enabled =
+			XLogInsertAllowed() &&
+			!IsBootstrapProcessingMode() &&
+			!IsInitProcessingMode();
+
+		wal_owns_firstborn =
+			allow_wal_owned_firstborn &&
+			wal_insert_enabled &&
+			UmWalOwnedFirstbornAvailable(reln, forknum, lblkno);
+
+		emit_map_set = !wal_owns_firstborn;
+
+		if (emit_map_set && wal_insert_enabled)
+			map_lsn = log_umbra_map_set(reln->smgr_rlocator.locator, forknum,
+										lblkno, old_pblkno, result.pblkno);
+		else
+			map_lsn = InvalidXLogRecPtr;
+
+		if (emit_map_set)
+		{
+			MapSetMapping(ctx, reln->smgr_rlocator.locator, forknum, lblkno,
+						  result.pblkno, map_lsn);
+			result.mapping_published = true;
+		}
+	}

 	if (result.mapping_published)
 		MapSBlockBumpNextFreePhysBlock(ctx, reln->smgr_rlocator.locator,
 									   forknum, result.pblkno + 1,
-									   InvalidXLogRecPtr);
+									   map_lsn);

-	MapInflightRelease(reln->smgr_rlocator.locator, forknum, lblkno);
+	/*
+	 * WAL-owned first-born pages keep their in-flight claim private until WAL
+	 * insertion succeeds and XLogCommitBlockRemapsUmbra() publishes the mapping.
+	 * Advancing the physical frontier or releasing the claim here would
+	 * let xloginsert reserve a second pblk for the same logical birth, leaving
+	 * alternating holes in the initial physical layout.
+	 */
+	if (!wal_owns_firstborn)
+		MapInflightRelease(reln->smgr_rlocator.locator, forknum, lblkno);
 	return result;
 }

@@ -1348,6 +1592,30 @@ UmMapAccessAvailable(SMgrRelation reln, ForkNumber forknum)
 	return access.map_available;
 }

+bool
+UmWalOwnedRemapAvailable(SMgrRelation reln, ForkNumber forknum)
+{
+	UmbraAccessState access;
+
+	access = um_classify_access(reln, forknum);
+	return access.policy == UMBRA_MAP_POLICY_REQUIRE_MAP;
+}
+
+bool
+UmWalOwnedFirstbornAvailable(SMgrRelation reln, ForkNumber forknum,
+							 BlockNumber lblkno)
+{
+	UmbraAccessState access;
+
+	(void) lblkno;
+	access = um_classify_access(reln, forknum);
+	return access.policy == UMBRA_MAP_POLICY_REQUIRE_MAP &&
+		XLogInsertAllowed() &&
+		!IsBootstrapProcessingMode() &&
+		!IsInitProcessingMode() &&
+		um_fork_uses_wal_owned_firstborn(forknum);
+}
+
 bool
 UmMapTryLookupPblkno(SMgrRelation reln, ForkNumber forknum,
 					 BlockNumber lblkno, BlockNumber *pblkno)
@@ -1598,6 +1866,12 @@ umzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		return;
 	}

+	if (um_try_pure_firstborn_range_remap_zeroextend(reln, forknum, &access,
+													blocknum,
+													(BlockNumber) nblocks,
+													skipFsync))
+		return;
+
 	/*
 	 * Per-block path for single-block, recovery, or callers that encountered
 	 * pre-existing/pending MAP ownership in the requested range.
@@ -1821,15 +2095,9 @@ um_startreadv_mapped_physical(PgAioHandle *ioh, SMgrRelation reln,
 {
 	if (aux_recovery_read)
 	{
-		uint64		ensured_bytes = 0;
-
 		if (!umfile_ctx_block_exists(ctx, forknum, pblk))
-		{
 			um_ensure_datafork_batch_ready_for_access(reln, forknum, access,
 													  pblk, true /* skipFsync */ );
-			ensured_bytes = BLCKSZ;
-		}
-		(void) ensured_bytes;
 	}

 	pgaio_io_set_target_smgr(ioh, reln, forknum,
@@ -1852,7 +2120,6 @@ umstartreadv(PgAioHandle *ioh, SMgrRelation reln, ForkNumber forknum,
 	UmbraFileContext *ctx = um_ctx_acquire(reln);
 	BlockNumber pblk;
 	bool		aux_recovery_read;
-
 	access = um_classify_access(reln, forknum);

if (!access.map_available)
@@ -1878,7 +2145,9 @@ umstartreadv(PgAioHandle *ioh, SMgrRelation reln, ForkNumber forknum,
 		&lookup_state,
 		aux_recovery_read, &pblk);
 	if (run_blocks == 0)
+	{
 		return;
+	}

 	if (run_blocks < nblocks)
 		ioh->handle_data_len = run_blocks;
@@ -2001,8 +2270,8 @@ umwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			pending_barrier.entry_idx = -1;

 			/*
-			 * The barrier serializes physical writes with concurrent remap publication for
-			 * the same logical block.  Claim before lookup so a later relocation
+			 * The barrier serializes physical writes with any foreign in-flight
+			 * remap for the same logical block.  Claim before lookup so a later remap
 			 * cannot publish a new mapping while this write is still targeting the
 			 * old physical page.
 			 */
diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index cccc4a24c8..8816e26f1f 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -22,6 +22,7 @@
 #include "commands/tablespace.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
+#include "storage/smgr.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -366,6 +367,8 @@ pg_relation_size(PG_FUNCTION_ARGS)
 	Oid			relOid = PG_GETARG_OID(0);
 	text	   *forkName = PG_GETARG_TEXT_PP(1);
 	Relation	rel;
+	SMgrRelation smgr;
+	ForkNumber	forknum;
 	int64		size;

 	rel = try_relation_open(relOid, AccessShareLock);
@@ -380,8 +383,21 @@ pg_relation_size(PG_FUNCTION_ARGS)
 	if (rel == NULL)
 		PG_RETURN_NULL();

-	size = calculate_relation_size(&(rel->rd_locator), rel->rd_backend,
-								   forkname_to_number(text_to_cstring(forkName)));
+	forknum = forkname_to_number(text_to_cstring(forkName));
+	size = 0;
+
+	/*
+	 * Umbra may remap a relation's logical blocks onto a sparse physical file.
+	 * SQL-visible relation size follows the storage manager's logical block
+	 * count, not raw stat(2) bytes.  Relations without storage and forks that
+	 * do not exist report zero, as the stat(2)-based path did.
+	 */
+	if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
+	{
+		smgr = RelationGetSmgr(rel);
+		if (smgrexists(smgr, forknum))
+			size = (int64) smgrnblocks(smgr, forknum) * BLCKSZ;
+	}

 	relation_close(rel, AccessShareLock);

diff --git a/src/bin/pg_waldump/.gitignore b/src/bin/pg_waldump/.gitignore
index ec51f41c76..8d694dc47a 100644
--- a/src/bin/pg_waldump/.gitignore
+++ b/src/bin/pg_waldump/.gitignore
@@ -21,6 +21,7 @@
 /spgdesc.c
 /standbydesc.c
 /tblspcdesc.c
+/umbradesc.c
 /xactdesc.c
 /xlogdesc.c

diff --git a/src/bin/pg_waldump/Makefile b/src/bin/pg_waldump/Makefile
index aabb87566a..f493f2b48d 100644
--- a/src/bin/pg_waldump/Makefile
+++ b/src/bin/pg_waldump/Makefile
@@ -24,6 +24,14 @@ override CPPFLAGS := -DFRONTEND -I$(libpq_srcdir) $(CPPFLAGS)
 LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils

 RMGRDESCSOURCES = $(sort $(notdir $(wildcard $(top_srcdir)/src/backend/access/rmgrdesc/*desc*.c)))
+
+# Umbra adds rmgrdesc/umbradesc.c, which should only be built when Umbra is
+# enabled.  pg_waldump uses a wildcard to compile all rmgrdesc sources, so we
+# must explicitly filter it out for md builds.
+ifneq ($(with_umbra), yes)
+RMGRDESCSOURCES := $(filter-out umbradesc.c,$(RMGRDESCSOURCES))
+endif
+
 RMGRDESCOBJS = $(patsubst %.c,%.o,$(RMGRDESCSOURCES))

@@ -52,6 +60,7 @@ uninstall:

 clean distclean:
 	rm -f pg_waldump$(X) $(OBJS) $(RMGRDESCSOURCES) xlogreader.c xlogstats.c
+	rm -f umbradesc.c umbradesc.o umbradesc.bc
 	rm -rf tmp_check

 check:
diff --git a/src/include/access/umbra_xlog.h b/src/include/access/umbra_xlog.h
index cb0c2bac57..6b2408d33c 100644
--- a/src/include/access/umbra_xlog.h
+++ b/src/include/access/umbra_xlog.h
@@ -5,6 +5,8 @@
  *
  * Umbra logs these record types:
  * - MAP_SET: establish/switch lblkno -> pblkno mapping
+ * - RANGE_REMAP: atomically establish a range of first-born mappings
+ * - RANGE_REMAP_COMPACT: same semantics for contiguous lblk/pblk runs
  * - SKIP_WAL_DENSE_MAP: record non-empty skip-WAL dense lblk==pblk frontiers
  *
  *-------------------------------------------------------------------------
@@ -19,6 +21,8 @@

 /* XLOG gives us high 4 bits */
 #define XLOG_UMBRA_MAP_SET			0x10
+#define XLOG_UMBRA_RANGE_REMAP		0x30
+#define XLOG_UMBRA_RANGE_REMAP_COMPACT	0x50
 #define XLOG_UMBRA_SKIP_WAL_DENSE_MAP	0x60

 typedef struct xl_umbra_map_set
@@ -30,6 +34,32 @@ typedef struct xl_umbra_map_set
 	BlockNumber new_pblkno;
 } xl_umbra_map_set;

+typedef struct xl_umbra_range_remap_entry
+{
+	BlockNumber	lblkno;
+	BlockNumber	new_pblkno;
+} xl_umbra_range_remap_entry;
+
+typedef struct xl_umbra_range_remap
+{
+	RelFileLocator rlocator;
+	ForkNumber	forknum;
+	uint16		count;
+	uint16		padding;
+	BlockNumber end_lblkno;
+	xl_umbra_range_remap_entry entries[FLEXIBLE_ARRAY_MEMBER];
+} xl_umbra_range_remap;
+
+typedef struct xl_umbra_range_remap_compact
+{
+	RelFileLocator rlocator;
+	ForkNumber	forknum;
+	uint16		count;
+	uint16		padding;
+	BlockNumber	first_lblkno;
+	BlockNumber	first_pblkno;
+} xl_umbra_range_remap_compact;
+
 typedef struct xl_umbra_skip_wal_dense_map_entry
 {
 	ForkNumber	forknum;
@@ -47,6 +77,15 @@ typedef struct xl_umbra_skip_wal_dense_map
 extern XLogRecPtr log_umbra_map_set(RelFileLocator rlocator, ForkNumber forknum,
 									BlockNumber lblkno, BlockNumber old_pblkno,
 									BlockNumber new_pblkno);
+extern XLogRecPtr log_umbra_range_remap(RelFileLocator rlocator,
+										ForkNumber forknum,
+										uint16 count,
+										const xl_umbra_range_remap_entry *entries);
+extern XLogRecPtr log_umbra_range_remap_compact(RelFileLocator rlocator,
+												ForkNumber forknum,
+												BlockNumber first_lblkno,
+												BlockNumber first_pblkno,
+												uint16 count);
 extern XLogRecPtr log_umbra_skip_wal_dense_map(RelFileLocator rlocator,
 											   uint16 count,
 											   const xl_umbra_skip_wal_dense_map_entry *entries);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 97eae2c1da..a71fa05f71 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -131,6 +131,13 @@ typedef struct

 	/* copy of the fork_flags field from the XLogRecordBlockHeader */
 	uint8		flags;
+#ifdef USE_UMBRA
+	bool		has_remap;
+	BlockNumber old_pblkno;
+	BlockNumber new_pblkno;
+	BlockNumber logical_nblocks;
+	BlockNumber next_free_pblkno;
+#endif

 	/* Information on full-page image, if any */
 	bool		has_image;		/* has image, even for consistency checking */
@@ -424,6 +431,10 @@ extern bool DecodeXLogRecord(XLogReaderState *state,
 	((decoder)->record->blocks[block_id].has_image)
 #define XLogRecBlockImageApply(decoder, block_id)		\
 	((decoder)->record->blocks[block_id].apply_image)
+#ifdef USE_UMBRA
+#define XLogRecBlockHasRemap(decoder, block_id)		\
+	((decoder)->record->blocks[block_id].has_remap)
+#endif
 #define XLogRecHasBlockData(decoder, block_id)		\
 	((decoder)->record->blocks[block_id].has_data)

diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 80764f9a26..24b1916a11 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -130,6 +130,27 @@ typedef struct XLogRecordBlockHeader

 #define SizeOfXLogRecordBlockHeader (offsetof(XLogRecordBlockHeader, data_length) + sizeof(uint16))

+/*
+ * Extra header information for UMBRA remap metadata.
+ *
+ * When BKPBLOCK_HAS_REMAP is set, this header follows
+ * XLogRecordBlockHeader and stores the physical remap transition for the
+ * referenced logical block.
+ */
+typedef struct XLogRecordBlockRemapHeader
+{
+	BlockNumber	old_pblkno;
+	BlockNumber	new_pblkno;
+	BlockNumber logical_nblocks;
+	BlockNumber next_free_pblkno;
+} XLogRecordBlockRemapHeader;
+
+#ifdef USE_UMBRA
+#define SizeOfXLogRecordBlockRemapHeader sizeof(XLogRecordBlockRemapHeader)
+#else
+#define SizeOfXLogRecordBlockRemapHeader 0
+#endif
+
 /*
  * Additional header information when a full-page image is included
  * (i.e. when BKPBLOCK_HAS_IMAGE is set).
@@ -200,6 +221,7 @@ typedef struct XLogRecordBlockCompressHeader
  */
 #define MaxSizeOfXLogRecordBlockHeader \
 	(SizeOfXLogRecordBlockHeader + \
+	 SizeOfXLogRecordBlockRemapHeader + \
 	 SizeOfXLogRecordBlockImageHeader + \
 	 SizeOfXLogRecordBlockCompressHeader + \
 	 sizeof(RelFileLocator) + \
@@ -209,8 +231,19 @@ typedef struct XLogRecordBlockCompressHeader
  * The fork number fits in the lower 4 bits in the fork_flags field. The upper
  * bits are used for flags.
  */
+/*
+ * With USE_UMBRA the fork number is limited to the low 3 bits so that bit
+ * 0x08 can carry BKPBLOCK_HAS_REMAP; otherwise it keeps the low 4 bits.
+ */
+#ifdef USE_UMBRA
+#define BKPBLOCK_FORK_MASK	0x07
+#define BKPBLOCK_HAS_REMAP	0x08	/* has remap metadata in WAL header */
+#define BKPBLOCK_FLAG_MASK	0xF8
+#else
 #define BKPBLOCK_FORK_MASK	0x0F
+#define BKPBLOCK_HAS_REMAP	0x00
 #define BKPBLOCK_FLAG_MASK	0xF0
+#endif
 #define BKPBLOCK_HAS_IMAGE	0x10	/* block data is an XLogRecordBlockImage */
 #define BKPBLOCK_HAS_DATA	0x20
 #define BKPBLOCK_WILL_INIT	0x40	/* redo will re-init the page */
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index a527f446f2..55a2de4df7 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -64,11 +64,19 @@ tests += {
       't/053_umbra_map_superblock_watermark.pl',
       't/054_umbra_map_fork_policy.pl',
       't/056_umbra_truncate_superblock.pl',
+      't/057_umbra_remap_crash_consistency.pl',
+      't/058_umbra_2pc_remap_recovery.pl',
       't/061_umbra_fsm_vm_map_translation.pl',
       't/062_umbra_truncate_drop_crash_matrix.pl',
       't/063_umbra_mainfork_head_unlink_checkpoint.pl',
       't/066_umbra_truncate_redo.pl',
+      't/067_umbra_remap_redo.pl',
+      't/068_umbra_old_baseline_checkpoint_window.pl',
+      't/069_umbra_range_remap_zeroextend.pl',
+      't/070_umbra_hash_birth_block_remap.pl',
       't/071_umbra_skip_wal_dense_map.pl',
+      't/072_umbra_ordinary_slim_block_remap.pl',
+      't/074_umbra_torn_page_remap.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/057_umbra_remap_crash_consistency.pl b/src/test/recovery/t/057_umbra_remap_crash_consistency.pl
new file mode 100644
index 0000000000..557de9bb3b
--- /dev/null
+++ b/src/test/recovery/t/057_umbra_remap_crash_consistency.pl
@@ -0,0 +1,74 @@
+# Verify remap-heavy workload remains consistent after crash restart.
+#
+# This is UMBRA-specific and skipped in md mode.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+my $node = PostgreSQL::Test::Cluster->new('master');
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq{
+autovacuum = off
+full_page_writes = on
+});
+$node->start();
+
+$node->safe_psql('postgres',
+	q{CREATE TABLE umb_remap_t(id int PRIMARY KEY, payload text);});
+
+$node->safe_psql(
+	'postgres', q{
+CREATE INDEX umb_remap_payload_idx ON umb_remap_t ((left(payload, 16)));
+INSERT INTO umb_remap_t
+SELECT g, repeat('a', 320) FROM generate_series(1, 30000) g;
+CHECKPOINT;
+UPDATE umb_remap_t
+SET payload = md5(id::text) || repeat('u', 280)
+WHERE id % 3 = 0;
+DELETE FROM umb_remap_t WHERE id % 17 = 0;
+INSERT INTO umb_remap_t
+SELECT g, repeat('n', 320) FROM generate_series(30001, 32000) g;
+});
+
+my $before = $node->safe_psql(
+	'postgres', q{
+SELECT count(*) || ',' ||
+	   sum(length(payload))::bigint || ',' ||
+	   sum(id)::bigint
+FROM umb_remap_t;
+});
+
+$node->stop('immediate');
+$node->start();
+
+my $after = $node->safe_psql(
+	'postgres', q{
+SELECT count(*) || ',' ||
+	   sum(length(payload))::bigint || ',' ||
+	   sum(id)::bigint
+FROM umb_remap_t;
+});
+
+is($after, $before, 'aggregate state preserved across crash restart');
+
+my $idx_count = $node->safe_psql(
+	'postgres', q{
+SET enable_seqscan = off;
+SELECT count(*) FROM umb_remap_t WHERE id BETWEEN 100 AND 30000;
+});
+my $seq_count = $node->safe_psql(
+	'postgres', q{
+SET enable_indexscan = off;
+SET enable_bitmapscan = off;
+SELECT count(*) FROM umb_remap_t WHERE id BETWEEN 100 AND 30000;
+});
+is($idx_count, $seq_count, 'index path and seq path return same rowcount');
+
+done_testing();
diff --git a/src/test/recovery/t/058_umbra_2pc_remap_recovery.pl b/src/test/recovery/t/058_umbra_2pc_remap_recovery.pl
new file mode 100644
index 0000000000..d3e9945df6
--- /dev/null
+++ b/src/test/recovery/t/058_umbra_2pc_remap_recovery.pl
@@ -0,0 +1,90 @@
+# Verify 2PC + remap workload correctness across crash recovery.
+#
+# This is UMBRA-specific and skipped in md mode.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+my $node = PostgreSQL::Test::Cluster->new('master');
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq{
+autovacuum = off
+full_page_writes = on
+max_prepared_transactions = 10
+});
+$node->start();
+
+$node->safe_psql('postgres',
+	q{CREATE TABLE umb_2pc_t(id int PRIMARY KEY, payload text);});
+
+$node->safe_psql(
+	'postgres', q{
+CREATE INDEX umb_2pc_payload_idx ON umb_2pc_t ((left(payload, 16)));
+INSERT INTO umb_2pc_t
+SELECT g, repeat('b', 300) FROM generate_series(1, 15000) g;
+CHECKPOINT;
+});
+
+$node->safe_psql(
+	'postgres', q{
+BEGIN;
+UPDATE umb_2pc_t SET payload = 'gx1_' || id::text WHERE id % 5 = 0;
+DELETE FROM umb_2pc_t WHERE id % 97 = 0;
+INSERT INTO umb_2pc_t SELECT g, repeat('x', 300) FROM generate_series(20001, 20500) g;
+PREPARE TRANSACTION 'umbra_gx1';
+});
+
+$node->safe_psql(
+	'postgres', q{
+BEGIN;
+UPDATE umb_2pc_t
+SET payload = 'gx2_' || id::text
+WHERE id % 5 = 1 AND id % 97 <> 0;
+INSERT INTO umb_2pc_t SELECT g, repeat('y', 300) FROM generate_series(21001, 21200) g;
+PREPARE TRANSACTION 'umbra_gx2';
+});
+
+$node->stop('immediate');
+$node->start();
+
+is($node->safe_psql(
+		'postgres',
+		q{SELECT count(*) FROM pg_prepared_xacts WHERE gid IN ('umbra_gx1','umbra_gx2');}),
+	'2',
+	'prepared transactions survive crash recovery');
+
+$node->safe_psql('postgres', q{COMMIT PREPARED 'umbra_gx1';});
+$node->safe_psql('postgres', q{ROLLBACK PREPARED 'umbra_gx2';});
+
+is($node->safe_psql('postgres', q{SELECT count(*) FROM umb_2pc_t;}), '15346',
+	'row count matches expected after commit/rollback prepared');
+is($node->safe_psql('postgres', q{SELECT count(*) FROM umb_2pc_t WHERE id BETWEEN 20001 AND 20500;}), '500',
+	'gx1 inserted rows are visible');
+is($node->safe_psql('postgres', q{SELECT count(*) FROM umb_2pc_t WHERE id BETWEEN 21001 AND 21200;}), '0',
+	'gx2 inserted rows are absent');
+is($node->safe_psql('postgres', q{SELECT count(*) FROM umb_2pc_t WHERE id % 5 = 0 AND payload LIKE 'gx1_%';}), '2970',
+	'gx1 updates are visible with expected count');
+is($node->safe_psql('postgres', q{SELECT count(*) FROM umb_2pc_t WHERE payload LIKE 'gx2_%';}), '0',
+	'gx2 updates are absent after rollback prepared');
+
+my $idx_count = $node->safe_psql(
+	'postgres', q{
+SET enable_seqscan = off;
+SELECT count(*) FROM umb_2pc_t WHERE id BETWEEN 100 AND 14900;
+});
+my $seq_count = $node->safe_psql(
+	'postgres', q{
+SET enable_indexscan = off;
+SET enable_bitmapscan = off;
+SELECT count(*) FROM umb_2pc_t WHERE id BETWEEN 100 AND 14900;
+});
+is($idx_count, $seq_count, 'index path and seq path match after 2PC recovery');
+
+done_testing();
diff --git a/src/test/recovery/t/067_umbra_remap_redo.pl b/src/test/recovery/t/067_umbra_remap_redo.pl
new file mode 100644
index 0000000000..c554ddc239
--- /dev/null
+++ b/src/test/recovery/t/067_umbra_remap_redo.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+my $node = PostgreSQL::Test::Cluster->new('umbra_remap');
+
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq[
+wal_level = 'replica'
+autovacuum = off
+]);
+$node->start();
+
+$node->safe_psql(
+	'postgres', q[
+CREATE TABLE umbra_hash(k int, filler text);
+INSERT INTO umbra_hash
+SELECT g % 97, repeat(md5(g::text), 2)
+FROM generate_series(1, 4000) AS g;
+CREATE INDEX umbra_hash_idx ON umbra_hash USING hash (k);
+
+CREATE TABLE umbra_brin(i int, filler text);
+INSERT INTO umbra_brin
+SELECT g, repeat('x', 200)
+FROM generate_series(1, 12000) AS g;
+CREATE INDEX umbra_brin_idx
+ON umbra_brin USING brin (i) WITH (pages_per_range = 1);
+ANALYZE umbra_hash;
+ANALYZE umbra_brin;
+]);
+
+$node->stop('immediate');
+ok($node->start(), 'restart after hash/brin index build crash');
+
+my $hash_plan = $node->safe_psql(
+	'postgres', q[
+SET enable_seqscan = off;
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM umbra_hash WHERE k = 42;
+]);
+like($hash_plan, qr/umbra_hash_idx/, 'hash index plan survived recovery');
+
+is($node->safe_psql('postgres',
+		q[SELECT count(*) FROM umbra_hash WHERE k = 42]),
+	'41', 'hash index-backed equality query returns expected rows');
+
+my $brin_plan = $node->safe_psql(
+	'postgres', q[
+SET enable_seqscan = off;
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM umbra_brin WHERE i BETWEEN 2500 AND 2600;
+]);
+like($brin_plan, qr/umbra_brin_idx/, 'brin index plan survived recovery');
+
+is($node->safe_psql('postgres',
+		q[SELECT count(*) FROM umbra_brin WHERE i BETWEEN 2500 AND 2600]),
+	'101', 'brin range query returns expected rows after recovery');
+
+$node->safe_psql(
+	'postgres', q[
+INSERT INTO umbra_hash
+SELECT 42, repeat('y', 64)
+FROM generate_series(1, 9);
+INSERT INTO umbra_brin
+SELECT g, repeat('z', 200)
+FROM generate_series(12001, 12200) AS g;
+CHECKPOINT;
+]);
+
+$node->stop('immediate');
+ok($node->start(), 'restart after post-recovery indexed writes');
+
+is($node->safe_psql('postgres',
+		q[SELECT count(*) FROM umbra_hash WHERE k = 42]),
+	'50', 'hash index remains usable after second restart');
+
+is($node->safe_psql('postgres',
+		q[SELECT count(*) FROM umbra_brin WHERE i BETWEEN 12100 AND 12150]),
+	'51', 'brin index remains usable after second restart');
+
+done_testing();
diff --git a/src/test/recovery/t/068_umbra_old_baseline_checkpoint_window.pl b/src/test/recovery/t/068_umbra_old_baseline_checkpoint_window.pl
new file mode 100644
index 0000000000..0ed178885e
--- /dev/null
+++ b/src/test/recovery/t/068_umbra_old_baseline_checkpoint_window.pl
@@ -0,0 +1,85 @@
+#
+# Verify that a post-checkpoint remap keeps the old physical page alive long
+# enough for crash recovery before the next checkpoint boundary.
+#
+# The contract under test is:
+# - establish a checkpoint
+# - modify existing logical pages afterwards, so redo must rely on the old
+#   physical page as baseline instead of a new checkpoint image
+# - crash before any later checkpoint
+# - restart must still recover the updated relation correctly
+#
+# This is UMBRA-specific and skipped in md mode.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+my $node = PostgreSQL::Test::Cluster->new('master');
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq{
+autovacuum = off
+full_page_writes = on
+checkpoint_timeout = '30min'
+max_wal_size = '4GB'
+});
+$node->start();
+
+$node->safe_psql('postgres',
+	q{CREATE TABLE umb_old_baseline_t(id int PRIMARY KEY, payload text);});
+
+$node->safe_psql(
+	'postgres', q{
+INSERT INTO umb_old_baseline_t
+SELECT g, repeat('a', 700) FROM generate_series(1, 4000) g;
+CHECKPOINT;
+UPDATE umb_old_baseline_t
+SET payload = md5(id::text) || repeat('u', 668)
+WHERE id % 2 = 0;
+});
+
+my $before = $node->safe_psql(
+	'postgres', q{
+SELECT count(*) || ',' ||
+	   sum(length(payload))::bigint || ',' ||
+	   sum((left(payload, 8) = left(md5(id::text), 8))::int)::bigint
+FROM umb_old_baseline_t;
+});
+
+$node->stop('immediate');
+$node->start();
+
+my $after = $node->safe_psql(
+	'postgres', q{
+SELECT count(*) || ',' ||
+	   sum(length(payload))::bigint || ',' ||
+	   sum((left(payload, 8) = left(md5(id::text), 8))::int)::bigint
+FROM umb_old_baseline_t;
+});
+
+is($after, $before,
+	'post-checkpoint remap survives crash before next checkpoint');
+
+is($node->safe_psql(
+		'postgres',
+		q{SELECT count(*) FROM umb_old_baseline_t
+		   WHERE id % 2 = 0
+			 AND left(payload, 8) = left(md5(id::text), 8);}),
+	'2000',
+	'even rows were recovered from remap baseline');
+
+is($node->safe_psql(
+		'postgres',
+		q{SELECT count(*) FROM umb_old_baseline_t
+		   WHERE id % 2 = 1
+			 AND payload = repeat('a', 700);}),
+	'2000',
+	'odd rows kept original payload');
+
+done_testing();
diff --git a/src/test/recovery/t/069_umbra_range_remap_zeroextend.pl b/src/test/recovery/t/069_umbra_range_remap_zeroextend.pl
new file mode 100644
index 0000000000..6c816ab0be
--- /dev/null
+++ b/src/test/recovery/t/069_umbra_range_remap_zeroextend.pl
@@ -0,0 +1,101 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+my $node = PostgreSQL::Test::Cluster->new('umbra_range_remap_zeroextend');
+my $input = $node->basedir . '/copy_input.csv';
+
+$node->init(has_archiving => 1);
+$node->append_conf(
+	'postgresql.conf', qq[
+wal_level = 'replica'
+autovacuum = off
+shared_buffers = '256MB'
+max_wal_size = '4GB'
+min_wal_size = '1GB'
+checkpoint_timeout = '1h'
+]);
+$node->start();
+
+open(my $fh, '>', $input) or die "could not create $input: $!";
+my $pad = 'x' x 200;
+for my $i (1 .. 200_000)
+{
+	print {$fh} "$i,$pad\n";
+}
+close($fh);
+
+$node->safe_psql('postgres', q[
+CREATE TABLE umbra_range_probe (id bigint, pad text);
+SELECT pg_switch_wal();
+]);
+
+my $start_lsn =
+  $node->safe_psql('postgres', q[SELECT pg_current_wal_lsn();]);
+
+$node->safe_psql('postgres',
+	qq[COPY umbra_range_probe FROM '$input' WITH (FORMAT csv);]);
+
+my $end_lsn =
+  $node->safe_psql('postgres', q[SELECT pg_current_wal_lsn();]);
+
+$node->safe_psql('postgres', q[
+SELECT pg_switch_wal();
+CHECKPOINT;
+]);
+$node->stop();
+
+my ($dump_stdout, $dump_stderr) = run_command(
+	[
+		'pg_waldump', '-p', $node->archive_dir,
+		'--start',   $start_lsn,
+		'--end',     $end_lsn
+	]);
+is($dump_stderr, '', 'pg_waldump raw dump completed without stderr');
+
+my ($stats_stdout, $stats_stderr) = run_command(
+	[
+		'pg_waldump', '-p', $node->archive_dir,
+		'--stats=record',
+		'--start', $start_lsn,
+		'--end',   $end_lsn
+	]);
+is($stats_stderr, '', 'pg_waldump stats completed without stderr');
+ok($stats_stdout =~ /Umbra\/RANGE_REMAP(?:_COMPACT)?\s+\d+/,
+   'WAL stats report Umbra range remap records');
+
+my @main_range_lines =
+  grep { /desc: RANGE_REMAP(?:_COMPACT)?/ && $_ !~ /_(?:fsm|vm)\b/ }
+  split /\n/, $dump_stdout;
+ok(@main_range_lines > 0,
+   'raw WAL dump contains main-fork RANGE_REMAP records');
+
+my $main_range_records = 0;
+my $main_range_pages = 0;
+my $max_main_range = 0;
+for my $line (@main_range_lines)
+{
+	if ($line =~ /count (\d+)/)
+	{
+		my $count = $1;
+
+		$main_range_records++;
+		$main_range_pages += $count;
+		$max_main_range = $count if $count > $max_main_range;
+	}
+}
+
+cmp_ok($max_main_range, '>', 1,
+	   'main-fork RANGE_REMAP batches more than one page');
+cmp_ok($main_range_pages - $main_range_records, '>', 0,
+	   'main-fork RANGE_REMAP collapses multiple first-born pages into fewer WAL records');
+
+done_testing();
diff --git a/src/test/recovery/t/070_umbra_hash_birth_block_remap.pl b/src/test/recovery/t/070_umbra_hash_birth_block_remap.pl
new file mode 100644
index 0000000000..ede9a1ff44
--- /dev/null
+++ b/src/test/recovery/t/070_umbra_hash_birth_block_remap.pl
@@ -0,0 +1,66 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+my $node = PostgreSQL::Test::Cluster->new('umbra_hash_birth_block_remap');
+
+$node->init(has_archiving => 1);
+$node->append_conf(
+	'postgresql.conf', qq[
+wal_level = 'replica'
+autovacuum = off
+shared_buffers = '256MB'
+max_wal_size = '4GB'
+min_wal_size = '1GB'
+checkpoint_timeout = '1h'
+]);
+$node->start();
+
+$node->safe_psql('postgres', q[
+CREATE TABLE hash_birth_probe (id bigint);
+CREATE INDEX hash_birth_probe_idx ON hash_birth_probe USING hash (id);
+SELECT pg_switch_wal();
+]);
+
+my $start_lsn =
+  $node->safe_psql('postgres', q[SELECT pg_current_wal_lsn();]);
+
+$node->safe_psql('postgres', q[
+INSERT INTO hash_birth_probe
+SELECT g
+FROM generate_series(1, 600000) AS g;
+]);
+
+my $end_lsn =
+  $node->safe_psql('postgres', q[SELECT pg_current_wal_lsn();]);
+
+$node->safe_psql('postgres', q[
+SELECT pg_switch_wal();
+CHECKPOINT;
+]);
+$node->stop();
+
+my ($dump_stdout, $dump_stderr) = run_command(
+	[
+		'pg_waldump', '-b', '-p', $node->archive_dir,
+		'--start', $start_lsn,
+		'--end',   $end_lsn
+	]);
+is($dump_stderr, '', 'pg_waldump block dump completed without stderr');
+
+my @remap_header_lines =
+  grep { /; remap: old_pblk \d+ new_pblk \d+ logical_nblocks \d+ next_free_pblk \d+/ }
+  split /\n/, $dump_stdout;
+
+ok(@remap_header_lines > 0,
+   'raw WAL dump contains full remap block headers for hash index pages');
+
+done_testing();
diff --git a/src/test/recovery/t/072_umbra_ordinary_slim_block_remap.pl b/src/test/recovery/t/072_umbra_ordinary_slim_block_remap.pl
new file mode 100644
index 0000000000..0fe986abfa
--- /dev/null
+++ b/src/test/recovery/t/072_umbra_ordinary_slim_block_remap.pl
@@ -0,0 +1,69 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+my $node = PostgreSQL::Test::Cluster->new('umbra_ordinary_slim_block_remap');
+
+$node->init(has_archiving => 1);
+$node->append_conf(
+	'postgresql.conf', qq[
+wal_level = 'replica'
+autovacuum = off
+shared_buffers = '256MB'
+max_wal_size = '4GB'
+min_wal_size = '1GB'
+checkpoint_timeout = '1h'
+]);
+$node->start();
+
+$node->safe_psql('postgres', q[
+CREATE TABLE ordinary_slim_probe (id bigint, payload text) WITH (fillfactor = 70);
+INSERT INTO ordinary_slim_probe
+SELECT g, repeat('x', 80)
+FROM generate_series(1, 200000) AS g;
+CHECKPOINT;
+SELECT pg_switch_wal();
+]);
+
+my $start_lsn =
+  $node->safe_psql('postgres', q[SELECT pg_current_wal_lsn();]);
+
+$node->safe_psql('postgres', q[
+UPDATE ordinary_slim_probe
+SET payload = repeat('y', 80)
+WHERE id <= 100000;
+]);
+
+my $end_lsn =
+  $node->safe_psql('postgres', q[SELECT pg_current_wal_lsn();]);
+
+$node->safe_psql('postgres', q[
+SELECT pg_switch_wal();
+CHECKPOINT;
+]);
+$node->stop();
+
+my ($dump_stdout, $dump_stderr) = run_command(
+	[
+		'pg_waldump', '-b', '-p', $node->archive_dir,
+		'--start', $start_lsn,
+		'--end',   $end_lsn
+	]);
+is($dump_stderr, '', 'pg_waldump block dump completed without stderr');
+
+my @remap_header_lines =
+  grep { /; remap: old_pblk \d+ new_pblk \d+ logical_nblocks \d+ next_free_pblk \d+/ }
+  split /\n/, $dump_stdout;
+
+ok(@remap_header_lines > 0,
+   'raw WAL dump contains full remap block headers for updated heap pages');
+
+done_testing();
diff --git a/src/test/recovery/t/074_umbra_torn_page_remap.pl b/src/test/recovery/t/074_umbra_torn_page_remap.pl
new file mode 100644
index 0000000000..2c427757ed
--- /dev/null
+++ b/src/test/recovery/t/074_umbra_torn_page_remap.pl
@@ -0,0 +1,261 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+# Verify that Umbra crash recovery can recover a remapped heap page even when
+# the newly allocated physical block contains a torn write image.  In md mode,
+# run the same workload with full_page_writes=off as a negative control: the
+# manually torn heap page must not be recoverable as correct data.
+#
+# The test:
+# - checkpoints a relation, then updates existing heap pages
+# - in Umbra mode, extracts one new physical block number from the remap WAL
+# - in md mode, extracts one updated heap block as the negative control target
+# - kills the server, overwrites half of that physical block, and restarts
+# - verifies Umbra restores the logical relation contents while md/FPW-off
+#   cannot recover the torn page as correct data
+use strict;
+use warnings FATAL => 'all';
+
+use Fcntl qw(O_CREAT O_RDWR SEEK_SET);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $use_umbra = check_pg_config('^#define USE_UMBRA 1$');
+
+sub overwrite_half_physical_block
+{
+	my ($node, $relpath, $block_size, $pblkno) = @_;
+
+	my $seg_blocks = int((1024 * 1024 * 1024) / $block_size);
+	my $segno = int($pblkno / $seg_blocks);
+	my $segblk = $pblkno % $seg_blocks;
+	my $path = $node->data_dir . '/' . $relpath . ($segno == 0 ? '' : ".$segno");
+	my $offset = $segblk * $block_size;
+	my $zeros = "\0" x int($block_size / 2);
+
+	sysopen(my $fh, $path, O_RDWR | O_CREAT, 0600)
+	  or die "could not open $path: $!";
+	binmode($fh);
+	defined(sysseek($fh, $offset, SEEK_SET))
+	  or die "could not seek to $offset in $path: $!";
+	my $written = syswrite($fh, $zeros);
+	die "could not overwrite torn half-page in $path: $!"
+	  unless defined($written) && $written == length($zeros);
+	close($fh) or die "could not close $path: $!";
+
+	return ($path, $offset);
+}
+
+sub setup_node
+{
+	my ($name, $fpw) = @_;
+	my $node = PostgreSQL::Test::Cluster->new($name);
+
+	$node->init();
+	$node->append_conf(
+		'postgresql.conf', qq[
+autovacuum = off
+full_page_writes = $fpw
+shared_buffers = '256MB'
+max_wal_size = '4GB'
+min_wal_size = '1GB'
+checkpoint_timeout = '1h'
+]);
+	$node->start();
+	return $node;
+}
+
+sub prepare_and_update_table
+{
+	my ($node) = @_;
+
+	$node->safe_psql('postgres', q[
+CREATE TABLE umb_torn_page_t(id bigint, payload text)
+  WITH (fillfactor = 70);
+INSERT INTO umb_torn_page_t
+SELECT g, repeat('x', 80)
+FROM generate_series(1, 200000) AS g;
+CHECKPOINT;
+]);
+
+	my $relinfo = $node->safe_psql('postgres', q[
+SELECT (CASE WHEN c.reltablespace = 0
+             THEN d.dattablespace
+             ELSE c.reltablespace
+        END)::text || '/' ||
+       d.oid::text || '/' ||
+       pg_relation_filenode(c.oid)::text || '|' ||
+       pg_relation_filepath(c.oid) || '|' ||
+       current_setting('block_size')
+FROM pg_class c
+JOIN pg_database d ON d.datname = current_database()
+WHERE c.oid = 'umb_torn_page_t'::regclass;
+]);
+	my ($locator, $relpath, $block_size) = split /\|/, $relinfo;
+
+	my $start_lsn =
+	  $node->safe_psql('postgres', q[SELECT pg_current_wal_lsn();]);
+
+	$node->safe_psql('postgres', q[
+UPDATE umb_torn_page_t
+SET payload = md5(id::text) || repeat('u', 48)
+WHERE id <= 100000;
+]);
+
+	my $before = $node->safe_psql('postgres', relation_signature_sql());
+
+	my $end_lsn =
+	  $node->safe_psql('postgres', q[SELECT pg_current_wal_lsn();]);
+
+	my ($dump_stdout, $dump_stderr) = run_command(
+		[
+			'pg_waldump', '-b', '-p', $node->data_dir . '/pg_wal',
+			'--start', $start_lsn,
+			'--end',   $end_lsn
+		]);
+	$dump_stderr =~
+	  s/^pg_waldump: first record is after [^\n]+, at [^\n]+, skipping over \d+ bytes\n?//m;
+	is($dump_stderr, '',
+		'pg_waldump block dump completed without unexpected stderr');
+
+	return ($locator, $relpath, $block_size, $before, $dump_stdout);
+}
+
+sub relation_signature_sql
+{
+	return q[
+SELECT count(*) || ',' ||
+       md5(string_agg(md5(id::text || ':' || payload), '' ORDER BY id))
+FROM umb_torn_page_t;
+];
+}
+
+sub find_umbra_remap
+{
+	my ($locator, $dump_stdout) = @_;
+
+	foreach my $blkref (split /(?=blkref #\d+:)/, $dump_stdout)
+	{
+		next
+		  unless $blkref =~ /blkref #\d+: rel \Q$locator\E fork main blk (\d+)/;
+		my $lblk = $1;
+		next
+		  unless $blkref =~
+		  /; remap: old_pblk (\d+) new_pblk (\d+) logical_nblocks \d+ next_free_pblk \d+/;
+
+		my ($old, $new) = ($1, $2);
+		next if $old == 4294967295;
+		next if $old == $new;
+
+		return ($lblk, $old, $new);
+	}
+
+	return;
+}
+
+sub find_md_heap_block
+{
+	my ($locator, $dump_stdout) = @_;
+
+	foreach my $blkref (split /(?=blkref #\d+:)/, $dump_stdout)
+	{
+		next
+		  unless $blkref =~ /blkref #\d+: rel \Q$locator\E fork main blk (\d+)/;
+		my $lblk = $1;
+		next if $lblk == 0;
+		return $lblk;
+	}
+
+	return;
+}
+
+sub verify_table_contents
+{
+	my ($node, $before) = @_;
+
+	my $after = $node->safe_psql('postgres', relation_signature_sql());
+
+	is($after, $before,
+		'recovery restores relation contents after torn new physical block');
+
+	is($node->safe_psql(
+			'postgres',
+			q[SELECT count(*) FROM umb_torn_page_t
+			   WHERE id <= 100000
+			     AND left(payload, 8) = left(md5(id::text), 8);]),
+		'100000',
+		'updated rows are visible after recovery');
+
+	is($node->safe_psql(
+			'postgres',
+			q[SELECT count(*) FROM umb_torn_page_t
+			   WHERE id > 100000
+			     AND payload = repeat('x', 80);]),
+		'100000',
+		'unmodified rows remain visible after recovery');
+}
+
+if ($use_umbra)
+{
+	my $node = setup_node('umbra_torn_page_remap', 'on');
+	my ($locator, $relpath, $block_size, $before, $dump_stdout) =
+	  prepare_and_update_table($node);
+
+	my ($target_lblk, $old_pblk, $new_pblk) =
+	  find_umbra_remap($locator, $dump_stdout);
+
+	ok(defined($new_pblk),
+		'update WAL contains a heap remap header with a new physical block');
+	BAIL_OUT('could not locate a concrete Umbra remap block for test relation')
+		unless defined($new_pblk);
+	cmp_ok($new_pblk, '!=', $old_pblk,
+		'selected WAL remap moves the heap page to a different physical block');
+
+	$node->stop('immediate');
+
+	my ($corrupt_path, $corrupt_offset) =
+	  overwrite_half_physical_block($node, $relpath, $block_size, $new_pblk);
+	ok(-e $corrupt_path,
+		'new physical block segment exists after torn-write injection');
+	ok($corrupt_offset >= 0,
+		'torn-write injection targeted a concrete physical offset');
+
+	$node->start();
+	verify_table_contents($node, $before);
+}
+else
+{
+	my $node = setup_node('md_torn_page_fpw_off', 'off');
+	my ($locator, $relpath, $block_size, $before, $dump_stdout) =
+	  prepare_and_update_table($node);
+
+	my $target_lblk = find_md_heap_block($locator, $dump_stdout);
+	ok(defined($target_lblk),
+		'update WAL contains a heap block reference for md negative control');
+	BAIL_OUT('could not locate a concrete md heap block for test relation')
+		unless defined($target_lblk);
+
+	$node->stop('immediate');
+
+	my ($corrupt_path, $corrupt_offset) =
+	  overwrite_half_physical_block($node, $relpath, $block_size, $target_lblk);
+	ok(-e $corrupt_path,
+		'md heap segment exists after torn-write injection');
+	ok($corrupt_offset >= 0,
+		'md torn-write injection targeted a concrete physical offset');
+
+	my $started = $node->start(fail_ok => 1);
+	if (!$started)
+	{
+		pass('md with full_page_writes=off cannot restart from the torn page');
+	}
+	else
+	{
+		my ($ret, $stdout, $stderr) =
+		  $node->psql('postgres', relation_signature_sql());
+		ok($ret != 0 || $stdout ne $before,
+			'md with full_page_writes=off does not recover correct data from the torn page');
+	}
+}
+
+done_testing();
-- 
2.50.1 (Apple Git-155)

#11

Mingwei Jia

i@nayishan.top

22 days ago

In reply to: Mingwei Jia (#4)

[RFC PATCH v2 RESEND 09/10] umbra: add patch 8 checkpoint/mapwriter writeback and physical preallocation
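
The foreground/mapwriter split that this patch adds in MapMaybePreallocateFork() can be read as a small decision table over two watermarks. The following is a minimal standalone sketch of that table, not code from the patch: the names PreallocAction, prealloc_decide and prealloc_target are hypothetical, and locking, superblock validation and the actual file preallocation are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type; the patch itself returns a bool. */
typedef enum
{
	PREALLOC_NONE,			/* nothing to do */
	PREALLOC_WAKE_WRITER,	/* foreground only nudges the mapwriter */
	PREALLOC_DO				/* caller extends physical capacity now */
} PreallocAction;

/*
 * Mirror of the watermark checks: "next" is the next free physical block,
 * "capacity" the preallocated physical size, both in blocks.
 */
static PreallocAction
prealloc_decide(uint32_t next, uint32_t capacity,
				uint32_t soft_low, uint32_t hard_low,
				bool background_mode, bool in_progress)
{
	uint32_t	remaining;

	/* The patch clamps the hard watermark to the soft one. */
	assert(hard_low <= soft_low);

	/* Small relations are never preallocated. */
	if (next < soft_low)
		return PREALLOC_NONE;

	remaining = (capacity > next) ? (capacity - next) : 0;

	/* Plenty of headroom left. */
	if (remaining > soft_low)
		return PREALLOC_NONE;

	/* Low but not critical: a foreground caller defers to the mapwriter. */
	if (!background_mode && remaining > hard_low)
		return PREALLOC_WAKE_WRITER;

	/* Another process is already extending this fork. */
	if (in_progress)
		return background_mode ? PREALLOC_NONE : PREALLOC_WAKE_WRITER;

	return PREALLOC_DO;
}

/* New capacity target: one batch beyond whichever of the two is larger. */
static uint64_t
prealloc_target(uint32_t capacity, uint32_t next, uint32_t batch_blocks)
{
	uint64_t	from_capacity = (uint64_t) capacity + batch_blocks;
	uint64_t	from_next = (uint64_t) next + batch_blocks;

	return (from_capacity > from_next) ? from_capacity : from_next;
}
```

With the main-fork defaults from this patch (low 512, hard 128, batch 1024), a foreground caller that sees 400 free blocks only wakes the mapwriter, while one that sees 100 free blocks preallocates by itself.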

---
src/backend/common.mk | 2 +-
src/backend/postmaster/Makefile | 5 +
src/backend/postmaster/bgworker.c | 10 +
src/backend/postmaster/mapwriter.c | 184 ++++++++++
src/backend/postmaster/meson.build | 6 +
src/backend/postmaster/postmaster.c | 7 +
src/backend/storage/map/Makefile | 1 +
src/backend/storage/map/map.c | 54 +++
src/backend/storage/map/mapbgproc.c | 323 ++++++++++++++++++
src/backend/storage/map/mapclock.c | 5 +
src/backend/storage/map/mapflush.c | 6 +-
src/backend/storage/map/mapinit.c | 10 +
src/backend/storage/map/mapsuper.c | 17 +-
src/backend/storage/map/meson.build | 1 +
src/backend/storage/smgr/umbra.c | 9 +-
src/backend/storage/smgr/umfile.c | 242 +++++++++++++
.../utils/activity/wait_event_names.txt | 2 +
src/backend/utils/init/postinit.c | 16 +-
src/backend/utils/misc/guc_parameters.dat | 127 +++++++
src/backend/utils/misc/guc_tables.c | 2 +
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/include/postmaster/mapwriter.h | 24 ++
src/include/storage/map.h | 14 +
src/include/storage/map_internal.h | 7 +
src/include/storage/mapsuper_internal.h | 4 +
src/include/storage/umfile.h | 3 +
src/test/recovery/meson.build | 2 +
.../t/055_umbra_mapwriter_activity.pl | 56 +++
.../recovery/t/073_umbra_preallocate_guc.pl | 74 ++++
29 files changed, 1204 insertions(+), 11 deletions(-)
create mode 100644 src/backend/postmaster/mapwriter.c
create mode 100644 src/backend/storage/map/mapbgproc.c
create mode 100644 src/include/postmaster/mapwriter.h
create mode 100644 src/test/recovery/t/055_umbra_mapwriter_activity.pl
create mode 100644 src/test/recovery/t/073_umbra_preallocate_guc.pl

diff --git a/src/backend/common.mk b/src/backend/common.mk
index 61861f5c7e..aacdf0c702 100644
--- a/src/backend/common.mk
+++ b/src/backend/common.mk
@@ -17,7 +17,7 @@ ifneq ($(subdir), src/backend)
 all: $(subsysfilename)
 endif

-objfiles.txt: Makefile $(SUBDIROBJS) $(OBJS)
+objfiles.txt: Makefile $(top_builddir)/src/Makefile.global $(SUBDIROBJS) $(OBJS)
 # Don't rebuild the list if only the OBJS have changed.
 	$(if $(filter-out $(OBJS),$?),( $(if $(SUBDIROBJS),cat $(SUBDIROBJS); )echo $(addprefix $(subdir)/,$(OBJS)) ) >$@,touch $@)

diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 55044b2bc6..05cb330024 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -30,4 +30,9 @@ OBJS = \
 	walsummarizer.o \
 	walwriter.o

+ifeq ($(with_umbra), yes)
+OBJS += \
+	mapwriter.o
+endif
+
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 3914d22a51..45f0abf94a 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -20,6 +20,9 @@
 #include "port/atomics.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/datachecksum_state.h"
+#ifdef USE_UMBRA
+#include "postmaster/mapwriter.h"
+#endif
 #include "postmaster/postmaster.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
@@ -167,6 +170,13 @@ static const struct
 		.fn_name = "DataChecksumsWorkerMain",
 		.fn_addr = DataChecksumsWorkerMain
 	}
+#ifdef USE_UMBRA
+	,
+	{
+		.fn_name = "MapWriterMain",
+		.fn_addr = MapWriterMain
+	}
+#endif
 };

 /* Private functions. */
diff --git a/src/backend/postmaster/mapwriter.c b/src/backend/postmaster/mapwriter.c
new file mode 100644
index 0000000000..e659b6be94
--- /dev/null
+++ b/src/backend/postmaster/mapwriter.c
@@ -0,0 +1,184 @@
+/*-------------------------------------------------------------------------
+ *
+ * mapwriter.c
+ *	  Umbra map writer background worker.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/mapwriter.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <signal.h>
+#include <unistd.h>
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgworker.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/mapwriter.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/map.h"
+#include "storage/procnumber.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+#define MAPWRITER_HIBERNATE_FACTOR 50
+
+int			MapWriterDelay = 200;
+int			MapWriterMaxPages = 100;
+int			MapWriterPreallocMaxRelations = 32;
+double		MapWriterLRUMultiplier = 2.0;
+
+static void
+MapWriterExitCallback(int code, Datum arg)
+{
+	(void) code;
+	(void) arg;
+	MapStrategyNotifyWriter(INVALID_PROC_NUMBER);
+}
+
+void
+MapBackgroundWorkersRegister(void)
+{
+	BackgroundWorker bgw;
+
+	memset(&bgw, 0, sizeof(bgw));
+	bgw.bgw_flags = BGWORKER_SHMEM_ACCESS |
+		BGWORKER_BACKEND_DATABASE_CONNECTION;
+	bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
+	snprintf(bgw.bgw_function_name, BGW_MAXLEN, "MapWriterMain");
+	snprintf(bgw.bgw_name, BGW_MAXLEN, "Umbra mapwriter");
+	snprintf(bgw.bgw_type, BGW_MAXLEN, "map writer");
+	bgw.bgw_restart_time = 5;
+	bgw.bgw_notify_pid = 0;
+	bgw.bgw_main_arg = (Datum) 0;
+	RegisterBackgroundWorker(&bgw);
+}
+
+void
+MapWriterMain(Datum arg)
+{
+	sigjmp_buf	local_sigjmp_buf;
+	MemoryContext mapwriter_context;
+	bool		prev_hibernate = false;
+
+	(void) arg;
+	before_shmem_exit(MapWriterExitCallback, 0);
+
+	pqsignal(SIGHUP, SignalHandlerForConfigReload);
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+	pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+	pqsignal(SIGUSR2, SIG_IGN);
+	pqsignal(SIGCHLD, SIG_DFL);
+
+	BackgroundWorkerUnblockSignals();
+	BackgroundWorkerInitializeConnectionByOid(InvalidOid, InvalidOid, 0);
+
+	mapwriter_context = AllocSetContextCreate(TopMemoryContext,
+											  "Map Writer",
+											  ALLOCSET_DEFAULT_SIZES);
+	MemoryContextSwitchTo(mapwriter_context);
+
+	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+	{
+		error_context_stack = NULL;
+		HOLD_INTERRUPTS();
+		EmitErrorReport();
+
+		LWLockReleaseAll();
+		ConditionVariableCancelSleep();
+		pgstat_report_wait_end();
+		MapAbortBufferIO();
+		MapStrategyNotifyWriter(INVALID_PROC_NUMBER);
+		MapBackendExitCleanup();
+		AtEOXact_Buffers(false);
+		AtEOXact_SMgr();
+		AtEOXact_Files(false);
+		AtEOXact_HashTables(false);
+
+		MemoryContextSwitchTo(mapwriter_context);
+		FlushErrorState();
+		MemoryContextReset(mapwriter_context);
+		RESUME_INTERRUPTS();
+
+		pg_usleep(1000000L);
+		smgrreleaseall();
+	}
+
+	PG_exception_stack = &local_sigjmp_buf;
+
+	for (;;)
+	{
+		uint32		recent_alloc = 0;
+		int			target_pages = 0;
+		int			cleaned = 0;
+		int			prealloc_ops = 0;
+		bool		can_hibernate = false;
+
+		ResetLatch(MyLatch);
+		ProcessMainLoopInterrupts();
+
+		(void) MapSyncStart(NULL, &recent_alloc);
+		if (recent_alloc > 0 && MapWriterPreallocMaxRelations > 0)
+			prealloc_ops = MapPreallocStep(MapWriterPreallocMaxRelations);
+
+		if (MapWriterMaxPages > 0)
+		{
+			int			idle_pages;
+			double		target_f;
+
+			idle_pages = Max(1, MapWriterMaxPages / 8);
+			if (recent_alloc > 0)
+			{
+				target_f = recent_alloc * MapWriterLRUMultiplier;
+				target_pages = (int) (target_f + 0.5);
+			}
+			else
+				target_pages = idle_pages;
+
+			target_pages = Min(MapWriterMaxPages, Max(1, target_pages));
+			cleaned = MapBgWriterFlush(target_pages);
+		}
+
+		can_hibernate = (recent_alloc == 0 &&
+						 cleaned == 0 &&
+						 prealloc_ops == 0);
+
+		if (FirstCallSinceLastCheckpoint())
+			smgrreleaseall();
+
+		MapStrategyNotifyWriter(MyProcNumber);
+		if (WaitLatch(MyLatch,
+					  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					  MapWriterDelay,
+					  WAIT_EVENT_MAPWRITER_MAIN) == WL_TIMEOUT &&
+			can_hibernate &&
+			prev_hibernate)
+		{
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 MapWriterDelay * MAPWRITER_HIBERNATE_FACTOR,
+							 WAIT_EVENT_MAPWRITER_HIBERNATE);
+		}
+		MapStrategyNotifyWriter(INVALID_PROC_NUMBER);
+		prev_hibernate = can_hibernate;
+	}
+}
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index 6cba23bbee..0a30057703 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -18,3 +18,9 @@ backend_sources += files(
   'walsummarizer.c',
   'walwriter.c',
 )
+
+if get_option('umbra').enabled()
+  backend_sources += files(
+    'mapwriter.c',
+  )
+endif
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ae82974700..b940fca13b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -102,6 +102,9 @@
 #include "port/pg_getopt_ctx.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
+#ifdef USE_UMBRA
+#include "postmaster/mapwriter.h"
+#endif
 #include "postmaster/pgarch.h"
 #include "postmaster/postmaster.h"
 #include "postmaster/syslogger.h"
@@ -922,6 +925,10 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	ApplyLauncherRegister();

+#ifdef USE_UMBRA
+	MapBackgroundWorkersRegister();
+#endif
+
 	/*
 	 * Register the shared memory needs of all core subsystems.
 	 */
diff --git a/src/backend/storage/map/Makefile b/src/backend/storage/map/Makefile
index 94ae1c1b72..6fffc43e59 100644
--- a/src/backend/storage/map/Makefile
+++ b/src/backend/storage/map/Makefile
@@ -16,6 +16,7 @@ OBJS = \
 	map.o \
 	mapinit.o \
 	mapbuf.o \
+	mapbgproc.o \
 	mapflush.o \
 	mapclock.o \
 	mapinflight.o \
diff --git a/src/backend/storage/map/map.c b/src/backend/storage/map/map.c
index 0dad150b2b..6793db8671 100644
--- a/src/backend/storage/map/map.c
+++ b/src/backend/storage/map/map.c
@@ -105,6 +105,9 @@ static bool MapMapPageWithinLogicalRange(UmbraFileContext *map_ctx,
 										 RelFileLocator rnode,
 										 ForkNumber forknum,
 										 BlockNumber map_blkno);
+bool MapForkPreallocSettings(ForkNumber forknum, BlockNumber *soft_low,
+							 BlockNumber *hard_low,
+							 BlockNumber *batch_blocks);
 static MapCachedLookupResult MapTryLookupCachedEntry(RelFileLocator rnode,
 													 ForkNumber forknum,
 													 BlockNumber map_blkno,
@@ -142,6 +145,47 @@ MapResetAllTruncatePreloads(void)
 	}
 }

+bool
+MapForkPreallocSettings(ForkNumber forknum, BlockNumber *soft_low,
+						BlockNumber *hard_low, BlockNumber *batch_blocks)
+{
+	int			low;
+	int			hard;
+	int			batch;
+
+	switch (forknum)
+	{
+		case MAIN_FORKNUM:
+			low = map_prealloc_main_low;
+			hard = map_prealloc_main_hard;
+			batch = map_prealloc_main_batch;
+			break;
+		case FSM_FORKNUM:
+			low = map_prealloc_fsm_low;
+			hard = map_prealloc_fsm_hard;
+			batch = map_prealloc_fsm_batch;
+			break;
+		case VISIBILITYMAP_FORKNUM:
+			low = map_prealloc_vm_low;
+			hard = map_prealloc_vm_hard;
+			batch = map_prealloc_vm_batch;
+			break;
+		default:
+			return false;
+	}
+
+	if (low <= 0 || batch <= 0)
+		return false;
+	if (hard <= 0)
+		hard = 1;
+	if (hard > low)
+		hard = low;
+
+	*soft_low = (BlockNumber) low;
+	*hard_low = (BlockNumber) hard;
+	*batch_blocks = (BlockNumber) batch;
+	return true;
+}

 static bool
 MapTruncateEntryRange(ForkNumber forknum, BlockNumber n_lblknos,
@@ -728,6 +772,8 @@ MapReserveFreshPblkno(UmbraFileContext *map_ctx, RelFileLocator rnode,
 	if (MapReserveNextPblkno(map_ctx, rnode, forknum, lblkno,
 							 new_pblkno, false))
 	{
+		if (!InRecovery)
+			(void) MapMaybePreallocateFork(map_ctx, rnode, forknum, false);
 		return true;
 	}

@@ -1418,6 +1464,14 @@ void MapGetNewPbkno(UmbraFileContext *map_ctx, RelFileLocator rnode, ForkNumber
 				 "failed to reserve physical block for relation %u/%u/%u fork %d blk %u",
 				 rnode.spcOid, rnode.dbOid, rnode.relNumber, forknum, lblkno);
 	}
+
+	/*
+	 * Foreground/background coordination:
+	 * - low-but-not-critical watermark: wake mapwriter
+	 * - critical watermark: foreground performs one-shot preallocation
+	 */
+	if (!InRecovery)
+		(void) MapMaybePreallocateFork(map_ctx, rnode, forknum, false);
 }

 /*
diff --git a/src/backend/storage/map/mapbgproc.c b/src/backend/storage/map/mapbgproc.c
new file mode 100644
index 0000000000..3bb167bae9
--- /dev/null
+++ b/src/backend/storage/map/mapbgproc.c
@@ -0,0 +1,323 @@
+/*-------------------------------------------------------------------------
+ *
+ * mapbgproc.c
+ *	  MAP background maintenance and coordination.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogutils.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+#include "storage/map.h"
+#include "storage/map_internal.h"
+#include "storage/mapsuper_internal.h"
+#include "storage/proc.h"
+
+static bool MapForkNeedsPrealloc(const MapSuperEntry *entry, ForkNumber forknum,
+								 bool background_mode);
+
+uint32
+MapAllocPressurePeek(void)
+{
+	return pg_atomic_read_u32(&MapShared->num_allocs);
+}
+
+void
+MapStrategyNotifyWriter(int mapwriter_procno)
+{
+	SpinLockAcquire(&MapShared->clock_lock);
+	MapShared->mapwriter_procno = mapwriter_procno;
+	SpinLockRelease(&MapShared->clock_lock);
+}
+
+void
+MapWakeWriter(void)
+{
+	int			mapwriter_procno = -1;
+
+	SpinLockAcquire(&MapShared->clock_lock);
+	mapwriter_procno = MapShared->mapwriter_procno;
+	if (mapwriter_procno != -1)
+		MapShared->mapwriter_procno = -1;
+	SpinLockRelease(&MapShared->clock_lock);
+
+	if (mapwriter_procno != -1)
+		SetLatch(&ProcGlobal->allProcs[mapwriter_procno].procLatch);
+}
+
+bool
+MapMaybePreallocateFork(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						ForkNumber forknum, bool background_mode)
+{
+	MapSuperEntry *entry;
+	BlockNumber		soft_low;
+	BlockNumber		hard_low;
+	BlockNumber		batch_blocks;
+	BlockNumber		next;
+	BlockNumber		capacity;
+	BlockNumber		remaining;
+	BlockNumber		target_nblocks;
+	uint32			prealloc_flag;
+	bool			prealloc_ok = false;
+	bool			started = false;
+	uint64			target64;
+
+	if (!MapForkHasMappedState(forknum))
+		return false;
+
+	if (!MapForkPreallocSettings(forknum, &soft_low, &hard_low, &batch_blocks))
+		return false;
+
+	if (!MapSBlockEnsureLoaded(map_ctx, rnode))
+		return false;
+
+	prealloc_flag = MapSuperPreallocFlag(forknum);
+	Assert(prealloc_flag != 0);
+
+	if (!MapSuperFindEntryLocked(rnode, LW_EXCLUSIVE, &entry))
+		return false;
+
+	if (!entry->in_use || (entry->flags & MAPSUPER_FLAG_VALID) == 0)
+	{
+		LWLockRelease(&entry->lock);
+		return false;
+	}
+
+	if ((entry->flags & MAPSUPER_FLAG_CORRUPT) ||
+		!MapSuperblockHasValidIdentity(&entry->super) ||
+		((entry->flags & MAPSUPER_FLAG_DIRTY) == 0 &&
+		 !MapSuperblockCheckCRC(&entry->super)))
+	{
+		LWLockRelease(&entry->lock);
+		if (!InRecovery)
+			MapSBlockReportCorrupt(rnode, "invalid identity or CRC");
+		return false;
+	}
+
+	Assert(MapNormalizeForkBlockCount(forknum,
+									  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																		forknum)) <=
+		   MapSuperGetReservedNextFree(entry, forknum));
+	next = MapSuperGetReservedNextFree(entry, forknum);
+	capacity = MapSuperblockGetPhysCapacity(&entry->super, forknum);
+	capacity = MapNormalizeForkBlockCount(forknum, capacity);
+
+	if (next < soft_low)
+	{
+		LWLockRelease(&entry->lock);
+		return false;
+	}
+
+	remaining = (capacity > next) ? (capacity - next) : 0;
+	if (remaining > soft_low)
+	{
+		LWLockRelease(&entry->lock);
+		return false;
+	}
+
+	if (!background_mode && remaining > hard_low)
+	{
+		LWLockRelease(&entry->lock);
+		MapWakeWriter();
+		return false;
+	}
+
+	if ((entry->runtime_flags & prealloc_flag) != 0)
+	{
+		LWLockRelease(&entry->lock);
+		if (!background_mode)
+			MapWakeWriter();
+		return false;
+	}
+
+	target64 = Max((uint64) capacity + (uint64) batch_blocks,
+				   (uint64) next + (uint64) batch_blocks);
+	if (target64 > (uint64) (InvalidBlockNumber - 1))
+	{
+		LWLockRelease(&entry->lock);
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("cannot preallocate physical blocks beyond %u for relation %u/%u/%u fork %d",
+						InvalidBlockNumber - 1,
+						rnode.spcOid, rnode.dbOid, rnode.relNumber, forknum)));
+	}
+	target_nblocks = (BlockNumber) target64;
+
+	if (target_nblocks <= capacity)
+	{
+		LWLockRelease(&entry->lock);
+		return false;
+	}
+
+	entry->runtime_flags |= prealloc_flag;
+	LWLockRelease(&entry->lock);
+	started = true;
+
+	PG_TRY();
+	{
+		if (umfile_ctx_fork_exists(map_ctx, forknum, UMFILE_EXISTS_SPARSE))
+			prealloc_ok = umfile_ctx_preallocate_blocks(map_ctx, forknum,
+														UMFILE_NBLOCKS_SPARSE,
+														target_nblocks);
+	}
+	PG_CATCH();
+	{
+		if (started && MapSuperFindEntryLocked(rnode, LW_EXCLUSIVE, &entry))
+		{
+			entry->runtime_flags &= ~prealloc_flag;
+			LWLockRelease(&entry->lock);
+		}
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	if (MapSuperFindEntryLocked(rnode, LW_EXCLUSIVE, &entry))
+	{
+		if (prealloc_ok &&
+			entry->in_use &&
+			(entry->flags & MAPSUPER_FLAG_VALID) != 0 &&
+			MapNormalizeForkBlockCount(forknum,
+									   MapSuperblockGetPhysCapacity(&entry->super,
+																   forknum)) < target_nblocks)
+		{
+			XLogRecPtr	map_lsn = GetXLogWriteRecPtr();
+
+			MapSuperblockSetPhysCapacity(&entry->super, forknum, target_nblocks);
+			MapSuperblockSetLastUpdatedLSN(&entry->super, map_lsn);
+			entry->page_lsn = map_lsn;
+			entry->flags |= MAPSUPER_FLAG_DIRTY;
+		}
+		entry->runtime_flags &= ~prealloc_flag;
+		LWLockRelease(&entry->lock);
+	}
+
+	if (!background_mode && !prealloc_ok)
+		MapWakeWriter();
+
+	return prealloc_ok;
+}
+
+static bool
+MapForkNeedsPrealloc(const MapSuperEntry *entry, ForkNumber forknum,
+					 bool background_mode)
+{
+	BlockNumber		soft_low;
+	BlockNumber		hard_low;
+	BlockNumber		batch_blocks;
+	BlockNumber		next;
+	BlockNumber		capacity;
+	BlockNumber		remaining;
+	uint32			prealloc_flag;
+
+	if (!MapForkHasMappedState(forknum))
+		return false;
+
+	if (!MapForkPreallocSettings(forknum, &soft_low, &hard_low, &batch_blocks))
+		return false;
+
+	if (!MapSuperForkExists(&entry->super, forknum))
+		return false;
+
+	prealloc_flag = MapSuperPreallocFlag(forknum);
+	Assert(prealloc_flag != 0);
+
+	Assert(MapNormalizeForkBlockCount(forknum,
+									  MapSuperblockGetNextFreePhysBlock(&entry->super,
+																		forknum)) <=
+		   MapSuperGetReservedNextFree(entry, forknum));
+	next = MapSuperGetReservedNextFree(entry, forknum);
+	capacity = MapNormalizeForkBlockCount(forknum,
+										  MapSuperblockGetPhysCapacity(&entry->super,
+																   forknum));
+
+	if (next < soft_low)
+		return false;
+
+	remaining = (capacity > next) ? (capacity - next) : 0;
+	if (remaining > soft_low)
+		return false;
+
+	if (!background_mode && remaining > hard_low)
+		return false;
+
+	if ((entry->runtime_flags & prealloc_flag) != 0)
+		return false;
+
+	return true;
+}
+
+int
+MapPreallocStep(int max_relations)
+{
+	static int	scan_slot = 0;
+	int			max_scan;
+	int			scanned = 0;
+	int			visited = 0;
+	int			prealloc_ops = 0;
+
+	if (InRecovery || max_relations <= 0 || MapSuperCapacity <= 0)
+		return 0;
+
+	max_scan = Min(MapSuperCapacity, Max(64, max_relations * 8));
+
+	while (scanned < max_scan && visited < max_relations)
+	{
+		MapSuperEntry *entry;
+		RelFileLocator	rnode;
+		RelFileLocatorBackend rlocator;
+		UmbraFileContext *ctx;
+		bool		prealloc_main;
+		bool		prealloc_fsm;
+		bool		prealloc_vm;
+
+		entry = MapSuperEntryBySlot(scan_slot);
+		scan_slot = (scan_slot + 1) % MapSuperCapacity;
+		scanned++;
+
+		LWLockAcquire(&entry->lock, LW_SHARED);
+		if (!entry->in_use)
+		{
+			LWLockRelease(&entry->lock);
+			continue;
+		}
+		rnode = entry->key.rnode;
+		if ((entry->flags & MAPSUPER_FLAG_VALID) == 0 ||
+			(entry->flags & MAPSUPER_FLAG_CORRUPT) != 0 ||
+			!MapSuperblockHasValidIdentity(&entry->super) ||
+			((entry->flags & MAPSUPER_FLAG_DIRTY) == 0 &&
+			 !MapSuperblockCheckCRC(&entry->super)))
+		{
+			LWLockRelease(&entry->lock);
+			continue;
+		}
+		prealloc_main = MapForkNeedsPrealloc(entry, MAIN_FORKNUM, true);
+		prealloc_fsm = MapForkNeedsPrealloc(entry, FSM_FORKNUM, true);
+		prealloc_vm = MapForkNeedsPrealloc(entry, VISIBILITYMAP_FORKNUM, true);
+		LWLockRelease(&entry->lock);
+		visited++;
+
+		if (!prealloc_main && !prealloc_fsm && !prealloc_vm)
+			continue;
+
+		rlocator.locator = rnode;
+		rlocator.backend = INVALID_PROC_NUMBER;
+		ctx = umfile_ctx_acquire(rlocator);
+		if (ctx == NULL)
+			continue;
+
+		if (prealloc_main &&
+			MapMaybePreallocateFork(ctx, rnode, MAIN_FORKNUM, true))
+			prealloc_ops++;
+		if (prealloc_fsm &&
+			MapMaybePreallocateFork(ctx, rnode, FSM_FORKNUM, true))
+			prealloc_ops++;
+		if (prealloc_vm &&
+			MapMaybePreallocateFork(ctx, rnode, VISIBILITYMAP_FORKNUM, true))
+			prealloc_ops++;
+	}
+
+	return prealloc_ops;
+}
diff --git a/src/backend/storage/map/mapclock.c b/src/backend/storage/map/mapclock.c
index 3ccdbb2310..5e4eb5dac7 100644
--- a/src/backend/storage/map/mapclock.c
+++ b/src/backend/storage/map/mapclock.c
@@ -273,6 +273,11 @@ MapClockGetBuffer(void)
 	uint32      local_buf_state;
 	int         num_slots = MapShared->num_slots;

+	/*
+	 * If mapwriter asked for allocation notification, wake it up.
+	 */
+	MapWakeWriter();
+
 	/*
 	 * First, check if there's a buffer on the free list.
 	 */
diff --git a/src/backend/storage/map/mapflush.c b/src/backend/storage/map/mapflush.c
index def1943dee..b13991e27a 100644
--- a/src/backend/storage/map/mapflush.c
+++ b/src/backend/storage/map/mapflush.c
@@ -226,7 +226,7 @@ map_flush_buffer_target_comparator(const MapFlushBufferTarget *a,
 void
 MapPreCheckpoint(void)
 {
-	/* no-op: checkpoint work is handled by MapCheckpoint(). */
+	/* no-op: reclaim is handled by sync request queues. */
 }

 /*
@@ -346,7 +346,7 @@ MapCheckpointDatabaseTablespaces(Oid dbid, int ntablespaces,
 void
 MapPostCheckpoint(void)
 {
-	/* no-op: checkpoint work is handled by MapCheckpoint(). */
+	/* no-op: reclaim is handled by sync request queues. */
 }

 int
@@ -355,7 +355,7 @@ MapBgWriterFlush(int max_pages)
 	if (max_pages <= 0)
 		return 0;

-	/* Non-checkpoint flushes regular MAP pages only; superblock is checkpoint-owned. */
+	/* mapwriter flushes regular MAP pages only; superblock is checkpoint-owned. */
 	return MapFlushDirtyBuffers(max_pages, false);
 }

diff --git a/src/backend/storage/map/mapinit.c b/src/backend/storage/map/mapinit.c
index c9ddd12ff0..c30057cf04 100644
--- a/src/backend/storage/map/mapinit.c
+++ b/src/backend/storage/map/mapinit.c
@@ -26,6 +26,15 @@ int			map_buffers = 1024;	/* Number of map buffer slots */
  * relations do not churn through repeated ensure/load cycles.
  */
 int			map_superblocks = 262144;
+int			map_prealloc_main_low = 512;	/* 4MB in 8k blocks */
+int			map_prealloc_main_hard = 128;	/* 1MB in 8k blocks */
+int			map_prealloc_main_batch = 1024; /* 8MB in 8k blocks */
+int			map_prealloc_fsm_low = 64;	/* 512kB in 8k blocks */
+int			map_prealloc_fsm_hard = 16;	/* 128kB in 8k blocks */
+int			map_prealloc_fsm_batch = 128; /* 1MB in 8k blocks */
+int			map_prealloc_vm_low = 64;	/* 512kB in 8k blocks */
+int			map_prealloc_vm_hard = 16;	/* 128kB in 8k blocks */
+int			map_prealloc_vm_batch = 128; /* 1MB in 8k blocks */

 /* Shared memory pointer */
 MapSharedData *MapShared = NULL;
@@ -105,6 +114,7 @@ MapShmemInit(void *arg)

 	MapShared->num_slots = map_buffers;
 	MapShared->first_free_buffer = 0;
+	MapShared->mapwriter_procno = -1;
 	pg_atomic_init_u32(&MapShared->next_victim_buffer, 0);
 	pg_atomic_init_u32(&MapShared->num_allocs, 0);
 	MapShared->complete_passes = 0;
diff --git a/src/backend/storage/map/mapsuper.c b/src/backend/storage/map/mapsuper.c
index 07ac7b39c6..e3e9421566 100644
--- a/src/backend/storage/map/mapsuper.c
+++ b/src/backend/storage/map/mapsuper.c
@@ -838,6 +838,21 @@ MapSuperForkExists(const MapSuperblock *super, ForkNumber forknum)
 	return MapSuperblockGetLogicalNblocks(super, forknum) != InvalidBlockNumber;
 }

+uint32
+MapSuperPreallocFlag(ForkNumber forknum)
+{
+	switch (forknum)
+	{
+		case MAIN_FORKNUM:
+			return MAPSUPER_RUNTIME_FLAG_PREALLOC_MAIN;
+		case FSM_FORKNUM:
+			return MAPSUPER_RUNTIME_FLAG_PREALLOC_FSM;
+		case VISIBILITYMAP_FORKNUM:
+			return MAPSUPER_RUNTIME_FLAG_PREALLOC_VM;
+		default:
+			return 0;
+	}
+}

 static uint32
 MapSuperExtendingFlag(ForkNumber forknum)
@@ -1274,7 +1289,7 @@ MapSBlockInit(UmbraFileContext *map_ctx, RelFileLocator rnode, XLogRecPtr map_ls

 	/*
 	 * Persist superblock immediately so later backends in bootstrap/initdb can
-	 * read block 0 even before checkpoint gets a chance to flush.
+	 * read block 0 even before checkpoint/mapwriter gets a chance to flush.
 	 * This keeps create-time O(1): only one 512-byte sector is written.
 	 */
 	write_super = entry->super;
diff --git a/src/backend/storage/map/meson.build b/src/backend/storage/map/meson.build
index bdaa0dd14a..5a9f685e53 100644
--- a/src/backend/storage/map/meson.build
+++ b/src/backend/storage/map/meson.build
@@ -4,6 +4,7 @@ backend_sources += files(
   'map.c',
   'mapinit.c',
   'mapbuf.c',
+  'mapbgproc.c',
   'mapflush.c',
   'mapclock.c',
   'mapinflight.c',
diff --git a/src/backend/storage/smgr/umbra.c b/src/backend/storage/smgr/umbra.c
index f382d56c34..61c74a2378 100644
--- a/src/backend/storage/smgr/umbra.c
+++ b/src/backend/storage/smgr/umbra.c
@@ -309,7 +309,7 @@ UmMetadataImmediateSync(SMgrRelation reln)
 void
 UmMetadataRegisterSync(SMgrRelation reln)
 {
-	umimmedsync(reln, UMBRA_METADATA_FORKNUM);
+	umfile_registersync(um_ctx_acquire(reln), UMBRA_METADATA_FORKNUM);
 }

 void
@@ -2527,12 +2527,17 @@ umimmedsync(SMgrRelation reln, ForkNumber forknum)
 void
 umregistersync(SMgrRelation reln, ForkNumber forknum)
 {
-	umimmedsync(reln, forknum);
+	umfile_registersync(um_ctx_acquire(reln), forknum);
 }

 bool
 umpreparependingsync(SMgrRelation reln)
 {
+	/*
+	 * Skip-WAL relations write data directly first and publish Umbra MAP
+	 * metadata only at the durable transition boundary. Rebuild MAP and
+	 * superblock before the relation enters the fsync path.
+	 */
 	if (RelFileLocatorSkippingWAL(reln->smgr_rlocator.locator))
 		UmRebuildMapAndSuperblockForSkipWAL(reln);

diff --git a/src/backend/storage/smgr/umfile.c b/src/backend/storage/smgr/umfile.c
index 63afc8546c..a88e54c46b 100644
--- a/src/backend/storage/smgr/umfile.c
+++ b/src/backend/storage/smgr/umfile.c
@@ -14,6 +14,9 @@
 #include <unistd.h>
 #include <fcntl.h>
 #include <sys/uio.h>
+#if defined(__linux__)
+#include <sys/syscall.h>
+#endif

 #include "access/xlogutils.h"
 #include "catalog/pg_tablespace_d.h"
@@ -86,6 +89,8 @@ static BlockNumber umfile_nblocks_dense(UmbraFileContext *ctx,
 										RelFileLocatorBackend rlocator,
 										ForkNumber forknum);
 static BlockNumber umfile_nblocks_in_seg(File vfd);
+static bool umfile_preallocate_fd(File fd, off_t target_bytes);
+static bool umfile_preallocate_errno_is_unsupported(int err);
 static bool umfile_collect_existing_segnos_by_path(const char *seg0path,
 												   BlockNumber **segnos_out,
 												   int *nsegnos_out);
@@ -314,6 +319,69 @@ umfile_ctx_extend(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blkno,
 	umfile_extend(ctx, forknum, blkno, buffer, true /* skipFsync */ );
 }

+bool
+umfile_ctx_preallocate_blocks(UmbraFileContext *ctx, ForkNumber forknum,
+							  UmFileNblocksMode mode,
+							  BlockNumber target_nblocks)
+{
+	BlockNumber	nblocks;
+	BlockNumber	cur_nblocks;
+
+	if (ctx == NULL)
+		return false;
+
+	if (!umfile_exists(ctx, forknum,
+					   mode == UMFILE_NBLOCKS_SPARSE ?
+					   UMFILE_EXISTS_SPARSE :
+					   UMFILE_EXISTS_DENSE))
+		return false;
+
+	nblocks = umfile_nblocks(ctx, forknum, mode);
+	if (target_nblocks <= nblocks)
+		return true;
+
+	if (target_nblocks > (uint64) MaxBlockNumber + 1)
+		return false;
+
+	/*
+	 * Keep preallocation as a capacity operation: make the target segment
+	 * BLCKSZ-addressable up to target_nblocks, but do not write page content.
+	 * Later page writes may fill holes created by this step.
+	 */
+	cur_nblocks = nblocks;
+	while (cur_nblocks < target_nblocks)
+	{
+		BlockNumber	target_blkno;
+		BlockNumber	targetseg;
+		BlockNumber	targetseg_nblocks;
+		uint64		seg_start;
+		uint64		seg_end;
+		off_t		target_bytes;
+		UmfdVec	   *v;
+
+		targetseg = cur_nblocks / ((BlockNumber) RELSEG_SIZE);
+		seg_start = (uint64) targetseg * (uint64) RELSEG_SIZE;
+		seg_end = seg_start + (uint64) RELSEG_SIZE;
+		if (seg_end > (uint64) target_nblocks)
+			seg_end = (uint64) target_nblocks;
+
+		targetseg_nblocks = (BlockNumber) (seg_end - seg_start);
+		target_blkno = (BlockNumber) (seg_start + targetseg_nblocks - 1);
+		target_bytes = (off_t) targetseg_nblocks * BLCKSZ;
+
+		v = umfile_getseg(ctx, ctx->rlocator, forknum, target_blkno,
+						  true /* skipFsync */,
+						  UM_EXTENSION_CREATE,
+						  RelFileLocatorBackendIsTemp(ctx->rlocator));
+		if (!umfile_preallocate_fd(v->umfd_vfd, target_bytes))
+			return false;
+
+		cur_nblocks = (BlockNumber) seg_end;
+	}
+
+	return true;
+}
+
 void
 umfile_ctx_prefetch(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blkno)
 {
@@ -1566,6 +1634,180 @@ umfile_extend(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
 		umfile_register_dirty_seg(rlocator, false, forknum, v);
 }

+/*
+ * Reserve bytes for a relation segment without writing page contents.
+ *
+ * Return true only if the whole range up to target_bytes is backed by a real
+ * preallocation primitive. Unsupported filesystems return false so callers do
+ * not publish capacity that is only a logical EOF extension.
+ *
+ * Linux uses the fallocate syscall directly so we don't inherit glibc's
+ * posix_fallocate()->userspace zero-fill fallback. macOS uses F_PREALLOCATE,
+ * and other platforms may use posix_fallocate().
+ */
+static bool
+umfile_preallocate_fd(File fd, off_t target_bytes)
+{
+	off_t		current_bytes;
+	off_t		delta_bytes;
+	int			rawfd;
+
+	current_bytes = FileSize(fd);
+	if (current_bytes < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not determine file size for \"%s\": %m",
+						FilePathName(fd))));
+
+	if (current_bytes >= target_bytes)
+		return true;
+
+	delta_bytes = target_bytes - current_bytes;
+	rawfd = FileGetRawDesc(fd);
+
+#if defined(__linux__)
+#ifdef SYS_fallocate
+	{
+		long		rc;
+
+retry_fallocate:
+		errno = 0;
+		pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_EXTEND);
+		rc = syscall(SYS_fallocate, rawfd, 0, current_bytes, delta_bytes);
+		pgstat_report_wait_end();
+
+		if (rc < 0)
+		{
+			if (errno == EINTR)
+				goto retry_fallocate;
+			if (umfile_preallocate_errno_is_unsupported(errno))
+				return false;
+
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not preallocate file \"%s\" to %llu bytes: %m",
+							FilePathName(fd),
+							(unsigned long long) target_bytes)));
+		}
+	}
+#else
+	return false;
+#endif
+#elif defined(__APPLE__) && defined(F_PREALLOCATE)
+	{
+		fstore_t	fst;
+		int			rc;
+
+		/*
+		 * F_PEOFPOSMODE interprets fst_length as newly allocated bytes beyond
+		 * current EOF, not the final file size. Passing target_bytes here would
+		 * over-reserve on repeated top-ups of the same segment.
+		 */
+		MemSet(&fst, 0, sizeof(fst));
+		fst.fst_flags = F_ALLOCATECONTIG | F_ALLOCATEALL;
+		fst.fst_posmode = F_PEOFPOSMODE;
+		fst.fst_offset = 0;
+		fst.fst_length = delta_bytes;
+
+retry_f_preallocate_contig:
+		errno = 0;
+		pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_EXTEND);
+		rc = fcntl(rawfd, F_PREALLOCATE, &fst);
+		pgstat_report_wait_end();
+		if (rc < 0 && errno == EINTR)
+			goto retry_f_preallocate_contig;
+
+		if (rc < 0)
+		{
+			fst.fst_flags = F_ALLOCATEALL;
+			fst.fst_bytesalloc = 0;
+
+retry_f_preallocate_all:
+			errno = 0;
+			pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_EXTEND);
+			rc = fcntl(rawfd, F_PREALLOCATE, &fst);
+			pgstat_report_wait_end();
+			if (rc < 0 && errno == EINTR)
+				goto retry_f_preallocate_all;
+
+			if (rc < 0)
+			{
+				if (umfile_preallocate_errno_is_unsupported(errno))
+					return false;
+
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not preallocate file \"%s\" to %llu bytes: %m",
+								FilePathName(fd),
+								(unsigned long long) target_bytes)));
+			}
+		}
+
+		if (fst.fst_bytesalloc < delta_bytes)
+			return false;
+	}
+#elif defined(HAVE_POSIX_FALLOCATE)
+	{
+		int			rc;
+
+retry_posix_fallocate:
+		pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_EXTEND);
+		rc = posix_fallocate(rawfd, current_bytes, delta_bytes);
+		pgstat_report_wait_end();
+
+		if (rc == EINTR)
+			goto retry_posix_fallocate;
+
+		if (rc != 0)
+		{
+			errno = rc;
+			if (umfile_preallocate_errno_is_unsupported(errno))
+				return false;
+
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not preallocate file \"%s\" to %llu bytes: %m",
+							FilePathName(fd),
+							(unsigned long long) target_bytes)));
+		}
+	}
+#else
+	return false;
+#endif
+
+	current_bytes = FileSize(fd);
+	if (current_bytes < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not determine file size for \"%s\": %m",
+						FilePathName(fd))));
+	if (current_bytes < target_bytes &&
+		FileTruncate(fd, target_bytes, WAIT_EVENT_DATA_FILE_EXTEND) < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not extend preallocated file \"%s\" to %llu bytes: %m",
+						FilePathName(fd),
+						(unsigned long long) target_bytes)));
+
+	return true;
+}
+
+static bool
+umfile_preallocate_errno_is_unsupported(int err)
+{
+	if (err == EINVAL || err == EOPNOTSUPP)
+		return true;
+#ifdef ENOSYS
+	if (err == ENOSYS)
+		return true;
+#endif
+#ifdef ENOTSUP
+	if (err == ENOTSUP)
+		return true;
+#endif
+	return false;
+}
+
 void
 umfile_zeroextend(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blocknum,
 				  int nblocks, bool skipFsync)
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index a1de5a08d4..ec5e2eabf4 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -61,6 +61,8 @@ IO_WORKER_MAIN	"Waiting in main loop of IO Worker process."
 LOGICAL_APPLY_MAIN	"Waiting in main loop of logical replication apply process."
 LOGICAL_LAUNCHER_MAIN	"Waiting in main loop of logical replication launcher process."
 LOGICAL_PARALLEL_APPLY_MAIN	"Waiting in main loop of logical replication parallel apply process."
+MAPWRITER_HIBERNATE	"Waiting in Umbra map writer process, hibernating."
+MAPWRITER_MAIN	"Waiting in main loop of Umbra map writer process."
 RECOVERY_WAL_STREAM	"Waiting in main loop of startup process for WAL to arrive, during streaming recovery."
 REPLICATION_SLOTSYNC_MAIN	"Waiting in main loop of slot synchronization."
 REPLICATION_SLOTSYNC_SHUTDOWN	"Waiting for slot sync worker to shut down."
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 62476de48e..8f1d36de0a 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -823,10 +823,18 @@ InitPostgres(const char *in_dbname, Oid dboid,
 		before_shmem_exit(ShutdownXLOG, 0);
 	}

-		/*
-		 * Initialize the relation cache and the system catalog caches.  Note that
-		 * no catalog access happens here; we only set up the hashtable structure.
-		 * We must do this before starting a transaction because transaction abort
+	/*
+	 * Let the active storage manager register backend-local shutdown cleanup
+	 * after ShutdownXLOG. That way, standalone shutdown runs this cleanup
+	 * before the shutdown checkpoint, without exposing storage-manager-specific
+	 * details here.
+	 */
+	smgrregistershutdowncleanup();
+
+	/*
+	 * Initialize the relation cache and the system catalog caches.  Note that
+	 * no catalog access happens here; we only set up the hashtable structure.
+	 * We must do this before starting a transaction because transaction abort
 	 * would try to touch these hashtables.
 	 */
 	RelationCacheInitialize();
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 86c1eba5da..f3726be78d 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1932,6 +1932,133 @@
   max => 'MAX_KILOBYTES',
 },

+{ name => 'map_prealloc_fsm_batch', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Preallocation batch size in blocks for Umbra FSM fork.',
+  flags => 'GUC_UNIT_BLOCKS',
+  variable => 'map_prealloc_fsm_batch',
+  boot_val => '128',
+  min => '1',
+  max => 'INT_MAX / 2',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'map_prealloc_fsm_hard', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Hard low-water mark in blocks before foreground Umbra FSM writes preallocate directly.',
+  flags => 'GUC_UNIT_BLOCKS',
+  variable => 'map_prealloc_fsm_hard',
+  boot_val => '16',
+  min => '1',
+  max => 'INT_MAX / 2',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'map_prealloc_fsm_low', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Soft low-water mark in blocks before Umbra FSM preallocation is considered.',
+  flags => 'GUC_UNIT_BLOCKS',
+  variable => 'map_prealloc_fsm_low',
+  boot_val => '64',
+  min => '1',
+  max => 'INT_MAX / 2',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'map_prealloc_main_batch', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Preallocation batch size in blocks for Umbra main fork.',
+  flags => 'GUC_UNIT_BLOCKS',
+  variable => 'map_prealloc_main_batch',
+  boot_val => '1024',
+  min => '1',
+  max => 'INT_MAX / 2',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'map_prealloc_main_hard', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Hard low-water mark in blocks before foreground Umbra main-fork writes preallocate directly.',
+  flags => 'GUC_UNIT_BLOCKS',
+  variable => 'map_prealloc_main_hard',
+  boot_val => '128',
+  min => '1',
+  max => 'INT_MAX / 2',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'map_prealloc_main_low', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Soft low-water mark in blocks before Umbra main-fork preallocation is considered.',
+  flags => 'GUC_UNIT_BLOCKS',
+  variable => 'map_prealloc_main_low',
+  boot_val => '512',
+  min => '1',
+  max => 'INT_MAX / 2',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'map_prealloc_vm_batch', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Preallocation batch size in blocks for Umbra VM fork.',
+  flags => 'GUC_UNIT_BLOCKS',
+  variable => 'map_prealloc_vm_batch',
+  boot_val => '128',
+  min => '1',
+  max => 'INT_MAX / 2',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'map_prealloc_vm_hard', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Hard low-water mark in blocks before foreground Umbra VM writes preallocate directly.',
+  flags => 'GUC_UNIT_BLOCKS',
+  variable => 'map_prealloc_vm_hard',
+  boot_val => '16',
+  min => '1',
+  max => 'INT_MAX / 2',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'map_prealloc_vm_low', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Soft low-water mark in blocks before Umbra VM preallocation is considered.',
+  flags => 'GUC_UNIT_BLOCKS',
+  variable => 'map_prealloc_vm_low',
+  boot_val => '64',
+  min => '1',
+  max => 'INT_MAX / 2',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'mapwriter_delay', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Umbra map writer sleep time between rounds.',
+  flags => 'GUC_UNIT_MS',
+  variable => 'MapWriterDelay',
+  boot_val => '200',
+  min => '1',
+  max => '10000',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'mapwriter_lru_maxpages', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Umbra map writer maximum number of MAP pages to flush per round.',
+  long_desc => '0 disables Umbra map writer cleaning.',
+  variable => 'MapWriterMaxPages',
+  boot_val => '100',
+  min => '0',
+  max => 'INT_MAX / 2',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'mapwriter_lru_multiplier', type => 'real', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Multiple of recent MAP allocation pressure to clean per Umbra map writer round.',
+  variable => 'MapWriterLRUMultiplier',
+  boot_val => '2.0',
+  min => '0.0',
+  max => '10.0',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'mapwriter_prealloc_max_relations', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Maximum number of relations preallocated by Umbra map writer per round.',
+  variable => 'MapWriterPreallocMaxRelations',
+  boot_val => '32',
+  min => '0',
+  max => 'INT_MAX / 2',
+  ifdef => 'USE_UMBRA',
+},
 { name => 'max_active_replication_origins', type => 'int', context => 'PGC_POSTMASTER', group => 'REPLICATION_SUBSCRIBERS',
   short_desc => 'Sets the maximum number of active replication origins.',
   variable => 'max_active_replication_origins',
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 290ccbc543..8d87958923 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -67,6 +67,7 @@
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/mapwriter.h"
 #include "postmaster/postmaster.h"
 #include "postmaster/startup.h"
 #include "postmaster/syslogger.h"
@@ -83,6 +84,7 @@
 #include "storage/fd.h"
 #include "storage/io_worker.h"
 #include "storage/large_object.h"
+#include "storage/map.h"
 #include "storage/pg_shmem.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4f2bbf0529..477d75f7e9 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -135,6 +135,8 @@

 #shared_buffers = 128MB                 # min 128kB
                                         # (change requires restart)
+#map_superblocks = 262144               # dedicated Umbra MAP superblock slots
+                                        # (change requires restart)
 #huge_pages = try                       # on, off, or try
                                         # (change requires restart)
 #huge_page_size = 0                     # zero for system default
diff --git a/src/include/postmaster/mapwriter.h b/src/include/postmaster/mapwriter.h
new file mode 100644
index 0000000000..6c984922b0
--- /dev/null
+++ b/src/include/postmaster/mapwriter.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * mapwriter.h
+ *	  Exports for Umbra map background workers.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/include/postmaster/mapwriter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MAPWRITER_H
+#define MAPWRITER_H
+
+extern PGDLLIMPORT int MapWriterDelay;
+extern PGDLLIMPORT int MapWriterMaxPages;
+extern PGDLLIMPORT int MapWriterPreallocMaxRelations;
+extern PGDLLIMPORT double MapWriterLRUMultiplier;
+
+extern void MapBackgroundWorkersRegister(void);
+extern void MapWriterMain(Datum arg);
+
+#endif							/* MAPWRITER_H */
diff --git a/src/include/storage/map.h b/src/include/storage/map.h
index ccbc392835..c61414fd16 100644
--- a/src/include/storage/map.h
+++ b/src/include/storage/map.h
@@ -88,6 +88,7 @@ typedef struct MapSharedData
 	pg_atomic_uint32 next_victim_buffer;
 	slock_t		clock_lock;
 	int			first_free_buffer;	/* head of free list, -1 if empty */
+	int			mapwriter_procno;	/* procno to wake, -1 if none */

 	/* statistics */
 	pg_atomic_uint32 num_allocs;
@@ -258,6 +259,10 @@ extern BlockNumber MapGetPhysicalBlockCount(UmbraFileContext *map_ctx,
 extern int	MapClockGetBuffer(void);
 extern void MapClockFreeBuffer(int slot_id);
 extern int	MapSyncStart(uint32 *complete_passes, uint32 *num_allocs);
+extern uint32 MapAllocPressurePeek(void);
+extern void MapStrategyNotifyWriter(int mapwriter_procno);
+extern void MapWakeWriter(void);
+extern int	MapPreallocStep(int max_relations);

 /* Map cache hash table (in mapclock.c) */
 extern int	MapCacheLookup(RelFileLocator rnode, ForkNumber forknum,
@@ -277,6 +282,15 @@ extern void MapInvalidateBuffer(int slot_id, RelFileLocator expected_rnode,
 /* GUCs */
 extern int	map_buffers;
 extern int	map_superblocks;
+extern int	map_prealloc_main_low;
+extern int	map_prealloc_main_hard;
+extern int	map_prealloc_main_batch;
+extern int	map_prealloc_fsm_low;
+extern int	map_prealloc_fsm_hard;
+extern int	map_prealloc_fsm_batch;
+extern int	map_prealloc_vm_low;
+extern int	map_prealloc_vm_hard;
+extern int	map_prealloc_vm_batch;

 /* Global data (defined in map.c) */
 extern MapSharedData *MapShared;
diff --git a/src/include/storage/map_internal.h b/src/include/storage/map_internal.h
index acac29b018..368b3da15a 100644
--- a/src/include/storage/map_internal.h
+++ b/src/include/storage/map_internal.h
@@ -27,6 +27,9 @@ extern void MapResetAllTruncatePreloads(void);
 extern BlockNumber MapForkPageIndexToMapBlkno(ForkNumber forknum,
 											  BlockNumber fork_page_idx);
 extern BlockNumber MapLblknoToMapBlkno(ForkNumber forknum, BlockNumber lblkno);
+extern bool MapForkPreallocSettings(ForkNumber forknum, BlockNumber *soft_low,
+									BlockNumber *hard_low,
+									BlockNumber *batch_blocks);
 extern bool MapReserveNextPblkno(UmbraFileContext *map_ctx, RelFileLocator rnode,
 								 ForkNumber forknum, BlockNumber lblkno,
 								 BlockNumber *new_pblkno, bool nowait);
@@ -36,6 +39,10 @@ extern bool MapTryReserveFreshPblkno(UmbraFileContext *map_ctx,
 									 BlockNumber lblkno,
 									 BlockNumber *new_pblkno,
 									 bool nowait);
+extern bool MapMaybePreallocateFork(UmbraFileContext *map_ctx,
+									RelFileLocator rnode,
+									ForkNumber forknum,
+									bool background_mode);
 extern bool MapInflightTryClaim(UmbraFileContext *map_ctx,
 								RelFileLocator rnode,
 								ForkNumber forknum,
diff --git a/src/include/storage/mapsuper_internal.h b/src/include/storage/mapsuper_internal.h
index 960469538f..5d64ddec87 100644
--- a/src/include/storage/mapsuper_internal.h
+++ b/src/include/storage/mapsuper_internal.h
@@ -17,6 +17,9 @@
 #define MAPSUPER_FLAG_DIRTY		0x02
 #define MAPSUPER_FLAG_CORRUPT	0x04

+#define MAPSUPER_RUNTIME_FLAG_PREALLOC_MAIN	0x01
+#define MAPSUPER_RUNTIME_FLAG_PREALLOC_FSM	0x02
+#define MAPSUPER_RUNTIME_FLAG_PREALLOC_VM	0x04
 #define MAPSUPER_RUNTIME_FLAG_EXTENDING_MAIN	0x08
 #define MAPSUPER_RUNTIME_FLAG_EXTENDING_FSM	0x10
 #define MAPSUPER_RUNTIME_FLAG_EXTENDING_VM	0x20
@@ -143,6 +146,7 @@ extern MapSuperEntry *MapSuperEnsureEntryLocked(RelFileLocator rnode);
 extern void MapSuperDeleteEntry(RelFileLocator rnode);
 extern bool MapSuperForkExists(const MapSuperblock *super,
 							   ForkNumber forknum);
+extern uint32 MapSuperPreallocFlag(ForkNumber forknum);
 extern void MapSBlockBumpPhysicalState(UmbraFileContext *map_ctx,
 									   RelFileLocator rnode,
 									   ForkNumber forknum,
diff --git a/src/include/storage/umfile.h b/src/include/storage/umfile.h
index 8b7400140d..b965867572 100644
--- a/src/include/storage/umfile.h
+++ b/src/include/storage/umfile.h
@@ -61,6 +61,9 @@ extern void umfile_ctx_write(UmbraFileContext *ctx, ForkNumber forknum, BlockNum
 							 const char *buffer, int nbytes, bool skipFsync);
 extern void umfile_ctx_extend(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blkno,
 							  const char *buffer);
+extern bool umfile_ctx_preallocate_blocks(UmbraFileContext *ctx, ForkNumber forknum,
+										  UmFileNblocksMode mode,
+										  BlockNumber target_nblocks);
 extern void umfile_ctx_prefetch(UmbraFileContext *ctx, ForkNumber forknum, BlockNumber blkno);
 extern bool umfile_ctx_block_exists(UmbraFileContext *ctx, ForkNumber forknum,
 									BlockNumber blkno);
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 55a2de4df7..da020abc31 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -63,6 +63,7 @@ tests += {
       't/052_checkpoint_segment_missing.pl',
       't/053_umbra_map_superblock_watermark.pl',
       't/054_umbra_map_fork_policy.pl',
+      't/055_umbra_mapwriter_activity.pl',
       't/056_umbra_truncate_superblock.pl',
       't/057_umbra_remap_crash_consistency.pl',
       't/058_umbra_2pc_remap_recovery.pl',
@@ -76,6 +77,7 @@ tests += {
       't/070_umbra_hash_birth_block_remap.pl',
       't/071_umbra_skip_wal_dense_map.pl',
       't/072_umbra_ordinary_slim_block_remap.pl',
+      't/073_umbra_preallocate_guc.pl',
       't/074_umbra_torn_page_remap.pl',
     ],
   },
diff --git a/src/test/recovery/t/055_umbra_mapwriter_activity.pl b/src/test/recovery/t/055_umbra_mapwriter_activity.pl
new file mode 100644
index 0000000000..6ebae116d8
--- /dev/null
+++ b/src/test/recovery/t/055_umbra_mapwriter_activity.pl
@@ -0,0 +1,56 @@
+# Verify mapwriter process visibility and basic activity metadata.
+#
+# In UMBRA mode:
+# - map writer backend should exist in pg_stat_activity
+# - wait event should be MapwriterMain/MapwriterHibernate or NULL transiently
+#
+# In md mode, skip this test.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+my $node = PostgreSQL::Test::Cluster->new('master');
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq{
+autovacuum = off
+});
+$node->start();
+
+$node->safe_psql('postgres', q{CREATE TABLE umb_mapwriter_t(a int, b text);});
+
+my $mapwriter_cnt = $node->safe_psql(
+	'postgres',
+	q{SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'map writer';});
+is($mapwriter_cnt, '1', 'map writer backend exists');
+
+my $mapwriter_wait_ok = $node->safe_psql(
+	'postgres', q{
+SELECT count(*) > 0
+FROM pg_stat_activity
+WHERE backend_type = 'map writer'
+  AND (wait_event IN ('MapwriterMain', 'MapwriterHibernate')
+	   OR wait_event IS NULL);
+});
+is($mapwriter_wait_ok, 't', 'map writer wait event is expected');
+
+# Create allocation pressure and ensure map writer remains visible.
+$node->safe_psql(
+	'postgres', q{
+INSERT INTO umb_mapwriter_t
+SELECT g, repeat('w', 300) FROM generate_series(1, 30000) g;
+CHECKPOINT;
+});
+
+ok($node->poll_query_until('postgres',
+	q{SELECT count(*) = 1 FROM pg_stat_activity WHERE backend_type = 'map writer';},
+	't'),
+	'map writer remains alive under allocation pressure');
+
+done_testing();
diff --git a/src/test/recovery/t/073_umbra_preallocate_guc.pl b/src/test/recovery/t/073_umbra_preallocate_guc.pl
new file mode 100644
index 0000000000..f5089d50bb
--- /dev/null
+++ b/src/test/recovery/t/073_umbra_preallocate_guc.pl
@@ -0,0 +1,74 @@
+# Verify Umbra MAIN-fork preallocation publishes capacity only after the
+# underlying file has been extended to cover it.
+#
+# In md mode, skip this test.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+sub u32le_from_hex
+{
+	my ($hex, $offset) = @_;
+	my $chunk = substr($hex, $offset * 2, 8);
+	my @b = ($chunk =~ /../g);
+
+	return hex($b[0]) +
+	  (hex($b[1]) << 8) +
+	  (hex($b[2]) << 16) +
+	  (hex($b[3]) << 24);
+}
+
+my $node = PostgreSQL::Test::Cluster->new('umbra_preallocate_guc');
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq{
+autovacuum = off
+map_prealloc_main_low = 64
+map_prealloc_main_hard = 64
+map_prealloc_main_batch = 128
+});
+$node->start();
+
+$node->safe_psql(
+	'postgres', q{
+CREATE TABLE umbra_prealloc_t(id int, payload text);
+ALTER TABLE umbra_prealloc_t ALTER COLUMN payload SET STORAGE PLAIN;
+INSERT INTO umbra_prealloc_t
+SELECT g, repeat('x', 7000) FROM generate_series(1, 2000) g;
+CHECKPOINT;
+});
+
+my $main_path = $node->safe_psql(
+	'postgres',
+	q{SELECT pg_relation_filepath('umbra_prealloc_t');}
+);
+
+my $map_super_hex = $node->safe_psql(
+	'postgres',
+	q{SELECT encode(pg_read_binary_file(pg_relation_filepath('umbra_prealloc_t') || '_map', 0, 64, true), 'hex');}
+);
+
+my $next_free_main = u32le_from_hex($map_super_hex, 16);
+my $phys_capacity_main = u32le_from_hex($map_super_hex, 20);
+my $logical_main = u32le_from_hex($map_super_hex, 40);
+my $main_file_blocks = $node->safe_psql(
+	'postgres',
+	"SELECT ((pg_stat_file('$main_path')).size / current_setting('block_size')::int)::bigint;");
+
+cmp_ok($logical_main, '>', 0, 'table has non-zero logical size');
+cmp_ok($next_free_main, '>=', $logical_main,
+	'next_free_phys_block_main covers logical blocks');
+cmp_ok($phys_capacity_main, '>', $next_free_main,
+	'GUC-driven preallocation keeps capacity ahead of next_free');
+cmp_ok($main_file_blocks, '>=', $phys_capacity_main,
+	'MAIN fork file size covers published physical capacity');
+
+$node->stop;
+
+done_testing();
-- 
2.50.1 (Apple Git-155)
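
For readers following the per-segment loop in umfile_ctx_preallocate_blocks()
above, here is a minimal standalone sketch of the same arithmetic. It is not
code from the patch: the names sketch_plan_prealloc, SegTarget,
SKETCH_BLCKSZ and SKETCH_RELSEG_SIZE are hypothetical, and the constants
assume the default 8 kB block size and 1 GB segments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for BLCKSZ and RELSEG_SIZE (default build values). */
#define SKETCH_BLCKSZ		8192u
#define SKETCH_RELSEG_SIZE	131072u	/* blocks per 1 GB segment */

typedef struct SegTarget
{
	uint32_t	segno;			/* segment file index */
	uint32_t	seg_nblocks;	/* blocks the segment must hold afterwards */
	uint64_t	target_bytes;	/* resulting size of that segment file */
} SegTarget;

/*
 * Plan the per-segment size targets needed to grow a fork from cur_nblocks
 * to target_nblocks.  Returns the number of entries written to out[], which
 * is at most max_out.
 */
static int
sketch_plan_prealloc(uint32_t cur_nblocks, uint32_t target_nblocks,
					 SegTarget *out, int max_out)
{
	uint64_t	cur = cur_nblocks;
	int			n = 0;

	assert(out != NULL && max_out > 0);

	while (cur < target_nblocks && n < max_out)
	{
		uint64_t	segno = cur / SKETCH_RELSEG_SIZE;
		uint64_t	seg_start = segno * SKETCH_RELSEG_SIZE;
		uint64_t	seg_end = seg_start + SKETCH_RELSEG_SIZE;

		/* The last segment is only grown as far as the overall target. */
		if (seg_end > target_nblocks)
			seg_end = target_nblocks;

		out[n].segno = (uint32_t) segno;
		out[n].seg_nblocks = (uint32_t) (seg_end - seg_start);
		out[n].target_bytes = (uint64_t) out[n].seg_nblocks * SKETCH_BLCKSZ;
		n++;

		/* Continue with the first block of the next segment. */
		cur = seg_end;
	}

	return n;
}
```

Growing from 131000 to 131200 blocks, for example, yields two targets: segment
0 is filled to its full 131072 blocks, and segment 1 is sized to 128 blocks.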

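The map_prealloc_*_low, _hard and _batch GUCs in the patch above are
described only by their short_desc strings. One plausible reading of those
descriptions, sketched below, is a two-level watermark on the headroom between
published physical capacity and the next free physical block. This is an
assumption for illustration, not the patch's MapMaybePreallocateFork() logic;
all names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef enum SketchPreallocAction
{
	SKETCH_PREALLOC_NONE,		/* enough headroom, nothing to do */
	SKETCH_PREALLOC_BACKGROUND,	/* below soft mark: let the map writer top up */
	SKETCH_PREALLOC_FOREGROUND	/* below hard mark: caller preallocates now */
} SketchPreallocAction;

/*
 * Decide what to do given the next free physical block, the published
 * capacity, and the soft/hard low-water marks (all in blocks).
 */
static SketchPreallocAction
sketch_prealloc_action(uint32_t next_free, uint32_t capacity,
					   uint32_t soft_low, uint32_t hard_low)
{
	uint32_t	headroom;

	/* Assumed invariant: the hard mark is not above the soft mark. */
	assert(hard_low <= soft_low);

	headroom = (capacity > next_free) ? capacity - next_free : 0;

	if (headroom < hard_low)
		return SKETCH_PREALLOC_FOREGROUND;
	if (headroom < soft_low)
		return SKETCH_PREALLOC_BACKGROUND;
	return SKETCH_PREALLOC_NONE;
}

/* New capacity after one batch, clamped so it cannot wrap around. */
static uint32_t
sketch_prealloc_target(uint32_t capacity, uint32_t batch_blocks)
{
	uint64_t	target = (uint64_t) capacity + batch_blocks;

	return (target > UINT32_MAX) ? UINT32_MAX : (uint32_t) target;
}
```

With the main-fork boot values from guc_parameters.dat (low 512, hard 128,
batch 1024), a fork with 100 blocks of headroom would preallocate in the
foreground under this reading, and one with 400 blocks would be left to the
background writer.
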
#12

Mingwei Jia

i@nayishan.top

22 days ago

In reply to: Mingwei Jia (#4)

[RFC PATCH v2 RESEND 10/10] umbra: add patch 9 compactor framework and non-interference policy

---
src/backend/access/rmgrdesc/umbradesc.c | 11 +
src/backend/access/transam/umbra_xlog.c | 43 +
src/backend/postmaster/Makefile | 3 +-
src/backend/postmaster/bgworker.c | 4 +
src/backend/postmaster/mapcompactor.c | 151 ++++
src/backend/postmaster/mapwriter.c | 14 +
src/backend/postmaster/meson.build | 1 +
src/backend/storage/map/map.c | 12 +
src/backend/storage/map/mapbgproc.c | 748 +++++++++++++++++-
src/backend/storage/map/mapinit.c | 81 ++
src/backend/storage/map/mapsuper.c | 163 ++++
src/backend/storage/smgr/smgr.c | 44 ++
src/backend/storage/smgr/umbra.c | 24 +
src/backend/storage/sync/sync.c | 101 ++-
.../utils/activity/wait_event_names.txt | 2 +
src/backend/utils/adt/pgstatfuncs.c | 25 +
src/backend/utils/misc/guc_parameters.dat | 75 ++
src/include/access/umbra_xlog.h | 12 +
src/include/catalog/pg_proc.dat | 20 +
src/include/postmaster/mapwriter.h | 4 +
src/include/storage/map.h | 20 +
src/include/storage/map_internal.h | 1 +
src/include/storage/mapsuper_internal.h | 13 +
src/include/storage/smgr.h | 4 +
src/include/storage/sync.h | 3 +-
src/include/storage/umbra.h | 4 +
src/test/recovery/meson.build | 4 +
.../t/059_umbra_compactor_relocation.pl | 91 +++
.../060_umbra_reclaim_checkpoint_counters.pl | 82 ++
...64_umbra_mainfork_internal_reclaim_seg0.pl | 283 +++++++
...umbra_mainfork_middle_reclaim_keep_seg0.pl | 356 +++++++++
31 files changed, 2381 insertions(+), 18 deletions(-)
create mode 100644 src/backend/postmaster/mapcompactor.c
create mode 100644 src/test/recovery/t/059_umbra_compactor_relocation.pl
create mode 100644 src/test/recovery/t/060_umbra_reclaim_checkpoint_counters.pl
create mode 100644 src/test/recovery/t/064_umbra_mainfork_internal_reclaim_seg0.pl
create mode 100644 src/test/recovery/t/065_umbra_mainfork_middle_reclaim_keep_seg0.pl
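
The XLOG_UMBRA_RECLAIM_UNLINK redo branch in this patch treats ENOENT as
success, so replaying the same record twice is harmless. A minimal standalone
sketch of that idempotent-unlink pattern follows. It uses plain POSIX
unlink() rather than the patch's umunlinkfiletag(), and the helper name is
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Remove a file, treating "already gone" as success.
 *
 * Returns 0 if the file does not exist afterwards (removed now or absent
 * before the call), and -1 with errno set for any other failure.
 */
static int
sketch_unlink_tolerate_missing(const char *path)
{
	assert(path != NULL);

	if (unlink(path) == 0)
		return 0;

	/* A repeated replay finds nothing to remove; that is not an error. */
	if (errno == ENOENT)
		return 0;

	return -1;
}
```

Calling the helper twice on the same path returns 0 both times, which is the
property redo relies on when a reclaim target was already removed before the
record is replayed.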

diff --git a/src/backend/access/rmgrdesc/umbradesc.c b/src/backend/access/rmgrdesc/umbradesc.c
index a6b3e6e55e..07c5112974 100644
--- a/src/backend/access/rmgrdesc/umbradesc.c
+++ b/src/backend/access/rmgrdesc/umbradesc.c
@@ -78,6 +78,14 @@ umbra_desc(StringInfo buf, XLogReaderState *record)
 							 xlrec->entries[i].forknum,
 							 xlrec->entries[i].nblocks);
 	}
+	else if (info == XLOG_UMBRA_RECLAIM_UNLINK)
+	{
+		xl_umbra_reclaim_unlink *xlrec = (xl_umbra_reclaim_unlink *) rec;
+		RelPathStr	path = umbra_fork_relpath(xlrec->rlocator, xlrec->forknum);
+
+		appendStringInfo(buf, "%s seg %u reclaim_unlink",
+						 path.str, xlrec->segno);
+	}
 }

 const char *
@@ -99,6 +107,9 @@ umbra_identify(uint8 info)
 		case XLOG_UMBRA_SKIP_WAL_DENSE_MAP:
 			id = "SKIP_WAL_DENSE_MAP";
 			break;
+		case XLOG_UMBRA_RECLAIM_UNLINK:
+			id = "RECLAIM_UNLINK";
+			break;
 	}

 	return id;
diff --git a/src/backend/access/transam/umbra_xlog.c b/src/backend/access/transam/umbra_xlog.c
index 186eca102e..dd384a3f04 100644
--- a/src/backend/access/transam/umbra_xlog.c
+++ b/src/backend/access/transam/umbra_xlog.c
@@ -13,6 +13,7 @@
 #include "access/xloginsert.h"
 #include "storage/map.h"
 #include "storage/smgr.h"
+#include "storage/sync.h"
 #include "storage/umbra.h"
 #include "storage/umfile.h"

@@ -113,6 +114,23 @@ log_umbra_skip_wal_dense_map(RelFileLocator rlocator,
 					  XLOG_UMBRA_SKIP_WAL_DENSE_MAP | XLR_SPECIAL_REL_UPDATE);
 }

+XLogRecPtr
+log_umbra_reclaim_unlink(RelFileLocator rlocator, ForkNumber forknum,
+						 BlockNumber segno)
+{
+	xl_umbra_reclaim_unlink xlrec;
+
+	xlrec.rlocator = rlocator;
+	xlrec.forknum = forknum;
+	xlrec.segno = segno;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+	return XLogInsert(RM_UMBRA_ID,
+					  XLOG_UMBRA_RECLAIM_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
 void
 umbra_redo(XLogReaderState *record)
 {
@@ -317,6 +335,31 @@ umbra_redo(XLogReaderState *record)
 			}
 			break;

+		case XLOG_UMBRA_RECLAIM_UNLINK:
+			{
+				xl_umbra_reclaim_unlink *xlrec;
+				FileTag		tag;
+				char		path[MAXPGPATH];
+				int			ret;
+
+				xlrec = (xl_umbra_reclaim_unlink *) XLogRecGetData(record);
+				tag.handler = SYNC_HANDLER_UMBRA;
+				tag.forknum = xlrec->forknum;
+				tag.rlocator = xlrec->rlocator;
+				tag.segno = (uint64) xlrec->segno;
+
+				/*
+				 * Recovery consumes reclaim targets eagerly. ENOENT is fine
+				 * because replay can race with prior deletion on the same path.
+				 */
+				ret = umunlinkfiletag(&tag, path);
+				if (ret < 0 && errno != ENOENT)
+					ereport(WARNING,
+							(errcode_for_file_access(),
+							 errmsg("could not remove file \"%s\": %m", path)));
+			}
+			break;
+
 		default:
 			elog(PANIC, "umbra_redo: unknown op code %u", info);
 	}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 05cb330024..f54d704ae3 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -32,7 +32,8 @@ OBJS = \

 ifeq ($(with_umbra), yes)
 OBJS += \
-	mapwriter.o
+	mapwriter.o \
+	mapcompactor.o
 endif

 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 45f0abf94a..915de09219 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -175,6 +175,10 @@ static const struct
 	{
 		.fn_name = "MapWriterMain",
 		.fn_addr = MapWriterMain
+	},
+	{
+		.fn_name = "MapCompactorMain",
+		.fn_addr = MapCompactorMain
 	}
 #endif
 };
diff --git a/src/backend/postmaster/mapcompactor.c b/src/backend/postmaster/mapcompactor.c
new file mode 100644
index 0000000000..f657435849
--- /dev/null
+++ b/src/backend/postmaster/mapcompactor.c
@@ -0,0 +1,151 @@
+/*-------------------------------------------------------------------------
+ *
+ * mapcompactor.c
+ *	  Umbra map compactor background worker.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/mapcompactor.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <signal.h>
+#include <unistd.h>
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgworker.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/mapwriter.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/map.h"
+#include "storage/proc.h"
+#include "storage/procnumber.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+#define MAPCOMPACTOR_HIBERNATE_FACTOR 50
+
+int			MapCompactorDelay = 200;
+int			MapCompactorMaxRelations = 8;
+int			MapCompactorBusyAllocThreshold = 128;
+
+static void
+MapCompactorExitCallback(int code, Datum arg)
+{
+	(void) code;
+	(void) arg;
+	MapStrategyNotifyCompactor(INVALID_PROC_NUMBER);
+}
+
+void
+MapCompactorMain(Datum arg)
+{
+	sigjmp_buf	local_sigjmp_buf;
+	MemoryContext mapcompactor_context;
+	bool		prev_hibernate = false;
+
+	(void) arg;
+	before_shmem_exit(MapCompactorExitCallback, 0);
+
+	pqsignal(SIGHUP, SignalHandlerForConfigReload);
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+	pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+	pqsignal(SIGUSR2, SIG_IGN);
+	pqsignal(SIGCHLD, SIG_DFL);
+
+	BackgroundWorkerUnblockSignals();
+	BackgroundWorkerInitializeConnectionByOid(InvalidOid, InvalidOid, 0);
+
+	mapcompactor_context = AllocSetContextCreate(TopMemoryContext,
+												 "Map Compactor",
+												 ALLOCSET_DEFAULT_SIZES);
+	MemoryContextSwitchTo(mapcompactor_context);
+
+	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+	{
+		error_context_stack = NULL;
+		HOLD_INTERRUPTS();
+		EmitErrorReport();
+
+		LWLockReleaseAll();
+		ConditionVariableCancelSleep();
+		pgstat_report_wait_end();
+		MapBackendExitCleanup();
+		AtEOXact_Buffers(false);
+		AtEOXact_SMgr();
+		AtEOXact_Files(false);
+		AtEOXact_HashTables(false);
+
+		MemoryContextSwitchTo(mapcompactor_context);
+		FlushErrorState();
+		MemoryContextReset(mapcompactor_context);
+		RESUME_INTERRUPTS();
+
+		pg_usleep(1000000L);
+		smgrreleaseall();
+	}
+
+	PG_exception_stack = &local_sigjmp_buf;
+
+	for (;;)
+	{
+		int			compact_moves = 0;
+		uint32		alloc_pressure = 0;
+		bool		busy_round = false;
+
+		ResetLatch(MyLatch);
+		ProcessMainLoopInterrupts();
+
+		alloc_pressure = MapAllocPressurePeek();
+		busy_round = (MapCompactorBusyAllocThreshold > 0 &&
+					  alloc_pressure >= (uint32) MapCompactorBusyAllocThreshold);
+
+		if (!busy_round && MapCompactorMaxRelations > 0)
+			compact_moves = MapCompactorStep(MapCompactorMaxRelations);
+
+		if (FirstCallSinceLastCheckpoint())
+			smgrreleaseall();
+
+		MapStrategyNotifyCompactor(MyProcNumber);
+		if (busy_round)
+		{
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 MapCompactorDelay * MAPCOMPACTOR_HIBERNATE_FACTOR,
+							 WAIT_EVENT_MAPCOMPACTOR_HIBERNATE);
+		}
+		else if (WaitLatch(MyLatch,
+						   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+						   MapCompactorDelay,
+						   WAIT_EVENT_MAPCOMPACTOR_MAIN) == WL_TIMEOUT &&
+				 compact_moves == 0 &&
+				 prev_hibernate)
+		{
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 MapCompactorDelay * MAPCOMPACTOR_HIBERNATE_FACTOR,
+							 WAIT_EVENT_MAPCOMPACTOR_HIBERNATE);
+		}
+		MapStrategyNotifyCompactor(INVALID_PROC_NUMBER);
+		prev_hibernate = (compact_moves == 0 || busy_round);
+	}
+}
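Aside for readers of the loop above (illustrative only, not part of the patch): the sleep policy in MapCompactorMain can be modelled as a pure function. This is a minimal sketch, assuming every WaitLatch call ends by timeout rather than by a latch wakeup. The names compactor_sleep_ms and next_prev_hibernate are hypothetical and do not exist in the patch.

```c
#include <stdbool.h>

/* Mirrors MAPCOMPACTOR_HIBERNATE_FACTOR from the patch. */
#define HIBERNATE_FACTOR 50

/*
 * Hypothetical helper: milliseconds one loop iteration sleeps, assuming
 * every wait runs to its timeout (no latch wakeups).
 */
static long
compactor_sleep_ms(int delay_ms, bool busy_round, int compact_moves,
                   bool prev_hibernate)
{
    long nap = delay_ms;
    long hibernate = (long) delay_ms * HIBERNATE_FACTOR;

    if (busy_round)
        return hibernate;           /* allocation pressure: long wait only */
    if (compact_moves == 0 && prev_hibernate)
        return nap + hibernate;     /* idle again: nap, then hibernate */
    return nap;                     /* made progress, or first idle round */
}

/* Hypothetical helper: the value carried into the next iteration. */
static bool
next_prev_hibernate(bool busy_round, int compact_moves)
{
    return compact_moves == 0 || busy_round;
}
```

With the default MapCompactorDelay of 200 ms, a busy round waits 200 * 50 = 10000 ms, and a second consecutive idle round waits the 200 ms nap followed by the same 10000 ms hibernation.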
diff --git a/src/backend/postmaster/mapwriter.c b/src/backend/postmaster/mapwriter.c
index e659b6be94..3cd81815cb 100644
--- a/src/backend/postmaster/mapwriter.c
+++ b/src/backend/postmaster/mapwriter.c
@@ -32,6 +32,7 @@
 #include "storage/procnumber.h"
 #include "storage/procsignal.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/wait_event.h"

@@ -67,6 +68,19 @@ MapBackgroundWorkersRegister(void)
 	bgw.bgw_notify_pid = 0;
 	bgw.bgw_main_arg = (Datum) 0;
 	RegisterBackgroundWorker(&bgw);
+
+	memset(&bgw, 0, sizeof(bgw));
+	bgw.bgw_flags = BGWORKER_SHMEM_ACCESS |
+		BGWORKER_BACKEND_DATABASE_CONNECTION;
+	bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
+	snprintf(bgw.bgw_function_name, BGW_MAXLEN, "MapCompactorMain");
+	snprintf(bgw.bgw_name, BGW_MAXLEN, "Umbra mapcompactor");
+	snprintf(bgw.bgw_type, BGW_MAXLEN, "map compactor");
+	bgw.bgw_restart_time = 5;
+	bgw.bgw_notify_pid = 0;
+	bgw.bgw_main_arg = (Datum) 0;
+	RegisterBackgroundWorker(&bgw);
 }

 void
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index 0a30057703..780078e366 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -22,5 +22,6 @@ backend_sources += files(
 if get_option('umbra').enabled()
   backend_sources += files(
     'mapwriter.c',
+    'mapcompactor.c',
   )
 endif
diff --git a/src/backend/storage/map/map.c b/src/backend/storage/map/map.c
index 6793db8671..62d3db5cdf 100644
--- a/src/backend/storage/map/map.c
+++ b/src/backend/storage/map/map.c
@@ -669,6 +669,8 @@ MapTryReserveFreshPblknoInternal(UmbraFileContext *map_ctx, RelFileLocator rnode
 							rnode.spcOid, rnode.dbOid, rnode.relNumber, forknum)));
 		}

+		Assert(MapNormalizeForkBlockCount(forknum,
+										  MapSuperGetReclaimBoundary(entry, forknum)) <= next);
 		MapSuperSetReservedNextFree(entry, forknum, next + 1);
 		Assert(MapNormalizeForkBlockCount(forknum,
 										  MapSuperblockGetNextFreePhysBlock(&entry->super,
@@ -1197,6 +1199,13 @@ MapInvalidateRelation(RelFileLocator rnode)

 	/* Remove dedicated superblock cache entry for this relation. */
 	MapSuperDeleteEntry(rnode);
+
+	/*
+	 * Relation lifecycle ended (drop/unlink path). Purge queued reclaim tasks
+	 * so post-checkpoint workers cannot act on a future relation that reuses
+	 * the same relfilenode.
+	 */
+	MapReclaimForgetRelation(rnode);
 }

 static bool
@@ -1293,7 +1302,10 @@ MapInvalidateDatabaseTablespaces(Oid dbid, int ntablespaces,
 		}

 		for (i = 0; i < target_count; i++)
+		{
 			MapSuperDeleteEntry(targets[i]);
+			MapReclaimForgetRelation(targets[i]);
+		}

 		pfree(targets);
 	}
diff --git a/src/backend/storage/map/mapbgproc.c b/src/backend/storage/map/mapbgproc.c
index 3bb167bae9..226abddb1d 100644
--- a/src/backend/storage/map/mapbgproc.c
+++ b/src/backend/storage/map/mapbgproc.c
@@ -7,17 +7,67 @@
  */
 #include "postgres.h"

+#include "access/umbra_xlog.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
-#include "storage/latch.h"
+#include "storage/buf_internals.h"
 #include "storage/map.h"
 #include "storage/map_internal.h"
 #include "storage/mapsuper_internal.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"

 static bool MapForkNeedsPrealloc(const MapSuperEntry *entry, ForkNumber forknum,
 								 bool background_mode);
+static bool MapSegmentHasLiveReferences(UmbraFileContext *map_ctx,
+										RelFileLocator rnode,
+										ForkNumber forknum,
+										BlockNumber segno);
+static void MapReclaimInitFileTag(FileTag *tag, RelFileLocator rnode,
+								  ForkNumber forknum, BlockNumber segno);
+static bool MapReclaimRegisterUnlinkRequest(RelFileLocator rnode,
+											ForkNumber forknum,
+											BlockNumber segno);
+static bool MapReclaimEnqueueSegment(UmbraFileContext *map_ctx,
+									 RelFileLocator rnode, ForkNumber forknum,
+									 BlockNumber segno);
+static void MapReclaimRegisterRelationFilterRequest(RelFileLocator rnode);
+static bool MapReclaimEnqueue(UmbraFileContext *map_ctx, RelFileLocator rnode,
+							  ForkNumber forknum,
+							  BlockNumber extent_no,
+							  BlockNumber extent_blocks);
+static bool MapSegmentFullyBelowReclaimBoundary(BlockNumber boundary_pblk,
+												BlockNumber segno);
+static bool MapExtentFullyBelowReclaimBoundary(BlockNumber boundary_pblk,
+											   BlockNumber extent_no,
+											   BlockNumber extent_blocks);
+static BlockNumber MapCompactorAdvanceBoundaryFromSet(BlockNumber boundary_pblk,
+													  BlockNumber next_free_pblk,
+													  HTAB *committed_pblks);
+static bool MapCompactorRelocateEntry(UmbraFileContext *map_ctx,
+									  RelFileLocator rnode,
+									  ForkNumber forknum,
+									  BlockNumber lblkno,
+									  BlockNumber old_pblkno);
+static int	MapCompactorAnalyzeFork(UmbraFileContext *map_ctx,
+									RelFileLocator rnode,
+									ForkNumber forknum,
+									int max_moves,
+									int *moves_done);
+
+typedef struct MapExtentLiveEntry
+{
+	BlockNumber	extent_no;
+	uint32		live_blocks;
+} MapExtentLiveEntry;
+
+typedef struct MapBoundaryPblkEntry
+{
+	BlockNumber	pblkno;
+} MapBoundaryPblkEntry;

 uint32
 MapAllocPressurePeek(void)
@@ -48,6 +98,29 @@ MapWakeWriter(void)
 		SetLatch(&ProcGlobal->allProcs[mapwriter_procno].procLatch);
 }

+void
+MapStrategyNotifyCompactor(int mapcompactor_procno)
+{
+	SpinLockAcquire(&MapShared->clock_lock);
+	MapShared->mapcompactor_procno = mapcompactor_procno;
+	SpinLockRelease(&MapShared->clock_lock);
+}
+
+void
+MapWakeCompactor(void)
+{
+	int			mapcompactor_procno = -1;
+
+	SpinLockAcquire(&MapShared->clock_lock);
+	mapcompactor_procno = MapShared->mapcompactor_procno;
+	if (mapcompactor_procno != -1)
+		MapShared->mapcompactor_procno = -1;
+	SpinLockRelease(&MapShared->clock_lock);
+
+	if (mapcompactor_procno != -1)
+		SetLatch(&ProcGlobal->allProcs[mapcompactor_procno].procLatch);
+}
+
 bool
 MapMaybePreallocateFork(UmbraFileContext *map_ctx, RelFileLocator rnode,
 						ForkNumber forknum, bool background_mode)
@@ -180,8 +253,7 @@ MapMaybePreallocateFork(UmbraFileContext *map_ctx, RelFileLocator rnode,
 			entry->in_use &&
 			(entry->flags & MAPSUPER_FLAG_VALID) != 0 &&
 			MapNormalizeForkBlockCount(forknum,
-									   MapSuperblockGetPhysCapacity(&entry->super,
-																   forknum)) < target_nblocks)
+									   MapSuperblockGetPhysCapacity(&entry->super, forknum)) < target_nblocks)
 		{
 			XLogRecPtr	map_lsn = GetXLogWriteRecPtr();

@@ -200,6 +272,50 @@ MapMaybePreallocateFork(UmbraFileContext *map_ctx, RelFileLocator rnode,
 	return prealloc_ok;
 }

+static bool
+MapSegmentFullyBelowReclaimBoundary(BlockNumber boundary_pblk, BlockNumber segno)
+{
+	uint64		seg_end;
+
+	seg_end = ((uint64) segno + 1) * (uint64) RELSEG_SIZE;
+	return seg_end <= (uint64) boundary_pblk;
+}
+
+static bool
+MapExtentFullyBelowReclaimBoundary(BlockNumber boundary_pblk,
+								   BlockNumber extent_no,
+								   BlockNumber extent_blocks)
+{
+	uint64		extent_end;
+
+	if (extent_blocks == 0)
+		return false;
+
+	extent_end = ((uint64) extent_no + 1) * (uint64) extent_blocks;
+	return extent_end <= (uint64) boundary_pblk;
+}
+
+static BlockNumber
+MapCompactorAdvanceBoundaryFromSet(BlockNumber boundary_pblk,
+								   BlockNumber next_free_pblk,
+								   HTAB *committed_pblks)
+{
+	while (boundary_pblk < next_free_pblk)
+	{
+		MapBoundaryPblkEntry *entry;
+
+		entry = (MapBoundaryPblkEntry *) hash_search(committed_pblks,
+													 &boundary_pblk,
+													 HASH_FIND,
+													 NULL);
+		if (entry == NULL)
+			break;
+		boundary_pblk++;
+	}
+
+	return boundary_pblk;
+}
+
 static bool
 MapForkNeedsPrealloc(const MapSuperEntry *entry, ForkNumber forknum,
 					 bool background_mode)
@@ -231,7 +347,7 @@ MapForkNeedsPrealloc(const MapSuperEntry *entry, ForkNumber forknum,
 	next = MapSuperGetReservedNextFree(entry, forknum);
 	capacity = MapNormalizeForkBlockCount(forknum,
 										  MapSuperblockGetPhysCapacity(&entry->super,
-																   forknum));
+																	   forknum));

 	if (next < soft_low)
 		return false;
@@ -249,6 +365,158 @@ MapForkNeedsPrealloc(const MapSuperEntry *entry, ForkNumber forknum,
 	return true;
 }

+static bool
+MapSegmentHasLiveReferences(UmbraFileContext *map_ctx, RelFileLocator rnode,
+							ForkNumber forknum, BlockNumber segno)
+{
+	BlockNumber	n_lblknos = 0;
+	BlockNumber	n_map_pages;
+	BlockNumber	current_page = InvalidBlockNumber;
+	BlockNumber	page_idx;
+	BlockNumber	page_count;
+	int			current_slot = -1;
+
+	if (!MapSBlockTryGetLogicalNblocks(map_ctx, rnode, forknum, &n_lblknos) ||
+		n_lblknos == 0)
+		return false;
+
+	if (!umfile_ctx_fork_exists(map_ctx, UMBRA_METADATA_FORKNUM,
+								UMFILE_EXISTS_DENSE))
+		return false;
+	n_map_pages = umfile_ctx_get_nblocks(map_ctx, UMBRA_METADATA_FORKNUM,
+										 UMFILE_NBLOCKS_DENSE);
+	if (n_map_pages == 0)
+		return false;
+
+	page_count = (n_lblknos + MAP_ENTRIES_PER_PAGE - 1) / MAP_ENTRIES_PER_PAGE;
+	for (page_idx = 0; page_idx < page_count; page_idx++)
+	{
+		BlockNumber	page_no = MapForkPageIndexToMapBlkno(forknum, page_idx);
+		int			entry_idx;
+		int			limit_idx;
+		MapPage	   *page;
+		MapBufferDesc *buf;
+
+		if (page_no >= n_map_pages)
+			break;
+
+		if (page_no != current_page)
+		{
+			if (current_slot >= 0)
+				MapUnpinBuffer(current_slot);
+			current_slot = MapReadBuffer(map_ctx, rnode, forknum, page_no);
+			current_page = page_no;
+		}
+
+		buf = &MapBuffers[current_slot];
+		page = MapGetPage(current_slot);
+		limit_idx = MAP_ENTRIES_PER_PAGE;
+		if (page_idx == page_count - 1 && (n_lblknos % MAP_ENTRIES_PER_PAGE) != 0)
+			limit_idx = n_lblknos % MAP_ENTRIES_PER_PAGE;
+
+		LWLockAcquire(&buf->buffer_lock, LW_SHARED);
+		for (entry_idx = 0; entry_idx < limit_idx; entry_idx++)
+		{
+			BlockNumber pblkno = page->pblknos[entry_idx];
+
+			if (pblkno != InvalidBlockNumber &&
+				pblkno / ((BlockNumber) RELSEG_SIZE) == segno)
+			{
+				LWLockRelease(&buf->buffer_lock);
+				if (current_slot >= 0)
+					MapUnpinBuffer(current_slot);
+				return true;
+			}
+		}
+		LWLockRelease(&buf->buffer_lock);
+	}
+
+	if (current_slot >= 0)
+		MapUnpinBuffer(current_slot);
+
+	return false;
+}
+
+static void
+MapReclaimInitFileTag(FileTag *tag, RelFileLocator rnode,
+					  ForkNumber forknum, BlockNumber segno)
+{
+	tag->handler = SYNC_HANDLER_UMBRA;
+	tag->forknum = forknum;
+	tag->rlocator = rnode;
+	tag->segno = (uint64) segno;
+}
+
+static bool
+MapReclaimRegisterUnlinkRequest(RelFileLocator rnode, ForkNumber forknum,
+								BlockNumber segno)
+{
+	FileTag		tag;
+
+	MapReclaimInitFileTag(&tag, rnode, forknum, segno);
+	return RegisterSyncRequest(&tag, SYNC_RECLAIM_REQUEST,
+							   true /* retryOnError */ );
+}
+
+static bool
+MapReclaimEnqueueSegment(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						 ForkNumber forknum, BlockNumber segno)
+{
+	BlockNumber	reclaim_boundary_pblk;
+
+	if (!MapSBlockTryGetReclaimBoundary(map_ctx, rnode, forknum,
+										&reclaim_boundary_pblk))
+		return false;
+	if (!MapSegmentFullyBelowReclaimBoundary(reclaim_boundary_pblk, segno))
+		return false;
+
+	if (!MapReclaimRegisterUnlinkRequest(rnode, forknum, segno))
+		return false;
+
+	elog(DEBUG1,
+		 "map reclaim enqueue rel %u/%u/%u fork %u seg %u",
+		 rnode.spcOid, rnode.dbOid, rnode.relNumber,
+		 forknum, segno);
+
+	MapStatsAddReclaimEnqueued(1);
+	return true;
+}
+
+static void
+MapReclaimRegisterRelationFilterRequest(RelFileLocator rnode)
+{
+	FileTag		tag;
+
+	MapReclaimInitFileTag(&tag, rnode, InvalidForkNumber, InvalidBlockNumber);
+	(void) RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST,
+							   true /* retryOnError */ );
+}
+
+static bool
+MapReclaimEnqueue(UmbraFileContext *map_ctx, RelFileLocator rnode,
+				  ForkNumber forknum,
+				  BlockNumber extent_no, BlockNumber extent_blocks)
+{
+	uint64		start64;
+	BlockNumber	segno;
+
+	if (extent_blocks == 0)
+		return false;
+
+	start64 = (uint64) extent_no * (uint64) extent_blocks;
+	if (start64 > (uint64) MaxBlockNumber)
+		return false;
+	segno = ((BlockNumber) start64) / ((BlockNumber) RELSEG_SIZE);
+
+	return MapReclaimEnqueueSegment(map_ctx, rnode, forknum, segno);
+}
+
+void
+MapReclaimForgetRelation(RelFileLocator rnode)
+{
+	MapReclaimRegisterRelationFilterRequest(rnode);
+}
+
 int
 MapPreallocStep(int max_relations)
 {
@@ -321,3 +589,475 @@ MapPreallocStep(int max_relations)

 	return prealloc_ops;
 }
+
+static bool
+MapCompactorRelocateEntry(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						  ForkNumber forknum, BlockNumber lblkno,
+						  BlockNumber old_pblkno)
+{
+	BlockNumber	map_blkno;
+	int			entry_idx;
+	int			slot_id;
+	MapPage	   *page;
+	MapBufferDesc *buf;
+	BlockNumber	cur_pblkno;
+	BlockNumber	new_pblkno;
+	XLogRecPtr	map_lsn;
+	char		pagebuf[BLCKSZ];
+
+	if (!MapTryReserveFreshPblkno(map_ctx, rnode, forknum, lblkno,
+								  &new_pblkno, true))
+	{
+		return false;
+	}
+
+	if (!umfile_ctx_block_exists(map_ctx, forknum, old_pblkno))
+	{
+		MapInflightRelease(rnode, forknum, lblkno);
+		return false;
+	}
+
+	/*
+	 * Compactor relocation is a raw physical copy, not a shared-buffer page
+	 * copy. If a newer image is still resident/dirty in the buffer pool, this
+	 * path will not refresh that buffer's PageLSN or cached contents.
+	 */
+	umfile_ctx_prefetch(map_ctx, forknum, old_pblkno);
+	umfile_ctx_read(map_ctx, forknum, old_pblkno, pagebuf, BLCKSZ);
+	umfile_ctx_extend(map_ctx, forknum, new_pblkno, pagebuf);
+
+	umfile_ctx_register_dirty(map_ctx, forknum, new_pblkno, false, false);
+
+	map_blkno = MapLblknoToMapBlkno(forknum, lblkno);
+	entry_idx = map_blkno % MAP_ENTRIES_PER_PAGE;
+	map_blkno = map_blkno / MAP_ENTRIES_PER_PAGE;
+
+	slot_id = MapReadBuffer(map_ctx, rnode, forknum, map_blkno);
+	buf = &MapBuffers[slot_id];
+	page = MapGetPage(slot_id);
+	if (!LWLockConditionalAcquire(&buf->buffer_lock, LW_EXCLUSIVE))
+	{
+		MapUnpinBuffer(slot_id);
+		MapInflightRelease(rnode, forknum, lblkno);
+		return false;
+	}
+
+	cur_pblkno = page->pblknos[entry_idx];
+	if (cur_pblkno != old_pblkno)
+	{
+		LWLockRelease(&buf->buffer_lock);
+		MapUnpinBuffer(slot_id);
+		MapInflightRelease(rnode, forknum, lblkno);
+		return false;
+	}
+
+	map_lsn = log_umbra_map_set(rnode, forknum, lblkno, old_pblkno, new_pblkno);
+	page->pblknos[entry_idx] = new_pblkno;
+	MapMarkBufferDirty(map_ctx, buf, map_lsn);
+
+	LWLockRelease(&buf->buffer_lock);
+	MapUnpinBuffer(slot_id);
+	MapSBlockBumpPhysicalState(map_ctx, rnode, forknum, new_pblkno + 1,
+							   true, true, map_lsn);
+	MapInflightRelease(rnode, forknum, lblkno);
+	MapStatsAddCompactorRelocations(1);
+
+	return true;
+}
+
+static int
+MapCompactorAnalyzeFork(UmbraFileContext *map_ctx, RelFileLocator rnode,
+						ForkNumber forknum,
+						int max_moves, int *moves_done)
+{
+	HASHCTL		ctl;
+	HASHCTL		boundary_ctl;
+	HTAB	   *extent_live;
+	HTAB	   *boundary_committed = NULL;
+	bool	   *segment_live = NULL;
+	MemoryContext extent_ctx;
+	HASH_SEQ_STATUS seq;
+	MapExtentLiveEntry *live_entry;
+	BlockNumber	n_lblknos = 0;
+	BlockNumber	n_map_pages;
+	BlockNumber	current_page = InvalidBlockNumber;
+	BlockNumber	page_idx;
+	BlockNumber	page_count;
+	BlockNumber	max_extent_no = InvalidBlockNumber;
+	BlockNumber	extent_blocks;
+	BlockNumber	next_free_pblk = 0;
+	BlockNumber	reclaim_boundary_pblk = 0;
+	BlockNumber	advanced_boundary_pblk = 0;
+	BlockNumber	reclaim_seg_limit = 0;
+	int			current_slot = -1;
+	int			sparse_count = 0;
+	int			live_threshold;
+	int			moves_limit;
+	int			moved_here = 0;
+
+	if (!map_compactor_enable)
+		return 0;
+	if (!MapForkHasMappedState(forknum))
+		return 0;
+	if (map_compactor_extent_blocks <= 0)
+		return 0;
+	extent_blocks = (BlockNumber) map_compactor_extent_blocks;
+
+	if (!MapSBlockTryGetLogicalNblocks(map_ctx, rnode, forknum, &n_lblknos) ||
+		n_lblknos == 0)
+		return 0;
+	if (!MapSBlockTryGetNextFreePhysBlock(map_ctx, rnode, forknum,
+										  &next_free_pblk) ||
+		next_free_pblk == 0)
+		return 0;
+	if (!MapSBlockTryGetReclaimBoundary(map_ctx, rnode, forknum,
+										&reclaim_boundary_pblk))
+		return 0;
+	if (reclaim_boundary_pblk > next_free_pblk)
+		reclaim_boundary_pblk = next_free_pblk;
+	reclaim_seg_limit = reclaim_boundary_pblk / ((BlockNumber) RELSEG_SIZE);
+
+	if (!umfile_ctx_fork_exists(map_ctx, UMBRA_METADATA_FORKNUM,
+								UMFILE_EXISTS_DENSE))
+		return 0;
+	n_map_pages = umfile_ctx_get_nblocks(map_ctx, UMBRA_METADATA_FORKNUM,
+										 UMFILE_NBLOCKS_DENSE);
+	if (n_map_pages == 0)
+		return 0;
+	page_count = (n_lblknos + MAP_ENTRIES_PER_PAGE - 1) / MAP_ENTRIES_PER_PAGE;
+	if (page_count == 0)
+		return 0;
+
+	extent_ctx = AllocSetContextCreate(CurrentMemoryContext,
+									   "MapCompactorExtentLiveContext",
+									   ALLOCSET_DEFAULT_SIZES);
+
+	MemSet(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(BlockNumber);
+	ctl.entrysize = sizeof(MapExtentLiveEntry);
+	ctl.hcxt = extent_ctx;
+	extent_live = hash_create("Map Compactor Extent Live",
+							  1024,
+							  &ctl,
+							  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	if (reclaim_seg_limit > 0)
+		segment_live = MemoryContextAllocZero(extent_ctx,
+											  sizeof(bool) * reclaim_seg_limit);
+	if (reclaim_boundary_pblk < next_free_pblk)
+	{
+		MemSet(&boundary_ctl, 0, sizeof(boundary_ctl));
+		boundary_ctl.keysize = sizeof(BlockNumber);
+		boundary_ctl.entrysize = sizeof(MapBoundaryPblkEntry);
+		boundary_ctl.hcxt = extent_ctx;
+		boundary_committed = hash_create("Map Compactor Boundary Committed",
+										 1024,
+										 &boundary_ctl,
+										 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	for (page_idx = 0; page_idx < page_count; page_idx++)
+	{
+		BlockNumber	page_no = MapForkPageIndexToMapBlkno(forknum, page_idx);
+		int			entry_idx;
+		int			limit_idx;
+		MapPage	   *page;
+		MapBufferDesc *buf;
+		BlockNumber	extent_no;
+		BlockNumber	segno;
+		bool		found;
+
+		if (page_no >= n_map_pages)
+			break;
+
+		if (page_no != current_page)
+		{
+			if (current_slot >= 0)
+				MapUnpinBuffer(current_slot);
+			current_slot = MapReadBuffer(map_ctx, rnode, forknum, page_no);
+			current_page = page_no;
+		}
+
+		buf = &MapBuffers[current_slot];
+		page = MapGetPage(current_slot);
+		if (!LWLockConditionalAcquire(&buf->buffer_lock, LW_SHARED))
+			continue;
+		limit_idx = MAP_ENTRIES_PER_PAGE;
+		if (page_idx == page_count - 1 && (n_lblknos % MAP_ENTRIES_PER_PAGE) != 0)
+			limit_idx = n_lblknos % MAP_ENTRIES_PER_PAGE;
+		for (entry_idx = 0; entry_idx < limit_idx; entry_idx++)
+		{
+			BlockNumber pblkno = page->pblknos[entry_idx];
+
+			if (pblkno == InvalidBlockNumber)
+				continue;
+
+			if (boundary_committed != NULL &&
+				pblkno >= reclaim_boundary_pblk &&
+				pblkno < next_free_pblk)
+			{
+				(void) hash_search(boundary_committed,
+								   &pblkno,
+								   HASH_ENTER,
+								   NULL);
+			}
+
+			if (pblkno >= reclaim_boundary_pblk)
+				continue;
+
+			extent_no = pblkno / extent_blocks;
+			segno = pblkno / ((BlockNumber) RELSEG_SIZE);
+			if (max_extent_no == InvalidBlockNumber || extent_no > max_extent_no)
+				max_extent_no = extent_no;
+
+			live_entry = (MapExtentLiveEntry *) hash_search(extent_live,
+															&extent_no,
+															HASH_ENTER,
+															&found);
+			if (!found)
+				live_entry->live_blocks = 0;
+			if (live_entry->live_blocks < extent_blocks)
+				live_entry->live_blocks++;
+
+			if (segment_live != NULL && segno < reclaim_seg_limit)
+				segment_live[segno] = true;
+		}
+		LWLockRelease(&buf->buffer_lock);
+	}
+
+	if (current_slot >= 0)
+		MapUnpinBuffer(current_slot);
+
+	live_threshold = map_compactor_low_live_percent;
+	if (live_threshold < 1)
+		live_threshold = 1;
+	if (live_threshold > 100)
+		live_threshold = 100;
+	moves_limit = Max(0, max_moves);
+
+	hash_seq_init(&seq, extent_live);
+	while ((live_entry = (MapExtentLiveEntry *) hash_seq_search(&seq)) != NULL)
+	{
+		int			live_pct;
+
+		if (max_extent_no != InvalidBlockNumber &&
+			live_entry->extent_no == max_extent_no)
+			continue;
+		if (!MapExtentFullyBelowReclaimBoundary(reclaim_boundary_pblk,
+												live_entry->extent_no,
+												extent_blocks))
+			continue;
+
+		live_pct = (int) (((uint64) live_entry->live_blocks * 100) /
+						  Max((uint64) 1, (uint64) extent_blocks));
+		if (live_pct > live_threshold)
+			continue;
+
+		sparse_count++;
+	}
+
+	for (BlockNumber segno = 0; segno < reclaim_seg_limit; segno++)
+	{
+		if (segment_live != NULL && segment_live[segno])
+			continue;
+		if (!umfile_ctx_segment_exists(map_ctx, forknum, segno))
+			continue;
+		if (MapSegmentHasLiveReferences(map_ctx, rnode, forknum, segno))
+			continue;
+
+		(void) MapReclaimEnqueueSegment(map_ctx, rnode, forknum, segno);
+	}
+
+	if (sparse_count > 0 && moves_limit > 0)
+	{
+		BlockNumber	pass_page = InvalidBlockNumber;
+		int			pass_slot = -1;
+
+		for (page_idx = 0; page_idx < page_count; page_idx++)
+		{
+			BlockNumber	page_no = MapForkPageIndexToMapBlkno(forknum, page_idx);
+			int			entry_idx;
+			int			limit_idx;
+
+			if (moved_here >= moves_limit)
+				break;
+			if (moves_done != NULL && *moves_done >= moves_limit)
+				break;
+			if (page_no >= n_map_pages)
+				break;
+
+			if (page_no != pass_page)
+			{
+				if (pass_slot >= 0)
+					MapUnpinBuffer(pass_slot);
+				pass_slot = MapReadBuffer(map_ctx, rnode, forknum, page_no);
+				pass_page = page_no;
+			}
+
+			limit_idx = MAP_ENTRIES_PER_PAGE;
+			if (page_idx == page_count - 1 && (n_lblknos % MAP_ENTRIES_PER_PAGE) != 0)
+				limit_idx = n_lblknos % MAP_ENTRIES_PER_PAGE;
+			for (entry_idx = 0; entry_idx < limit_idx; entry_idx++)
+			{
+				BlockNumber lblkno = page_idx * MAP_ENTRIES_PER_PAGE + entry_idx;
+				MapPage	   *page;
+				MapBufferDesc *buf;
+				BlockNumber pblkno;
+				BlockNumber extent_no;
+				bool candidate = false;
+
+				buf = &MapBuffers[pass_slot];
+				page = MapGetPage(pass_slot);
+				if (!LWLockConditionalAcquire(&buf->buffer_lock, LW_SHARED))
+					continue;
+				pblkno = page->pblknos[entry_idx];
+				LWLockRelease(&buf->buffer_lock);
+
+				if (pblkno == InvalidBlockNumber)
+					continue;
+
+				extent_no = pblkno / extent_blocks;
+				if (max_extent_no != InvalidBlockNumber &&
+					extent_no == max_extent_no)
+					continue;
+				if (!MapExtentFullyBelowReclaimBoundary(reclaim_boundary_pblk,
+														extent_no,
+														extent_blocks))
+					continue;
+
+				live_entry = (MapExtentLiveEntry *) hash_search(extent_live,
+																&extent_no,
+																HASH_FIND,
+																NULL);
+				if (live_entry != NULL)
+				{
+					int live_pct = (int) (((uint64) live_entry->live_blocks * 100) /
+										  Max((uint64) 1, (uint64) extent_blocks));
+					if (live_pct <= live_threshold)
+						candidate = true;
+				}
+
+				if (!candidate || (moves_done != NULL ? *moves_done : moved_here) >= moves_limit)
+					continue;
+
+				if (MapCompactorRelocateEntry(map_ctx, rnode, forknum,
+											  lblkno, pblkno))
+				{
+					moved_here++;
+					if (moves_done != NULL)
+						(*moves_done)++;
+
+					live_entry = (MapExtentLiveEntry *) hash_search(extent_live,
+																	&extent_no,
+																	HASH_FIND,
+																	NULL);
+					if (live_entry != NULL && live_entry->live_blocks > 0)
+					{
+						live_entry->live_blocks--;
+						if (live_entry->live_blocks == 0)
+						{
+							BlockNumber	start_blkno = extent_no * extent_blocks;
+							BlockNumber	segno = start_blkno / ((BlockNumber) RELSEG_SIZE);
+
+							if (!MapSegmentHasLiveReferences(map_ctx, rnode, forknum, segno))
+								(void) MapReclaimEnqueue(map_ctx, rnode, forknum,
+														 extent_no, extent_blocks);
+						}
+					}
+				}
+			}
+		}
+
+		if (pass_slot >= 0)
+			MapUnpinBuffer(pass_slot);
+	}
+
+	if (boundary_committed != NULL)
+	{
+		advanced_boundary_pblk =
+			MapCompactorAdvanceBoundaryFromSet(reclaim_boundary_pblk,
+											   next_free_pblk,
+											   boundary_committed);
+		if (advanced_boundary_pblk > reclaim_boundary_pblk)
+			MapSBlockAdvanceReclaimBoundary(map_ctx, rnode, forknum,
+											advanced_boundary_pblk);
+	}
+
+	hash_destroy(extent_live);
+	MemoryContextDelete(extent_ctx);
+	return sparse_count;
+}
+
+int
+MapCompactorStep(int max_relations)
+{
+	static int	scan_slot = 0;
+	int			max_scan;
+	int			scanned = 0;
+	int			visited = 0;
+	int			total_moves = 0;
+	int			move_budget;
+
+	if (!map_compactor_enable || InRecovery || max_relations <= 0 ||
+		MapSuperCapacity <= 0)
+		return 0;
+
+	move_budget = Max(0, map_compactor_max_moves);
+
+	max_scan = Min(MapSuperCapacity, Max(64, max_relations * 8));
+	while (scanned < max_scan && visited < max_relations)
+	{
+		MapSuperEntry *entry;
+		RelFileLocator	rnode;
+		RelFileLocatorBackend rlocator;
+		UmbraFileContext *ctx;
+		bool		has_init_fork;
+		bool		has_map_fork;
+		entry = MapSuperEntryBySlot(scan_slot);
+		scan_slot = (scan_slot + 1) % MapSuperCapacity;
+		scanned++;
+
+		if (!LWLockConditionalAcquire(&entry->lock, LW_SHARED))
+			continue;
+		if (!entry->in_use)
+		{
+			LWLockRelease(&entry->lock);
+			continue;
+		}
+		rnode = entry->key.rnode;
+		LWLockRelease(&entry->lock);
+		visited++;
+
+		rlocator.locator = rnode;
+		rlocator.backend = INVALID_PROC_NUMBER;
+		ctx = umfile_ctx_acquire(rlocator);
+		if (ctx == NULL)
+			continue;
+
+		has_init_fork = umfile_ctx_fork_exists(ctx, INIT_FORKNUM,
+											   UMFILE_EXISTS_DENSE);
+		if (has_init_fork)
+			continue;
+		has_map_fork = umfile_ctx_fork_exists(ctx, UMBRA_METADATA_FORKNUM,
+											  UMFILE_EXISTS_DENSE);
+		if (!has_map_fork)
+			continue;
+
+		MapCompactorAnalyzeFork(ctx, rnode, MAIN_FORKNUM,
+								move_budget, &total_moves);
+		MapCompactorAnalyzeFork(ctx, rnode, FSM_FORKNUM,
+								move_budget, &total_moves);
+		MapCompactorAnalyzeFork(ctx, rnode, VISIBILITYMAP_FORKNUM,
+								move_budget, &total_moves);
+	}
+
+	/*
+	 * Fresh relations usually occupy low-numbered super slots first. If a
+	 * startup-time sweep observes only inactive slots, don't keep marching the
+	 * scan cursor forward in large strides, or newly created relations can sit
+	 * behind a full rotation before the compactor ever visits them.
+	 */
+	if (visited == 0)
+		scan_slot = 0;
+
+	return total_moves;
+}
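Aside on the compactor arithmetic above (illustrative only, not part of the patch): three of the decisions in mapbgproc.c are small enough to restate in isolation. This is a minimal sketch under stated assumptions. A plain bool array indexed by physical block number stands in for the HTAB used by MapCompactorAdvanceBoundaryFromSet, uint32_t stands in for BlockNumber, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True when the extent's last block lies strictly below the boundary.
 * The end offset is computed in 64 bits so a large extent_no cannot
 * wrap around in 32-bit arithmetic.
 */
static bool
extent_fully_below(uint32_t boundary, uint32_t extent_no,
                   uint32_t extent_blocks)
{
    uint64_t end;

    if (extent_blocks == 0)
        return false;
    end = ((uint64_t) extent_no + 1) * (uint64_t) extent_blocks;
    return end <= (uint64_t) boundary;
}

/* True when the live share, truncated to whole percent, is <= threshold. */
static bool
extent_is_sparse(uint32_t live_blocks, uint32_t extent_blocks,
                 int threshold_pct)
{
    int live_pct;

    if (extent_blocks == 0)
        return false;
    if (threshold_pct < 1)
        threshold_pct = 1;
    if (threshold_pct > 100)
        threshold_pct = 100;
    live_pct = (int) (((uint64_t) live_blocks * 100) / extent_blocks);
    return live_pct <= threshold_pct;
}

/*
 * Advance the boundary across a contiguous run of committed blocks.
 * committed[] must have at least next_free elements (an assumption of
 * this sketch; the patch probes a hash table instead).
 */
static uint32_t
advance_boundary(uint32_t boundary, uint32_t next_free,
                 const bool *committed)
{
    while (boundary < next_free && committed[boundary])
        boundary++;
    return boundary;
}
```

One consequence of the integer division: with 1024-block extents and a threshold of 10, an extent holding 112 live blocks still counts as sparse (11200 / 1024 truncates to 10), while 113 live blocks does not.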
diff --git a/src/backend/storage/map/mapinit.c b/src/backend/storage/map/mapinit.c
index c30057cf04..6014b7558d 100644
--- a/src/backend/storage/map/mapinit.c
+++ b/src/backend/storage/map/mapinit.c
@@ -35,6 +35,10 @@ int			map_prealloc_fsm_batch = 128; /* 1MB in 8k blocks */
 int			map_prealloc_vm_low = 64;	/* 512kB in 8k blocks */
 int			map_prealloc_vm_hard = 16;	/* 128kB in 8k blocks */
 int			map_prealloc_vm_batch = 128; /* 1MB in 8k blocks */
+bool		map_compactor_enable = false;
+int			map_compactor_extent_blocks = 1024;	/* 8MB in 8k blocks */
+int			map_compactor_low_live_percent = 10;
+int			map_compactor_max_moves = 16;

 /* Shared memory pointer */
 MapSharedData *MapShared = NULL;
@@ -115,8 +119,13 @@ MapShmemInit(void *arg)
 	MapShared->num_slots = map_buffers;
 	MapShared->first_free_buffer = 0;
 	MapShared->mapwriter_procno = -1;
+	MapShared->mapcompactor_procno = -1;
 	pg_atomic_init_u32(&MapShared->next_victim_buffer, 0);
 	pg_atomic_init_u32(&MapShared->num_allocs, 0);
+	pg_atomic_init_u64(&MapShared->map_compactor_relocations, 0);
+	pg_atomic_init_u64(&MapShared->map_reclaim_enqueued, 0);
+	pg_atomic_init_u64(&MapShared->map_reclaim_processed, 0);
+	pg_atomic_init_u64(&MapShared->map_reclaim_failed, 0);
 	MapShared->complete_passes = 0;
 	SpinLockInit(&MapShared->clock_lock);

@@ -156,3 +165,75 @@ MapShmemAttach(void *arg)

 	MapSuperTableShmemAttach();
 }
+
+void
+MapStatsAddCompactorRelocations(uint64 count)
+{
+	if (MapShared == NULL || count == 0)
+		return;
+
+	pg_atomic_fetch_add_u64(&MapShared->map_compactor_relocations, count);
+}
+
+void
+MapStatsAddReclaimEnqueued(uint64 count)
+{
+	if (MapShared == NULL || count == 0)
+		return;
+
+	pg_atomic_fetch_add_u64(&MapShared->map_reclaim_enqueued, count);
+}
+
+void
+MapStatsAddReclaimProcessed(uint64 count)
+{
+	if (MapShared == NULL || count == 0)
+		return;
+
+	pg_atomic_fetch_add_u64(&MapShared->map_reclaim_processed, count);
+}
+
+void
+MapStatsAddReclaimFailed(uint64 count)
+{
+	if (MapShared == NULL || count == 0)
+		return;
+
+	pg_atomic_fetch_add_u64(&MapShared->map_reclaim_failed, count);
+}
+
+uint64
+MapStatsGetCompactorRelocations(void)
+{
+	if (MapShared == NULL)
+		return 0;
+
+	return pg_atomic_read_u64(&MapShared->map_compactor_relocations);
+}
+
+uint64
+MapStatsGetReclaimEnqueued(void)
+{
+	if (MapShared == NULL)
+		return 0;
+
+	return pg_atomic_read_u64(&MapShared->map_reclaim_enqueued);
+}
+
+uint64
+MapStatsGetReclaimProcessed(void)
+{
+	if (MapShared == NULL)
+		return 0;
+
+	return pg_atomic_read_u64(&MapShared->map_reclaim_processed);
+}
+
+uint64
+MapStatsGetReclaimFailed(void)
+{
+	if (MapShared == NULL)
+		return 0;
+
+	return pg_atomic_read_u64(&MapShared->map_reclaim_failed);
+}
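Reviewer note on the counter helpers above: the add/get pairs all follow one pattern, a shared pointer that is NULL until shared memory is attached, plus an atomic 64-bit counter. The sketch below illustrates that pattern with C11 <stdatomic.h> as a stand-in for PostgreSQL's pg_atomic_* API. The names DemoMapStats, demo_stats_attach, demo_stats_add_enqueued and demo_stats_get_enqueued are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the counter fields added to MapSharedData. */
typedef struct DemoMapStats
{
	_Atomic uint64_t reclaim_enqueued;
} DemoMapStats;

/* Backing storage; in the patch this lives in shared memory. */
DemoMapStats demo_stats_storage;

/* NULL until "attached", mirroring the MapShared == NULL guard. */
DemoMapStats *demo_stats = NULL;

/* Assumed helper: initialize the counter and publish the pointer. */
void
demo_stats_attach(void)
{
	atomic_init(&demo_stats_storage.reclaim_enqueued, 0);
	demo_stats = &demo_stats_storage;
}

/* Adds are silently dropped before attach and for count == 0. */
void
demo_stats_add_enqueued(uint64_t count)
{
	if (demo_stats == NULL || count == 0)
		return;

	atomic_fetch_add(&demo_stats->reclaim_enqueued, count);
}

/* Reads return 0 before attach instead of dereferencing NULL. */
uint64_t
demo_stats_get_enqueued(void)
{
	if (demo_stats == NULL)
		return 0;

	return atomic_load(&demo_stats->reclaim_enqueued);
}
```

One consequence of the guard, visible in the sketch, is that increments made before attach are lost rather than reported, which seems acceptable for monitoring-only counters.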
diff --git a/src/backend/storage/map/mapsuper.c b/src/backend/storage/map/mapsuper.c
index e3e9421566..e0d370acd6 100644
--- a/src/backend/storage/map/mapsuper.c
+++ b/src/backend/storage/map/mapsuper.c
@@ -71,6 +71,10 @@ static BlockNumber MapSuperGetExtendingTarget(const MapSuperEntry *entry,
 static void MapSuperSetExtendingTarget(MapSuperEntry *entry,
 									   ForkNumber forknum,
 									   BlockNumber nblocks);
+static void MapSuperSetReclaimBoundary(MapSuperEntry *entry,
+									   ForkNumber forknum,
+									   BlockNumber boundary_pblk);
+static void MapSuperResetReclaimBoundaries(MapSuperEntry *entry);
 static bool MapSuperPrepareEntryForUpdate(UmbraFileContext *map_ctx,
 										  RelFileLocator rnode,
 										  XLogRecPtr map_lsn,
@@ -581,6 +585,9 @@ MapSuperEnsureEntryLocked(RelFileLocator rnode)
 	entry->extending_target_main = InvalidBlockNumber;
 	entry->extending_target_fsm = InvalidBlockNumber;
 	entry->extending_target_vm = InvalidBlockNumber;
+	entry->reclaim_boundary_main = 0;
+	entry->reclaim_boundary_fsm = 0;
+	entry->reclaim_boundary_vm = 0;
 	MapSuperIndex[insert_bucket].slot_id = slot_id;

 	LWLockAcquire(&entry->lock, LW_EXCLUSIVE);
@@ -618,6 +625,9 @@ MapSuperDeleteEntry(RelFileLocator rnode)
 		entry->extending_target_main = InvalidBlockNumber;
 		entry->extending_target_fsm = InvalidBlockNumber;
 		entry->extending_target_vm = InvalidBlockNumber;
+		entry->reclaim_boundary_main = 0;
+		entry->reclaim_boundary_fsm = 0;
+		entry->reclaim_boundary_vm = 0;
 		entry->in_use = false;
 		SpinLockAcquire(&MapSuperCtlData->free_list_lock);
 		entry->next_free = MapSuperCtlData->free_head;
@@ -654,6 +664,7 @@ MapSBlockRead(UmbraFileContext *map_ctx, RelFileLocator rnode, MapSuperblock *su
 				entry->page_lsn = MapSuperblockGetLastUpdatedLSN(&disk_super);
 				entry->flags = MAPSUPER_FLAG_VALID;
 				MapSuperResetReservedNextFrees(entry);
+				MapSuperResetReclaimBoundaries(entry);
 				Assert(MapNormalizeForkBlockCount(MAIN_FORKNUM,
 												  MapSuperblockGetNextFreePhysBlock(&entry->super,
 																					MAIN_FORKNUM)) <=
@@ -673,6 +684,7 @@ MapSBlockRead(UmbraFileContext *map_ctx, RelFileLocator rnode, MapSuperblock *su
 				entry->page_lsn = InvalidXLogRecPtr;
 				entry->flags = MAPSUPER_FLAG_VALID | MAPSUPER_FLAG_CORRUPT;
 				MapSuperResetReservedNextFrees(entry);
+				MapSuperResetReclaimBoundaries(entry);
 				Assert(MapNormalizeForkBlockCount(MAIN_FORKNUM,
 												  MapSuperblockGetNextFreePhysBlock(&entry->super,
 																					MAIN_FORKNUM)) <=
@@ -708,6 +720,7 @@ MapSBlockRead(UmbraFileContext *map_ctx, RelFileLocator rnode, MapSuperblock *su
 				entry->page_lsn = MapSuperblockGetLastUpdatedLSN(&disk_super);
 				entry->flags = MAPSUPER_FLAG_VALID;
 				MapSuperResetReservedNextFrees(entry);
+				MapSuperResetReclaimBoundaries(entry);
 				Assert(MapNormalizeForkBlockCount(MAIN_FORKNUM,
 												  MapSuperblockGetNextFreePhysBlock(&entry->super,
 																					MAIN_FORKNUM)) <=
@@ -727,6 +740,7 @@ MapSBlockRead(UmbraFileContext *map_ctx, RelFileLocator rnode, MapSuperblock *su
 				entry->page_lsn = InvalidXLogRecPtr;
 				entry->flags = MAPSUPER_FLAG_VALID | MAPSUPER_FLAG_CORRUPT;
 				MapSuperResetReservedNextFrees(entry);
+				MapSuperResetReclaimBoundaries(entry);
 				Assert(MapNormalizeForkBlockCount(MAIN_FORKNUM,
 												  MapSuperblockGetNextFreePhysBlock(&entry->super,
 																					MAIN_FORKNUM)) <=
@@ -910,6 +924,67 @@ MapSuperSetExtendingTarget(MapSuperEntry *entry, ForkNumber forknum,
 	}
 }

+BlockNumber
+MapSuperGetReclaimBoundary(const MapSuperEntry *entry, ForkNumber forknum)
+{
+	Assert(entry != NULL);
+
+	switch (forknum)
+	{
+		case MAIN_FORKNUM:
+			return entry->reclaim_boundary_main;
+		case FSM_FORKNUM:
+			return entry->reclaim_boundary_fsm;
+		case VISIBILITYMAP_FORKNUM:
+			return entry->reclaim_boundary_vm;
+		default:
+			elog(ERROR, "unsupported fork number for reclaim boundary: %d", forknum);
+	}
+
+	pg_unreachable();
+}
+
+static void
+MapSuperSetReclaimBoundary(MapSuperEntry *entry, ForkNumber forknum,
+						   BlockNumber boundary_pblk)
+{
+	Assert(entry != NULL);
+
+	switch (forknum)
+	{
+		case MAIN_FORKNUM:
+			entry->reclaim_boundary_main = boundary_pblk;
+			break;
+		case FSM_FORKNUM:
+			entry->reclaim_boundary_fsm = boundary_pblk;
+			break;
+		case VISIBILITYMAP_FORKNUM:
+			entry->reclaim_boundary_vm = boundary_pblk;
+			break;
+		default:
+			elog(ERROR, "unsupported fork number for reclaim boundary: %d", forknum);
+	}
+}
+
+static void
+MapSuperResetReclaimBoundaries(MapSuperEntry *entry)
+{
+	Assert(entry != NULL);
+
+	entry->reclaim_boundary_main =
+		MapNormalizeForkBlockCount(MAIN_FORKNUM,
+								   MapSuperblockGetNextFreePhysBlock(&entry->super,
+																	 MAIN_FORKNUM));
+	entry->reclaim_boundary_fsm =
+		MapNormalizeForkBlockCount(FSM_FORKNUM,
+								   MapSuperblockGetNextFreePhysBlock(&entry->super,
+																	 FSM_FORKNUM));
+	entry->reclaim_boundary_vm =
+		MapNormalizeForkBlockCount(VISIBILITYMAP_FORKNUM,
+								   MapSuperblockGetNextFreePhysBlock(&entry->super,
+																	 VISIBILITYMAP_FORKNUM));
+}
+
 static bool
 MapSuperPrepareEntryForUpdate(UmbraFileContext *map_ctx, RelFileLocator rnode,
 							  XLogRecPtr map_lsn, const char *missing_errmsg,
@@ -946,6 +1021,7 @@ MapSuperPrepareEntryForUpdate(UmbraFileContext *map_ctx, RelFileLocator rnode,
 				entry->page_lsn = MapSuperblockGetLastUpdatedLSN(&disk_super);
 				entry->flags = MAPSUPER_FLAG_VALID;
 				MapSuperResetReservedNextFrees(entry);
+				MapSuperResetReclaimBoundaries(entry);
 			}
 			else
 			{
@@ -953,6 +1029,7 @@ MapSuperPrepareEntryForUpdate(UmbraFileContext *map_ctx, RelFileLocator rnode,
 				entry->page_lsn = InvalidXLogRecPtr;
 				entry->flags = MAPSUPER_FLAG_VALID | MAPSUPER_FLAG_CORRUPT;
 				MapSuperResetReservedNextFrees(entry);
+				MapSuperResetReclaimBoundaries(entry);
 			}
 		}
 	}
@@ -974,6 +1051,7 @@ MapSuperPrepareEntryForUpdate(UmbraFileContext *map_ctx, RelFileLocator rnode,
 		MapSuperblockInit(&entry->super, 0);
 		entry->flags = MAPSUPER_FLAG_VALID;
 		MapSuperResetReservedNextFrees(entry);
+		MapSuperResetReclaimBoundaries(entry);
 	}

 	Assert(MapNormalizeForkBlockCount(MAIN_FORKNUM,
@@ -1103,6 +1181,8 @@ MapSBlockBumpPhysicalState(UmbraFileContext *map_ctx, RelFileLocator rnode,
 	{
 		MapSuperblockSetNextFreePhysBlock(&entry->super, forknum, nblocks);
 		MapSuperMaybeBumpReservedNextFree(entry, forknum, nblocks);
+		if (InRecovery)
+			MapSuperSetReclaimBoundary(entry, forknum, nblocks);
 		changed = true;
 	}
 	if (bump_capacity && current_capacity < nblocks)
@@ -1274,6 +1354,7 @@ MapSBlockInit(UmbraFileContext *map_ctx, RelFileLocator rnode, XLogRecPtr map_ls
 	MapSuperblockSetLastUpdatedLSN(&entry->super, entry->page_lsn);
 	entry->flags = MAPSUPER_FLAG_VALID | MAPSUPER_FLAG_DIRTY;
 	MapSuperResetReservedNextFrees(entry);
+	MapSuperResetReclaimBoundaries(entry);
 	Assert(MapNormalizeForkBlockCount(MAIN_FORKNUM,
 									  MapSuperblockGetNextFreePhysBlock(&entry->super,
 																		MAIN_FORKNUM)) <=
@@ -1336,6 +1417,7 @@ MapSBlockEnsureLoaded(UmbraFileContext *map_ctx, RelFileLocator rnode)
 				entry->page_lsn = MapSuperblockGetLastUpdatedLSN(&disk_super);
 				entry->flags = MAPSUPER_FLAG_VALID;
 				MapSuperResetReservedNextFrees(entry);
+				MapSuperResetReclaimBoundaries(entry);
 			}
 			else
 			{
@@ -1343,6 +1425,7 @@ MapSBlockEnsureLoaded(UmbraFileContext *map_ctx, RelFileLocator rnode)
 				entry->page_lsn = InvalidXLogRecPtr;
 				entry->flags = MAPSUPER_FLAG_VALID | MAPSUPER_FLAG_CORRUPT;
 				MapSuperResetReservedNextFrees(entry);
+				MapSuperResetReclaimBoundaries(entry);
 			}
 		}
 	}
@@ -1481,6 +1564,83 @@ MapSBlockTryGetNextFreePhysBlock(UmbraFileContext *map_ctx, RelFileLocator rnode
 	return true;
 }

+bool
+MapSBlockTryGetReclaimBoundary(UmbraFileContext *map_ctx, RelFileLocator rnode,
+							   ForkNumber forknum, BlockNumber *boundary_pblk)
+{
+	MapSuperEntry *entry;
+	uint32		flags;
+
+	Assert(boundary_pblk != NULL);
+
+	if (!MapForkHasMappedState(forknum))
+		return false;
+	if (!MapSBlockEnsureLoaded(map_ctx, rnode))
+		return false;
+	if (!MapSuperFindEntryLocked(rnode, LW_SHARED, &entry))
+		return false;
+
+	flags = entry->flags;
+	if ((flags & MAPSUPER_FLAG_CORRUPT) ||
+		!MapSuperblockHasValidIdentity(&entry->super) ||
+		((flags & MAPSUPER_FLAG_DIRTY) == 0 &&
+		 !MapSuperblockCheckCRC(&entry->super)))
+	{
+		LWLockRelease(&entry->lock);
+		if (!InRecovery)
+			MapSBlockReportCorrupt(rnode, "invalid identity or CRC");
+		return false;
+	}
+
+	*boundary_pblk = MapNormalizeForkBlockCount(forknum,
+												MapSuperGetReclaimBoundary(entry,
+																		   forknum));
+	LWLockRelease(&entry->lock);
+	return true;
+}
+
+void
+MapSBlockAdvanceReclaimBoundary(UmbraFileContext *map_ctx, RelFileLocator rnode,
+								ForkNumber forknum, BlockNumber boundary_pblk)
+{
+	MapSuperEntry *entry;
+	BlockNumber	current;
+	BlockNumber	next_free;
+
+	if (!MapForkHasMappedState(forknum))
+		return;
+	if (boundary_pblk == InvalidBlockNumber)
+		return;
+	if (!MapSBlockEnsureLoaded(map_ctx, rnode))
+		return;
+	if (!MapSuperFindEntryLocked(rnode, LW_EXCLUSIVE, &entry))
+		return;
+
+	if (!entry->in_use ||
+		(entry->flags & MAPSUPER_FLAG_VALID) == 0 ||
+		(entry->flags & MAPSUPER_FLAG_CORRUPT) != 0 ||
+		!MapSuperblockHasValidIdentity(&entry->super) ||
+		((entry->flags & MAPSUPER_FLAG_DIRTY) == 0 &&
+		 !MapSuperblockCheckCRC(&entry->super)))
+	{
+		LWLockRelease(&entry->lock);
+		return;
+	}
+
+	current = MapNormalizeForkBlockCount(forknum,
+										 MapSuperGetReclaimBoundary(entry,
+																	forknum));
+	next_free = MapNormalizeForkBlockCount(forknum,
+										   MapSuperblockGetNextFreePhysBlock(&entry->super,
+																			 forknum));
+	if (boundary_pblk > next_free)
+		boundary_pblk = next_free;
+	if (boundary_pblk > current)
+		MapSuperSetReclaimBoundary(entry, forknum, boundary_pblk);
+
+	LWLockRelease(&entry->lock);
+}
+
 void
 MapSBlockBumpLogicalNblocks(UmbraFileContext *map_ctx, RelFileLocator rnode,
 							ForkNumber forknum, BlockNumber nblocks,
@@ -1612,6 +1772,9 @@ MapSuperTableShmemInit(void)
 		entry->extending_target_main = InvalidBlockNumber;
 		entry->extending_target_fsm = InvalidBlockNumber;
 		entry->extending_target_vm = InvalidBlockNumber;
+		entry->reclaim_boundary_main = 0;
+		entry->reclaim_boundary_fsm = 0;
+		entry->reclaim_boundary_vm = 0;
 		LWLockInitialize(&entry->lock, LWTRANCHE_MAP_BUFFER_CONTENT);
 	}

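Reviewer note on MapSBlockAdvanceReclaimBoundary() in this file: as far as I can tell the boundary is meant to be monotonic and never to pass next_free. The following sketch isolates that arithmetic so the intended invariant is easy to check. It deliberately leaves out locking, superblock validation and MapNormalizeForkBlockCount(); DemoBlockNumber, DEMO_INVALID_BLOCK and demo_advance_reclaim_boundary are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for BlockNumber and InvalidBlockNumber. */
typedef uint32_t DemoBlockNumber;
#define DEMO_INVALID_BLOCK ((DemoBlockNumber) 0xFFFFFFFF)

/*
 * Return the boundary that would be stored after a request to advance it.
 * The request is ignored when invalid, clamped to next_free, and applied
 * only if it moves the boundary forward.
 */
DemoBlockNumber
demo_advance_reclaim_boundary(DemoBlockNumber current,
							  DemoBlockNumber next_free,
							  DemoBlockNumber requested)
{
	if (requested == DEMO_INVALID_BLOCK)
		return current;
	if (requested > next_free)
		requested = next_free;
	if (requested > current)
		return requested;

	return current;
}
```

For example, with current = 50 and next_free = 100, a request of 20 leaves the boundary at 50, while a request of 500 is clamped to 100.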
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 8ba29edc56..777b66ab4c 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -150,6 +150,10 @@ typedef struct f_smgr
 	void		(*smgr_invalidate_database_tablespaces) (Oid dbid,
 														 int ntablespaces,
 														 const Oid *tablespace_ids);
+	uint64		(*smgr_get_map_compactor_relocations) (void);
+	uint64		(*smgr_get_map_reclaim_enqueued) (void);
+	uint64		(*smgr_get_map_reclaim_processed) (void);
+	uint64		(*smgr_get_map_reclaim_failed) (void);
 	void		(*smgr_mark_skip_wal_pending) (SMgrRelation reln);
 	void		(*smgr_clear_skip_wal_pending) (SMgrRelation reln);
 	bool		(*smgr_prepare_pendingsync) (SMgrRelation reln);
@@ -194,6 +198,10 @@ static const f_smgr smgrsw[] = {
 		.smgr_redo_create_fork = umredocreatefork,
 		.smgr_checkpoint_database_tablespaces = umcheckpointdatabasetablespaces,
 		.smgr_invalidate_database_tablespaces = uminvalidatedatabasetablespaces,
+		.smgr_get_map_compactor_relocations = umgetmapcompactorrelocations,
+		.smgr_get_map_reclaim_enqueued = umgetmapreclaimenqueued,
+		.smgr_get_map_reclaim_processed = umgetmapreclaimprocessed,
+		.smgr_get_map_reclaim_failed = umgetmapreclaimfailed,
 		.smgr_mark_skip_wal_pending = ummarkskipwalpending,
 		.smgr_clear_skip_wal_pending = umclearskipwalpending,
 		.smgr_prepare_pendingsync = umpreparependingsync,
@@ -694,6 +702,42 @@ smgrinvalidatedatabase(Oid dbid)
 	smgrinvalidatedatabasetablespaces(dbid, 0, NULL);
 }

+uint64
+smgrgetmapcompactorrelocations(void)
+{
+	if (smgrsw[0].smgr_get_map_compactor_relocations)
+		return smgrsw[0].smgr_get_map_compactor_relocations();
+
+	return 0;
+}
+
+uint64
+smgrgetmapreclaimenqueued(void)
+{
+	if (smgrsw[0].smgr_get_map_reclaim_enqueued)
+		return smgrsw[0].smgr_get_map_reclaim_enqueued();
+
+	return 0;
+}
+
+uint64
+smgrgetmapreclaimprocessed(void)
+{
+	if (smgrsw[0].smgr_get_map_reclaim_processed)
+		return smgrsw[0].smgr_get_map_reclaim_processed();
+
+	return 0;
+}
+
+uint64
+smgrgetmapreclaimfailed(void)
+{
+	if (smgrsw[0].smgr_get_map_reclaim_failed)
+		return smgrsw[0].smgr_get_map_reclaim_failed();
+
+	return 0;
+}
+
 void
 smgrmarkskipwalpending(RelFileLocator rlocator)
 {
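Reviewer note on the four smgrgetmap*() wrappers above: each treats its f_smgr callback as optional and falls back to 0 when the slot is unset. A minimal sketch of that dispatch follows, assuming a single-entry ops table. DemoSmgrOps, demo_counter_value and demo_get_reclaim_failed are hypothetical names, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ops table with one optional statistics callback. */
typedef struct DemoSmgrOps
{
	uint64_t	(*get_reclaim_failed) (void);	/* may be NULL */
} DemoSmgrOps;

/* Stand-in for a storage manager that does implement the callback. */
uint64_t
demo_counter_value(void)
{
	return 42;
}

/* Call the callback if the storage manager provides one, else report 0. */
uint64_t
demo_get_reclaim_failed(const DemoSmgrOps *ops)
{
	if (ops->get_reclaim_failed)
		return ops->get_reclaim_failed();

	return 0;
}
```

With this shape, a build whose smgrsw[0] entry leaves the slot unset reports zeros through the SQL functions instead of failing.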
diff --git a/src/backend/storage/smgr/umbra.c b/src/backend/storage/smgr/umbra.c
index 61c74a2378..f7e3625b6e 100644
--- a/src/backend/storage/smgr/umbra.c
+++ b/src/backend/storage/smgr/umbra.c
@@ -2633,3 +2633,27 @@ umfiletagmatches(const FileTag *ftag, const FileTag *candidate)
 		ftag->forknum == candidate->forknum &&
 		ftag->segno == candidate->segno;
 }
+
+uint64
+umgetmapcompactorrelocations(void)
+{
+	return MapStatsGetCompactorRelocations();
+}
+
+uint64
+umgetmapreclaimenqueued(void)
+{
+	return MapStatsGetReclaimEnqueued();
+}
+
+uint64
+umgetmapreclaimprocessed(void)
+{
+	return MapStatsGetReclaimProcessed();
+}
+
+uint64
+umgetmapreclaimfailed(void)
+{
+	return MapStatsGetReclaimFailed();
+}
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 51ed171c33..8c533a4ea7 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -30,6 +30,8 @@
 #include "storage/latch.h"
 #include "storage/md.h"
 #ifdef USE_UMBRA
+#include "access/umbra_xlog.h"
+#include "storage/map.h"
 #include "storage/umbra.h"
 #endif
 #include "utils/hsearch.h"
@@ -69,6 +71,7 @@ typedef struct
 	FileTag		tag;			/* identifies handler and file */
 	CycleCtr	cycle_ctr;		/* checkpoint_cycle_ctr when request was made */
 	bool		canceled;		/* true if request has been canceled */
+	SyncRequestType request_type;	/* plain unlink vs reclaim unlink */
 } PendingUnlinkEntry;

 static HTAB *pendingOps = NULL;
@@ -128,6 +131,51 @@ static const SyncOps syncsw[] = {
 #endif
 };

+static inline const SyncOps *
+SyncOpsForHandler(int16 handler)
+{
+	if (handler < 0 || handler >= (int) lengthof(syncsw))
+		ereport(PANIC,
+				(errmsg("invalid sync request handler"),
+				 errdetail("handler=%d", handler),
+				 errbacktrace()));
+
+	return &syncsw[handler];
+}
+
+static inline void
+SyncPreUnlinkRequest(const PendingUnlinkEntry *entry)
+{
+#ifdef USE_UMBRA
+	if (entry->request_type == SYNC_RECLAIM_REQUEST &&
+		entry->tag.handler == SYNC_HANDLER_UMBRA)
+		log_umbra_reclaim_unlink(entry->tag.rlocator,
+								 entry->tag.forknum,
+								 (BlockNumber) entry->tag.segno);
+#else
+	(void) entry;
+#endif
+}
+
+static inline void
+SyncPostUnlinkRequest(const PendingUnlinkEntry *entry,
+					  int unlink_rc, int unlink_errno)
+{
+#ifdef USE_UMBRA
+	if (entry->request_type == SYNC_RECLAIM_REQUEST)
+	{
+		if (unlink_rc < 0 && unlink_errno != ENOENT)
+			MapStatsAddReclaimFailed(1);
+		else
+			MapStatsAddReclaimProcessed(1);
+	}
+#else
+	(void) entry;
+	(void) unlink_rc;
+	(void) unlink_errno;
+#endif
+}
+
 /*
  * Initialize data structures for the file sync tracking.
  */
@@ -220,6 +268,8 @@ SyncPostCheckpoint(void)
 	{
 		PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(lc);
 		char		path[MAXPGPATH];
+		int			unlink_rc;
+		int			unlink_errno;

 		/* Skip over any canceled entries */
 		if (entry->canceled)
@@ -238,8 +288,21 @@ SyncPostCheckpoint(void)
 			break;

 		/* Unlink the file */
-		if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
-														  path) < 0)
+		{
+			const SyncOps *ops = SyncOpsForHandler(entry->tag.handler);
+
+			if (ops->sync_unlinkfiletag == NULL)
+				ereport(PANIC,
+						(errmsg("sync unlink request uses handler without unlink function"),
+						 errdetail("handler=%d", entry->tag.handler),
+						 errbacktrace()));
+
+			SyncPreUnlinkRequest(entry);
+			unlink_rc = ops->sync_unlinkfiletag(&entry->tag, path);
+			unlink_errno = errno;
+		}
+
+		if (unlink_rc < 0)
 		{
 			/*
 			 * There's a race condition, when the database is dropped at the
@@ -248,14 +311,18 @@ SyncPostCheckpoint(void)
 			 * here. rmtree() also has to ignore ENOENT errors, to deal with
 			 * the possibility that we delete the file first.
 			 */
-			if (errno != ENOENT)
-				ereport(WARNING,
+			if (unlink_errno != ENOENT)
+			{
+				errno = unlink_errno;
+				ereport(entry->request_type == SYNC_RECLAIM_REQUEST ? DEBUG1 : WARNING,
 						(errcode_for_file_access(),
 						 errmsg("could not remove file \"%s\": %m", path)));
+			}
 		}

-		/* Mark the list entry as canceled, just in case */
-		entry->canceled = true;
+		SyncPostUnlinkRequest(entry, unlink_rc, unlink_errno);
+		/* Mark the list entry as canceled, just in case */
+		entry->canceled = true;

 		/*
 		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
@@ -516,27 +583,38 @@ RememberSyncRequest(const FileTag *ftag, SyncRequestType type)
 		HASH_SEQ_STATUS hstat;
 		PendingFsyncEntry *pfe;
 		ListCell   *cell;
+		const SyncOps *ops = SyncOpsForHandler(ftag->handler);
+
+		if (ops->sync_filetagmatches == NULL)
+			ereport(PANIC,
+					(errmsg("sync filter request uses handler without match function"),
+					 errdetail("handler=%d", ftag->handler),
+					 errbacktrace()));

 		/* Cancel matching fsync requests */
 		hash_seq_init(&hstat, pendingOps);
 		while ((pfe = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
 		{
 			if (pfe->tag.handler == ftag->handler &&
-				syncsw[ftag->handler].sync_filetagmatches(ftag, &pfe->tag))
+				ops->sync_filetagmatches(ftag, &pfe->tag))
 				pfe->canceled = true;
 		}

-		/* Cancel matching unlink requests */
+		/* Remove matching reclaim unlink requests only. */
 		foreach(cell, pendingUnlinks)
 		{
 			PendingUnlinkEntry *pue = (PendingUnlinkEntry *) lfirst(cell);

 			if (pue->tag.handler == ftag->handler &&
-				syncsw[ftag->handler].sync_filetagmatches(ftag, &pue->tag))
-				pue->canceled = true;
+				pue->request_type == SYNC_RECLAIM_REQUEST &&
+				ops->sync_filetagmatches(ftag, &pue->tag))
+			{
+				pendingUnlinks = foreach_delete_current(pendingUnlinks, cell);
+				pfree(pue);
+			}
 		}
 	}
-	else if (type == SYNC_UNLINK_REQUEST)
+	else if (type == SYNC_UNLINK_REQUEST || type == SYNC_RECLAIM_REQUEST)
 	{
 		/* Unlink request: put it in the linked list */
 		MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
@@ -546,6 +624,7 @@ RememberSyncRequest(const FileTag *ftag, SyncRequestType type)
 		entry->tag = *ftag;
 		entry->cycle_ctr = checkpoint_cycle_ctr;
 		entry->canceled = false;
+		entry->request_type = type;

 		pendingUnlinks = lappend(pendingUnlinks, entry);

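Reviewer note on SyncPostUnlinkRequest() in this file: only reclaim requests touch the counters, and an ENOENT result is counted as processed rather than failed. The sketch below restates that rule as a plain function so the classification can be checked in isolation. DemoReclaimStats and demo_account_unlink are hypothetical names, and the plain (non-atomic) counters are a simplification.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, non-atomic stand-ins for the shared reclaim counters. */
typedef struct DemoReclaimStats
{
	uint64_t	processed;
	uint64_t	failed;
} DemoReclaimStats;

/*
 * Classify one unlink result.  Plain unlink requests are not counted.
 * For reclaim requests, a missing file (ENOENT) still counts as processed,
 * since the segment is gone either way.
 */
void
demo_account_unlink(DemoReclaimStats *stats, bool is_reclaim,
					int unlink_rc, int unlink_errno)
{
	if (!is_reclaim)
		return;

	if (unlink_rc < 0 && unlink_errno != ENOENT)
		stats->failed++;
	else
		stats->processed++;
}
```

Under this rule, processed plus failed equals the number of reclaim unlinks attempted. Entries removed by a filter request before they are attempted appear in neither counter.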
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index ec5e2eabf4..7cf5598ecf 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -63,6 +63,8 @@ LOGICAL_LAUNCHER_MAIN	"Waiting in main loop of logical replication launcher proc
 LOGICAL_PARALLEL_APPLY_MAIN	"Waiting in main loop of logical replication parallel apply process."
 MAPWRITER_HIBERNATE	"Waiting in Umbra map writer process, hibernating."
 MAPWRITER_MAIN	"Waiting in main loop of Umbra map writer process."
+MAPCOMPACTOR_HIBERNATE	"Waiting in Umbra map compactor process, hibernating."
+MAPCOMPACTOR_MAIN	"Waiting in main loop of Umbra map compactor process."
 RECOVERY_WAL_STREAM	"Waiting in main loop of startup process for WAL to arrive, during streaming recovery."
 REPLICATION_SLOTSYNC_MAIN	"Waiting in main loop of slot synchronization."
 REPLICATION_SLOTSYNC_SHUTDOWN	"Waiting for slot sync worker to shut down."
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 1408de387e..ded745d995 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -26,6 +26,7 @@
 #include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "replication/logicallauncher.h"
+#include "storage/smgr.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/acl.h"
@@ -1336,6 +1337,30 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
 }

+Datum
+pg_stat_get_map_compactor_relocations(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64((int64) smgrgetmapcompactorrelocations());
+}
+
+Datum
+pg_stat_get_map_reclaim_enqueued(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64((int64) smgrgetmapreclaimenqueued());
+}
+
+Datum
+pg_stat_get_map_reclaim_processed(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64((int64) smgrgetmapreclaimprocessed());
+}
+
+Datum
+pg_stat_get_map_reclaim_failed(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64((int64) smgrgetmapreclaimfailed());
+}
+
 /*
  * When adding a new column to the pg_stat_io view and the
  * pg_stat_get_backend_io() function, add a new enum value here above
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index f3726be78d..a45e33e31d 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1932,6 +1932,41 @@
   max => 'MAX_KILOBYTES',
 },

+{ name => 'map_compactor_enable', type => 'bool', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Enable extent-level compactor planning in Umbra.',
+  variable => 'map_compactor_enable',
+  boot_val => 'false',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'map_compactor_extent_blocks', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Extent size in blocks for Umbra map compactor planning.',
+  flags => 'GUC_UNIT_BLOCKS',
+  variable => 'map_compactor_extent_blocks',
+  boot_val => '1024',
+  min => '128',
+  max => 'INT_MAX / 2',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'map_compactor_low_live_percent', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Treat an extent as sparse when live ratio is at or below this percentage.',
+  variable => 'map_compactor_low_live_percent',
+  boot_val => '10',
+  min => '1',
+  max => '100',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'map_compactor_max_moves', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Maximum relocation moves Umbra map compactor performs per round.',
+  variable => 'map_compactor_max_moves',
+  boot_val => '16',
+  min => '0',
+  max => 'INT_MAX / 2',
+  ifdef => 'USE_UMBRA',
+},
+
 { name => 'map_prealloc_fsm_batch', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
   short_desc => 'Preallocation batch size in blocks for Umbra FSM fork.',
   flags => 'GUC_UNIT_BLOCKS',
@@ -2022,6 +2057,45 @@
   ifdef => 'USE_UMBRA',
 },

+{ name => 'map_superblocks', type => 'int', context => 'PGC_POSTMASTER', group => 'RESOURCES_MEM',
+  short_desc => 'Sets the number of dedicated shared-memory slots for Umbra MAP superblocks.',
+  long_desc => 'These slots hold hot relation superblock metadata and are not managed as an LRU cache.',
+  variable => 'map_superblocks',
+  boot_val => '262144',
+  min => 'MAP_SUPERBLOCK_MIN_ENTRIES',
+  max => 'INT_MAX / 2',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'mapcompactor_busy_alloc_threshold', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Map compactor yields when recent MAP allocation pressure reaches this threshold.',
+  long_desc => '0 disables busy-yield behavior.',
+  variable => 'MapCompactorBusyAllocThreshold',
+  boot_val => '128',
+  min => '0',
+  max => 'INT_MAX / 2',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'mapcompactor_delay', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Umbra map compactor sleep time between rounds.',
+  flags => 'GUC_UNIT_MS',
+  variable => 'MapCompactorDelay',
+  boot_val => '200',
+  min => '1',
+  max => '10000',
+  ifdef => 'USE_UMBRA',
+},
+
+{ name => 'mapcompactor_max_relations', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
+  short_desc => 'Maximum number of relations scanned per Umbra map compactor round.',
+  variable => 'MapCompactorMaxRelations',
+  boot_val => '8',
+  min => '0',
+  max => 'INT_MAX',
+  ifdef => 'USE_UMBRA',
+},
+
 { name => 'mapwriter_delay', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_BGWRITER',
   short_desc => 'Umbra map writer sleep time between rounds.',
   flags => 'GUC_UNIT_MS',
@@ -2059,6 +2133,7 @@
   max => 'INT_MAX / 2',
   ifdef => 'USE_UMBRA',
 },
+
 { name => 'max_active_replication_origins', type => 'int', context => 'PGC_POSTMASTER', group => 'REPLICATION_SUBSCRIBERS',
   short_desc => 'Sets the maximum number of active replication origins.',
   variable => 'max_active_replication_origins',
diff --git a/src/include/access/umbra_xlog.h b/src/include/access/umbra_xlog.h
index 6b2408d33c..bc068597d0 100644
--- a/src/include/access/umbra_xlog.h
+++ b/src/include/access/umbra_xlog.h
@@ -8,6 +8,7 @@
  * - RANGE_REMAP: atomically establish a range of first-born mappings
  * - RANGE_REMAP_COMPACT: same semantics for contiguous lblk/pblk runs
  * - SKIP_WAL_DENSE_MAP: record non-empty skip-WAL dense lblk==pblk frontiers
+ * - RECLAIM_UNLINK: physically remove one reclaimed relation segment
  *
  *-------------------------------------------------------------------------
  */
@@ -22,6 +23,7 @@
 /* XLOG gives us high 4 bits */
 #define XLOG_UMBRA_MAP_SET			0x10
 #define XLOG_UMBRA_RANGE_REMAP		0x30
+#define XLOG_UMBRA_RECLAIM_UNLINK	0x40
 #define XLOG_UMBRA_RANGE_REMAP_COMPACT	0x50
 #define XLOG_UMBRA_SKIP_WAL_DENSE_MAP	0x60

@@ -74,6 +76,13 @@ typedef struct xl_umbra_skip_wal_dense_map
 	xl_umbra_skip_wal_dense_map_entry entries[FLEXIBLE_ARRAY_MEMBER];
 } xl_umbra_skip_wal_dense_map;

+typedef struct xl_umbra_reclaim_unlink
+{
+	RelFileLocator rlocator;
+	ForkNumber	forknum;
+	BlockNumber segno;
+} xl_umbra_reclaim_unlink;
+
 extern XLogRecPtr log_umbra_map_set(RelFileLocator rlocator, ForkNumber forknum,
 									BlockNumber lblkno, BlockNumber old_pblkno,
 									BlockNumber new_pblkno);
@@ -89,6 +98,9 @@ extern XLogRecPtr log_umbra_range_remap_compact(RelFileLocator rlocator,
 extern XLogRecPtr log_umbra_skip_wal_dense_map(RelFileLocator rlocator,
 											   uint16 count,
 											   const xl_umbra_skip_wal_dense_map_entry *entries);
+extern XLogRecPtr log_umbra_reclaim_unlink(RelFileLocator rlocator,
+										   ForkNumber forknum,
+										   BlockNumber segno);

 extern void umbra_redo(XLogReaderState *record);
 extern void umbra_desc(StringInfo buf, XLogReaderState *record);
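Reviewer note on the new XLOG_UMBRA_RECLAIM_UNLINK value: since the header comment says XLOG gives the rmgr the high 4 bits of the info byte, the new 0x40 code sits between RANGE_REMAP (0x30) and RANGE_REMAP_COMPACT (0x50) without colliding. The sketch below shows how a desc-style routine could decode those codes. It is an illustration only: the DEMO_* constants copy the values from this header, DEMO_XLR_INFO_MASK mirrors PostgreSQL's 0x0F low-bits mask as an assumption, and demo_umbra_record_name is a hypothetical function, not the patch's umbra_desc().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed to mirror XLR_INFO_MASK: low 4 bits are reserved for XLOG. */
#define DEMO_XLR_INFO_MASK				0x0F

/* Values copied from umbra_xlog.h as shown in this patch. */
#define DEMO_XLOG_UMBRA_MAP_SET				0x10
#define DEMO_XLOG_UMBRA_RANGE_REMAP			0x30
#define DEMO_XLOG_UMBRA_RECLAIM_UNLINK		0x40
#define DEMO_XLOG_UMBRA_RANGE_REMAP_COMPACT	0x50
#define DEMO_XLOG_UMBRA_SKIP_WAL_DENSE_MAP	0x60

/* Map an info byte to a record name, ignoring the low 4 bits. */
const char *
demo_umbra_record_name(uint8_t info)
{
	switch ((uint8_t) (info & ~DEMO_XLR_INFO_MASK))
	{
		case DEMO_XLOG_UMBRA_MAP_SET:
			return "MAP_SET";
		case DEMO_XLOG_UMBRA_RANGE_REMAP:
			return "RANGE_REMAP";
		case DEMO_XLOG_UMBRA_RECLAIM_UNLINK:
			return "RECLAIM_UNLINK";
		case DEMO_XLOG_UMBRA_RANGE_REMAP_COMPACT:
			return "RANGE_REMAP_COMPACT";
		case DEMO_XLOG_UMBRA_SKIP_WAL_DENSE_MAP:
			return "SKIP_WAL_DENSE_MAP";
		default:
			return "UNKNOWN";
	}
}
```

Codes not defined in this header, such as 0x20, fall through to "UNKNOWN" in the sketch.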
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 99fa9a6ede..f11dd11c38 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6051,6 +6051,26 @@
 { oid => '2859', descr => 'statistics: number of buffer allocations',
   proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
   prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '9781',
+  descr => 'statistics: number of map compactor block relocations',
+  proname => 'pg_stat_get_map_compactor_relocations', provolatile => 's',
+  proparallel => 'r', prorettype => 'int8', proargtypes => '',
+  prosrc => 'pg_stat_get_map_compactor_relocations' },
+{ oid => '9785',
+  descr => 'statistics: number of map reclaim entries enqueued',
+  proname => 'pg_stat_get_map_reclaim_enqueued', provolatile => 's',
+  proparallel => 'r', prorettype => 'int8', proargtypes => '',
+  prosrc => 'pg_stat_get_map_reclaim_enqueued' },
+{ oid => '9782',
+  descr => 'statistics: number of map reclaim entries processed',
+  proname => 'pg_stat_get_map_reclaim_processed', provolatile => 's',
+  proparallel => 'r', prorettype => 'int8', proargtypes => '',
+  prosrc => 'pg_stat_get_map_reclaim_processed' },
+{ oid => '9784',
+  descr => 'statistics: number of map reclaim attempts failed',
+  proname => 'pg_stat_get_map_reclaim_failed', provolatile => 's',
+  proparallel => 'r', prorettype => 'int8', proargtypes => '',
+  prosrc => 'pg_stat_get_map_reclaim_failed' },

 { oid => '6214', descr => 'statistics: per backend type IO statistics',
   proname => 'pg_stat_get_io', prorows => '30', proretset => 't',
diff --git a/src/include/postmaster/mapwriter.h b/src/include/postmaster/mapwriter.h
index 6c984922b0..6b5dae25f5 100644
--- a/src/include/postmaster/mapwriter.h
+++ b/src/include/postmaster/mapwriter.h
@@ -17,8 +17,12 @@ extern PGDLLIMPORT int MapWriterDelay;
 extern PGDLLIMPORT int MapWriterMaxPages;
 extern PGDLLIMPORT int MapWriterPreallocMaxRelations;
 extern PGDLLIMPORT double MapWriterLRUMultiplier;
+extern PGDLLIMPORT int MapCompactorDelay;
+extern PGDLLIMPORT int MapCompactorMaxRelations;
+extern PGDLLIMPORT int MapCompactorBusyAllocThreshold;

 extern void MapBackgroundWorkersRegister(void);
 extern void MapWriterMain(Datum arg);
+extern void MapCompactorMain(Datum arg);

 #endif							/* MAPWRITER_H */
diff --git a/src/include/storage/map.h b/src/include/storage/map.h
index c61414fd16..37b721bf83 100644
--- a/src/include/storage/map.h
+++ b/src/include/storage/map.h
@@ -89,10 +89,15 @@ typedef struct MapSharedData
 	slock_t		clock_lock;
 	int			first_free_buffer;	/* head of free list, -1 if empty */
 	int			mapwriter_procno;	/* procno to wake, -1 if none */
+	int			mapcompactor_procno;	/* procno to wake, -1 if none */

 	/* statistics */
 	pg_atomic_uint32 num_allocs;
 	uint32		complete_passes;
+	pg_atomic_uint64 map_compactor_relocations;
+	pg_atomic_uint64 map_reclaim_enqueued;
+	pg_atomic_uint64 map_reclaim_processed;
+	pg_atomic_uint64 map_reclaim_failed;

 	/* configuration */
 	int			num_slots;
@@ -130,6 +135,14 @@ typedef struct MapInflightBarrier

 extern void MapBackendInit(void);
 extern const ShmemCallbacks MapShmemCallbacks;
+extern void MapStatsAddCompactorRelocations(uint64 count);
+extern void MapStatsAddReclaimEnqueued(uint64 count);
+extern void MapStatsAddReclaimProcessed(uint64 count);
+extern void MapStatsAddReclaimFailed(uint64 count);
+extern uint64 MapStatsGetCompactorRelocations(void);
+extern uint64 MapStatsGetReclaimEnqueued(void);
+extern uint64 MapStatsGetReclaimProcessed(void);
+extern uint64 MapStatsGetReclaimFailed(void);

 /* Lookup/modification */
 extern bool MapTryLookup(UmbraFileContext *map_ctx, RelFileLocator rnode,
@@ -262,7 +275,10 @@ extern int	MapSyncStart(uint32 *complete_passes, uint32 *num_allocs);
 extern uint32 MapAllocPressurePeek(void);
 extern void MapStrategyNotifyWriter(int mapwriter_procno);
 extern void MapWakeWriter(void);
+extern void MapStrategyNotifyCompactor(int mapcompactor_procno);
+extern void MapWakeCompactor(void);
 extern int	MapPreallocStep(int max_relations);
+extern int	MapCompactorStep(int max_relations);

 /* Map cache hash table (in mapclock.c) */
 extern int	MapCacheLookup(RelFileLocator rnode, ForkNumber forknum,
@@ -291,6 +307,10 @@ extern int	map_prealloc_fsm_batch;
 extern int	map_prealloc_vm_low;
 extern int	map_prealloc_vm_hard;
 extern int	map_prealloc_vm_batch;
+extern bool map_compactor_enable;
+extern int	map_compactor_extent_blocks;
+extern int	map_compactor_low_live_percent;
+extern int	map_compactor_max_moves;

 /* Global data (defined in map.c) */
 extern MapSharedData *MapShared;
diff --git a/src/include/storage/map_internal.h b/src/include/storage/map_internal.h
index 368b3da15a..491282c7b2 100644
--- a/src/include/storage/map_internal.h
+++ b/src/include/storage/map_internal.h
@@ -43,6 +43,7 @@ extern bool MapMaybePreallocateFork(UmbraFileContext *map_ctx,
 									RelFileLocator rnode,
 									ForkNumber forknum,
 									bool background_mode);
+extern void MapReclaimForgetRelation(RelFileLocator rnode);
 extern bool MapInflightTryClaim(UmbraFileContext *map_ctx,
 								RelFileLocator rnode,
 								ForkNumber forknum,
diff --git a/src/include/storage/mapsuper_internal.h b/src/include/storage/mapsuper_internal.h
index 5d64ddec87..fadbb7f269 100644
--- a/src/include/storage/mapsuper_internal.h
+++ b/src/include/storage/mapsuper_internal.h
@@ -42,6 +42,9 @@ typedef struct MapSuperEntry
 	BlockNumber	extending_target_main;
 	BlockNumber	extending_target_fsm;
 	BlockNumber	extending_target_vm;
+	BlockNumber	reclaim_boundary_main;
+	BlockNumber	reclaim_boundary_fsm;
+	BlockNumber	reclaim_boundary_vm;
 	int			next_free;
 	bool		in_use;
 	LWLock		lock;
@@ -147,6 +150,8 @@ extern void MapSuperDeleteEntry(RelFileLocator rnode);
 extern bool MapSuperForkExists(const MapSuperblock *super,
 							   ForkNumber forknum);
 extern uint32 MapSuperPreallocFlag(ForkNumber forknum);
+extern BlockNumber MapSuperGetReclaimBoundary(const MapSuperEntry *entry,
+											  ForkNumber forknum);
 extern void MapSBlockBumpPhysicalState(UmbraFileContext *map_ctx,
 									   RelFileLocator rnode,
 									   ForkNumber forknum,
@@ -154,6 +159,14 @@ extern void MapSBlockBumpPhysicalState(UmbraFileContext *map_ctx,
 									   bool bump_next_free,
 									   bool bump_capacity,
 									   XLogRecPtr map_lsn);
+extern bool MapSBlockTryGetReclaimBoundary(UmbraFileContext *map_ctx,
+										   RelFileLocator rnode,
+										   ForkNumber forknum,
+										   BlockNumber *boundary_pblk);
+extern void MapSBlockAdvanceReclaimBoundary(UmbraFileContext *map_ctx,
+											RelFileLocator rnode,
+											ForkNumber forknum,
+											BlockNumber boundary_pblk);
 extern void MapSuperTableShmemRequest(void);
 extern void MapSuperTableShmemInit(void);
 extern void MapSuperTableShmemAttach(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index b7f95ed5d3..a946baaa0e 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -133,6 +133,10 @@ extern void smgrcheckpointdatabasetablespaces(Oid dbid, int ntablespaces,
 extern void smgrinvalidatedatabasetablespaces(Oid dbid, int ntablespaces,
 											  const Oid *tablespace_ids);
 extern void smgrinvalidatedatabase(Oid dbid);
+extern uint64 smgrgetmapcompactorrelocations(void);
+extern uint64 smgrgetmapreclaimenqueued(void);
+extern uint64 smgrgetmapreclaimprocessed(void);
+extern uint64 smgrgetmapreclaimfailed(void);
 extern void smgrregistershutdowncleanup(void);
 extern void smgrmarkskipwalpending(RelFileLocator rlocator);
 extern void smgrclearskipwalpending(RelFileLocator rlocator);
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index 559a8eea6c..f051b15748 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -24,6 +24,7 @@ typedef enum SyncRequestType
 {
 	SYNC_REQUEST,				/* schedule a call of sync function */
 	SYNC_UNLINK_REQUEST,		/* schedule a call of unlink function */
+	SYNC_RECLAIM_REQUEST,		/* schedule internal reclaim unlink */
 	SYNC_FORGET_REQUEST,		/* forget all calls for a tag */
 	SYNC_FILTER_REQUEST,		/* forget all calls satisfying match fn */
 } SyncRequestType;
@@ -39,9 +40,7 @@ typedef enum SyncRequestHandler
 	SYNC_HANDLER_COMMIT_TS,
 	SYNC_HANDLER_MULTIXACT_OFFSET,
 	SYNC_HANDLER_MULTIXACT_MEMBER,
-#ifdef USE_UMBRA
 	SYNC_HANDLER_UMBRA,
-#endif
 	SYNC_HANDLER_NONE,
 } SyncRequestHandler;

diff --git a/src/include/storage/umbra.h b/src/include/storage/umbra.h
index 0702f7b392..0cc6388b26 100644
--- a/src/include/storage/umbra.h
+++ b/src/include/storage/umbra.h
@@ -136,6 +136,10 @@ extern int umfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uin
 extern int umsyncfiletag(const FileTag *ftag, char *path);
 extern int umunlinkfiletag(const FileTag *ftag, char *path);
 extern bool umfiletagmatches(const FileTag *ftag, const FileTag *candidate);
+extern uint64 umgetmapcompactorrelocations(void);
+extern uint64 umgetmapreclaimenqueued(void);
+extern uint64 umgetmapreclaimprocessed(void);
+extern uint64 umgetmapreclaimfailed(void);

 /*
  * Runtime semantic helpers.
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index da020abc31..9d94dd548a 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -67,9 +67,13 @@ tests += {
       't/056_umbra_truncate_superblock.pl',
       't/057_umbra_remap_crash_consistency.pl',
       't/058_umbra_2pc_remap_recovery.pl',
+      't/059_umbra_compactor_relocation.pl',
+      't/060_umbra_reclaim_checkpoint_counters.pl',
       't/061_umbra_fsm_vm_map_translation.pl',
       't/062_umbra_truncate_drop_crash_matrix.pl',
       't/063_umbra_mainfork_head_unlink_checkpoint.pl',
+      't/064_umbra_mainfork_internal_reclaim_seg0.pl',
+      't/065_umbra_mainfork_middle_reclaim_keep_seg0.pl',
       't/066_umbra_truncate_redo.pl',
       't/067_umbra_remap_redo.pl',
       't/068_umbra_old_baseline_checkpoint_window.pl',
diff --git a/src/test/recovery/t/059_umbra_compactor_relocation.pl b/src/test/recovery/t/059_umbra_compactor_relocation.pl
new file mode 100644
index 0000000000..03e3a0a643
--- /dev/null
+++ b/src/test/recovery/t/059_umbra_compactor_relocation.pl
@@ -0,0 +1,91 @@
+# Verify map compactor relocation survives crash restart.
+#
+# This is UMBRA-specific and skipped in md mode.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+my $node = PostgreSQL::Test::Cluster->new('master');
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq{
+autovacuum = off
+full_page_writes = on
+log_min_messages = debug1
+map_superblocks = 50000
+map_compactor_enable = off
+map_compactor_extent_blocks = 128
+map_compactor_low_live_percent = 100
+map_compactor_max_moves = 64
+mapcompactor_delay = 10ms
+mapcompactor_max_relations = 64
+mapcompactor_busy_alloc_threshold = 0
+});
+$node->start();
+
+$node->safe_psql('postgres',
+	q{CREATE TABLE umb_compact_t(id int PRIMARY KEY, payload text);});
+
+$node->safe_psql(
+	'postgres', q{
+INSERT INTO umb_compact_t
+SELECT g, repeat('x', 700) FROM generate_series(1, 25000) g;
+CHECKPOINT;
+UPDATE umb_compact_t
+SET payload = md5(id::text) || repeat('u', 668)
+WHERE id % 2 = 0;
+});
+
+$node->safe_psql(
+	'postgres', q{
+ALTER SYSTEM SET map_compactor_enable = on;
+SELECT pg_reload_conf();
+SELECT pg_sleep(1.0);
+});
+
+ok($node->poll_query_until(
+		'postgres',
+		q{SELECT pg_stat_get_map_compactor_relocations() > 0;}),
+	'map compactor relocation stats become visible');
+
+my $before = $node->safe_psql(
+	'postgres', q{
+SELECT count(*) || ',' ||
+	   sum(length(payload))::bigint || ',' ||
+	   sum(id)::bigint
+FROM umb_compact_t;
+});
+
+$node->stop('immediate');
+$node->start();
+
+my $after = $node->safe_psql(
+	'postgres', q{
+SELECT count(*) || ',' ||
+	   sum(length(payload))::bigint || ',' ||
+	   sum(id)::bigint
+FROM umb_compact_t;
+});
+
+is($after, $before, 'aggregate state preserved after compactor + crash restart');
+
+my $idx_count = $node->safe_psql(
+	'postgres', q{
+SET enable_seqscan = off;
+SELECT count(*) FROM umb_compact_t WHERE id BETWEEN 100 AND 24000;
+});
+my $seq_count = $node->safe_psql(
+	'postgres', q{
+SET enable_indexscan = off;
+SET enable_bitmapscan = off;
+SELECT count(*) FROM umb_compact_t WHERE id BETWEEN 100 AND 24000;
+});
+is($idx_count, $seq_count, 'index path and seq path return same rowcount');
+
+done_testing();
diff --git a/src/test/recovery/t/060_umbra_reclaim_checkpoint_counters.pl b/src/test/recovery/t/060_umbra_reclaim_checkpoint_counters.pl
new file mode 100644
index 0000000000..5f1e3730f0
--- /dev/null
+++ b/src/test/recovery/t/060_umbra_reclaim_checkpoint_counters.pl
@@ -0,0 +1,82 @@
+# Verify reclaim counters remain sane across checkpoint when punch is disabled.
+#
+# This is UMBRA-specific and skipped in md mode.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+my $node = PostgreSQL::Test::Cluster->new('master');
+$node->init();
+$node->append_conf(
+	'postgresql.conf', qq{
+autovacuum = off
+full_page_writes = on
+checkpoint_timeout = '30min'
+max_wal_size = '4GB'
+log_min_messages = debug1
+map_superblocks = 50000
+map_compactor_enable = off
+map_compactor_extent_blocks = 128
+map_compactor_low_live_percent = 100
+map_compactor_max_moves = 4096
+mapcompactor_delay = 20ms
+mapcompactor_max_relations = 128
+mapcompactor_busy_alloc_threshold = 0
+});
+$node->start();
+
+$node->safe_psql('postgres',
+	q{CREATE TABLE umb_reclaim_t(id int PRIMARY KEY, payload text);});
+
+$node->safe_psql(
+	'postgres', q{
+INSERT INTO umb_reclaim_t
+SELECT g, repeat('x', 700) FROM generate_series(1, 30000) g;
+CHECKPOINT;
+UPDATE umb_reclaim_t
+SET payload = md5(id::text) || repeat('u', 668)
+WHERE id % 2 = 0;
+});
+
+$node->safe_psql(
+	'postgres', q{
+ALTER SYSTEM SET map_compactor_enable = on;
+SELECT pg_reload_conf();
+SELECT pg_sleep(1.5);
+});
+
+ok($node->poll_query_until(
+		'postgres',
+		q{SELECT pg_stat_get_map_compactor_relocations() > 0;}),
+	'map compactor produced relocations');
+
+my ($processed_before, $failed_before) = split(/\|/, $node->safe_psql(
+	'postgres',
+	q{SELECT pg_stat_get_map_reclaim_processed(),
+	          pg_stat_get_map_reclaim_failed();}));
+my $attempt_before = $processed_before + $failed_before;
+
+$node->safe_psql('postgres', q{CHECKPOINT;});
+
+ok($node->safe_psql(
+		'postgres',
+		"SELECT (pg_stat_get_map_reclaim_processed() + pg_stat_get_map_reclaim_failed()) >= $attempt_before;") eq 't',
+	'reclaim counters remain monotonic after checkpoint');
+
+my ($processed_after, $failed_after) = split(/\|/, $node->safe_psql(
+	'postgres',
+	q{SELECT pg_stat_get_map_reclaim_processed(),
+	          pg_stat_get_map_reclaim_failed();}));
+cmp_ok($processed_after + $failed_after, '>=', $attempt_before,
+	'reclaim attempt counters remain monotonic');
+
+is($node->safe_psql('postgres', q{SELECT count(*) FROM umb_reclaim_t;}),
+   '30000', 'table remains readable after checkpoint');
+
+done_testing();
diff --git a/src/test/recovery/t/064_umbra_mainfork_internal_reclaim_seg0.pl b/src/test/recovery/t/064_umbra_mainfork_internal_reclaim_seg0.pl
new file mode 100644
index 0000000000..d5bc351e10
--- /dev/null
+++ b/src/test/recovery/t/064_umbra_mainfork_internal_reclaim_seg0.pl
@@ -0,0 +1,283 @@
+# Verify UMBRA internal reclaim can physically remove MAIN seg0.
+#
+# This test targets internal physical deletion (reclaim), not DROP/TRUNCATE:
+# - keep the relation alive
+# - churn only a tail key range until MAIN seg1 exists
+# - keep a large untouched prefix so seg0 must still have live mappings
+# - enable compactor and use checkpoint rounds to flush map state to disk
+# - verify seg0 disappears while seg1 still exists
+#
+# In md mode, skip this test.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+$PostgreSQL::Test::Utils::timeout_default = 600;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+my $node = PostgreSQL::Test::Cluster->new('master');
+$node->init(allows_streaming => 1);
+$node->append_conf(
+	'postgresql.conf', qq{
+autovacuum = off
+full_page_writes = on
+checkpoint_timeout = '30min'
+max_wal_size = '4GB'
+log_min_messages = debug1
+map_superblocks = 50000
+map_compactor_enable = off
+map_compactor_extent_blocks = 131072
+map_compactor_low_live_percent = 100
+map_compactor_max_moves = 2000000
+mapcompactor_delay = 10ms
+mapcompactor_max_relations = 50000
+mapcompactor_busy_alloc_threshold = 0
+});
+$node->start();
+
+my $map_mode = $node->safe_psql(
+	'postgres', q{
+CREATE OR REPLACE FUNCTION umb_count_mapped_in_seg(rel regclass, segno int)
+RETURNS bigint
+LANGUAGE plpgsql
+AS $$
+DECLARE
+	path text;
+	bs int;
+	seg_blocks bigint;
+	nblocks bigint;
+	lblk bigint;
+	fork_page_idx bigint;
+	group_no bigint;
+	page_no bigint;
+	cur_page_no bigint := -1;
+	page bytea;
+	off int;
+	pblk bigint;
+	seg_lo bigint;
+	seg_hi bigint;
+	cnt bigint := 0;
+BEGIN
+	path := pg_relation_filepath(rel) || '_map';
+	bs := current_setting('block_size')::int;
+	seg_blocks := (1024::bigint * 1024 * 1024) / bs;
+	seg_lo := segno::bigint * seg_blocks;
+	seg_hi := seg_lo + seg_blocks;
+	nblocks := pg_relation_size(rel) / bs;
+
+	FOR lblk IN 0 .. (nblocks - 1) LOOP
+		fork_page_idx := lblk / 2048;
+		group_no := fork_page_idx / 8192;
+		page_no := 3 + group_no * 8194 + (fork_page_idx % 8192);
+		IF page_no <> cur_page_no THEN
+			page := pg_read_binary_file(path, page_no * bs, bs, true);
+			cur_page_no := page_no;
+		END IF;
+
+		off := ((lblk % 2048) * 4)::int;
+		IF page IS NULL OR length(page) < off + 4 THEN
+			CONTINUE;
+		END IF;
+		pblk := get_byte(page, off)::bigint
+			  + (get_byte(page, off + 1)::bigint << 8)
+			  + (get_byte(page, off + 2)::bigint << 16)
+			  + (get_byte(page, off + 3)::bigint << 24);
+
+		IF pblk <> 4294967295::bigint AND pblk >= seg_lo AND pblk < seg_hi THEN
+			cnt := cnt + 1;
+		END IF;
+	END LOOP;
+
+	RETURN cnt;
+END;
+$$;
+
+CREATE TABLE umb_reclaim_seg0_t(id int PRIMARY KEY, payload text);
+INSERT INTO umb_reclaim_seg0_t
+SELECT g, repeat('x', 2000) FROM generate_series(1, 40000) g;
+SELECT COALESCE(encode(pg_read_binary_file(pg_relation_filepath('umb_reclaim_seg0_t') || '_map', 0, 1, true), 'hex'), '') <> '';
+});
+
+my $main_path = $node->safe_psql(
+	'postgres',
+	q{SELECT pg_relation_filepath('umb_reclaim_seg0_t');}
+);
+
+$node->backup('bkp_reclaim_seg0');
+my $standby = PostgreSQL::Test::Cluster->new('standby');
+$standby->init_from_backup($node, 'bkp_reclaim_seg0', has_streaming => 1);
+$standby->start;
+is($standby->safe_psql('postgres',
+		q{SELECT pg_relation_filepath('umb_reclaim_seg0_t');}),
+	$main_path,
+	'primary and standby see same relpath');
+
+my $logical_blocks = $node->safe_psql(
+	'postgres',
+	q{SELECT pg_relation_size('umb_reclaim_seg0_t') / current_setting('block_size')::int;}
+);
+cmp_ok($logical_blocks, '>', 0, 'table has non-zero logical blocks');
+cmp_ok($logical_blocks, '<', 131072,
+	'table initially fits in MAIN seg0 before churn');
+
+my $seg1_size = $node->safe_psql('postgres', "SELECT COALESCE((pg_stat_file('$main_path.1', true)).size, -1);");
+is($seg1_size, -1,
+	'MAIN seg1 does not exist before boundary-crossing partial churn');
+
+my $mapped_seg0 = $node->safe_psql(
+	'postgres',
+	q{SELECT umb_count_mapped_in_seg('umb_reclaim_seg0_t'::regclass, 0);}
+);
+cmp_ok($mapped_seg0, '>', 0,
+	'current mappings initially reside in seg0 before final crossing phase');
+
+# Drive the physical frontier to seg1 by repeatedly updating only the tail
+# portion of the table. The untouched prefix should keep seg0 live, while the
+# churned tail should eventually push new mappings into seg1.
+my $cross_rounds = 0;
+while ($seg1_size < 16 * 1024 * 1024)
+{
+	$cross_rounds++;
+	die "tail churn did not push MAIN seg1 past prealloc-only size\n"
+		if $cross_rounds > 64;
+
+	$node->safe_psql(
+		'postgres',
+		"UPDATE umb_reclaim_seg0_t " .
+		"SET payload = md5((id + ($cross_rounds * 1000000))::text) || repeat('u', 1968) " .
+		"WHERE id > 20000;");
+
+	$seg1_size = $node->safe_psql(
+		'postgres',
+		"SELECT COALESCE((pg_stat_file('$main_path.1', true)).size, -1);");
+}
+
+cmp_ok($seg1_size, '>=', 16 * 1024 * 1024,
+	'MAIN seg1 grew beyond prealloc-only tail after controlled churn');
+
+# The physical frontier can cross into seg1 before any *current* mapping is
+# redirected there. Keep churning the tail until the relation's logical size
+# itself also extends beyond seg0, guaranteeing that some live mappings now
+# reside in seg1.
+my $post_cross_logical = $node->safe_psql(
+	'postgres',
+	q{SELECT pg_relation_size('umb_reclaim_seg0_t') / current_setting('block_size')::int;}
+);
+my $post_cross_round = 0;
+while ($post_cross_logical <= 140000)
+{
+	$post_cross_round++;
+	die "post-cross tail churn did not push logical blocks into seg1\n"
+		if $post_cross_round > 24;
+
+	$node->safe_psql(
+		'postgres',
+		"UPDATE umb_reclaim_seg0_t " .
+		"SET payload = md5((id + (9000000 + $post_cross_round * 1000000))::text) || repeat('v', 1968) " .
+		"WHERE id > 20000;");
+	$post_cross_logical = $node->safe_psql(
+		'postgres',
+		q{SELECT pg_relation_size('umb_reclaim_seg0_t') / current_setting('block_size')::int;}
+	);
+}
+cmp_ok($post_cross_logical, '>', 131072,
+	'logical relation size extends into seg1 before compactor reclaim');
+
+is($node->safe_psql(
+		'postgres',
+		"SELECT COALESCE((pg_stat_file('$main_path', true)).size, -1) >= 0;"),
+	't',
+	'MAIN seg0 still exists before compactor reclaim');
+
+is($node->safe_psql(
+		'postgres',
+		q{SELECT count(*) FROM umb_reclaim_seg0_t WHERE id <= 20000;}),
+	'20000',
+	'untouched prefix remains visible before compactor reclaim');
+
+my $reloc_before = $node->safe_psql(
+	'postgres',
+	q{SELECT pg_stat_get_map_compactor_relocations();}
+);
+my $reclaim_processed_before = $node->safe_psql(
+	'postgres',
+	q{SELECT pg_stat_get_map_reclaim_processed();}
+);
+my $reclaim_enqueued_before = $node->safe_psql(
+	'postgres',
+	q{SELECT pg_stat_get_map_reclaim_enqueued();}
+);
+
+$node->safe_psql(
+	'postgres', q{
+ALTER SYSTEM SET map_compactor_enable = on;
+SELECT pg_reload_conf();
+});
+
+my $removed_on_primary = 'f';
+my $reclaim_processed = 'f';
+my $reloc_advanced = 'f';
+my $reclaim_enqueued = 'f';
+for my $round (1 .. 60)
+{
+	$node->safe_psql('postgres', q{CHECKPOINT; SELECT pg_sleep(0.5);});
+	$reloc_advanced = $node->safe_psql(
+		'postgres',
+		"SELECT pg_stat_get_map_compactor_relocations() > $reloc_before;");
+	last if $reloc_advanced eq 't';
+}
+is($reloc_advanced, 't',
+	'compactor relocation counter advanced before internal reclaim completed');
+for my $round (1 .. 120)
+{
+	$node->safe_psql('postgres', q{CHECKPOINT; SELECT pg_sleep(0.5);});
+	$reclaim_enqueued = $node->safe_psql(
+		'postgres',
+		"SELECT pg_stat_get_map_reclaim_enqueued() > $reclaim_enqueued_before;");
+	last if $reclaim_enqueued eq 't';
+}
+
+for my $round (1 .. 20)
+{
+	$node->safe_psql('postgres', q{CHECKPOINT; SELECT pg_sleep(1.0);});
+	$reclaim_processed = $node->safe_psql(
+		'postgres',
+		"SELECT pg_stat_get_map_reclaim_processed() > $reclaim_processed_before;");
+	$removed_on_primary = $node->safe_psql(
+		'postgres',
+		"SELECT COALESCE((pg_stat_file('$main_path', true)).size, -1) = -1;");
+	last if $reclaim_processed eq 't' && $removed_on_primary eq 't';
+}
+is($reclaim_processed, 't',
+	'internal reclaim processed seg0 after compactor relocation');
+is($removed_on_primary, 't',
+	'MAIN seg0 is physically removed by internal reclaim after checkpoint rounds');
+$reclaim_enqueued = $node->safe_psql(
+	'postgres',
+	"SELECT pg_stat_get_map_reclaim_enqueued() > $reclaim_enqueued_before;");
+is($reclaim_enqueued, 't',
+	'internal reclaim was enqueued before seg0 was physically removed');
+
+my $until_lsn = $node->safe_psql('postgres', q{SELECT pg_current_wal_lsn();});
+ok($standby->poll_query_until(
+		'postgres',
+		"SELECT '$until_lsn'::pg_lsn <= pg_last_wal_replay_lsn();"),
+	'standby caught up to reclaim WAL');
+ok($standby->poll_query_until(
+		'postgres',
+		"SELECT COALESCE((pg_stat_file('$main_path', true)).size, -1) = -1;"),
+	'standby MAIN seg0 is physically removed after replay');
+
+is($node->safe_psql('postgres', q{SELECT count(*) FROM umb_reclaim_seg0_t;}),
+	'40000',
+	'relation remains readable after internal seg0 reclaim');
+is($standby->safe_psql('postgres', q{SELECT count(*) FROM umb_reclaim_seg0_t;}),
+	'40000',
+	'standby relation remains readable after replay');
+
+done_testing();
diff --git a/src/test/recovery/t/065_umbra_mainfork_middle_reclaim_keep_seg0.pl b/src/test/recovery/t/065_umbra_mainfork_middle_reclaim_keep_seg0.pl
new file mode 100644
index 0000000000..e863acd958
--- /dev/null
+++ b/src/test/recovery/t/065_umbra_mainfork_middle_reclaim_keep_seg0.pl
@@ -0,0 +1,356 @@
+# Verify UMBRA internal reclaim can remove a middle MAIN segment while seg0
+# remains present.
+#
+# Target behavior:
+# - relation stays alive
+# - middle segment (.1) is physically removed by internal reclaim after checkpoint
+# - seg0 file remains present
+# - after reclaim, writes on logical seg1 range + checkpoints keep primary/standby consistent
+#
+# In md mode, skip this test.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+$PostgreSQL::Test::Utils::timeout_default = 600;
+
+plan skip_all => 'requires --with-umbra MAP fork'
+	unless check_pg_config('^#define USE_UMBRA 1$');
+
+sub seg_path
+{
+	my ($main_path, $segno) = @_;
+
+	return $segno == 0 ? $main_path : "$main_path.$segno";
+}
+
+sub highest_existing_segno
+{
+	my ($node, $main_path, $max_segno) = @_;
+	my $highest = 0;
+
+	for my $segno (0 .. $max_segno)
+	{
+		my $exists = $node->safe_psql(
+			'postgres',
+			"SELECT COALESCE((pg_stat_file('" . seg_path($main_path, $segno) .
+			"', true)).size, -1) >= 0;");
+		$highest = $segno if $exists eq 't';
+	}
+
+	return $highest;
+}
+
+sub ctid_id_band_for_segno
+{
+	my ($node, $relname, $segno) = @_;
+	my $sql = qq{
+WITH params AS (
+	SELECT (1024::bigint * 1024 * 1024) /
+		   current_setting('block_size')::int AS seg_blocks
+),
+seg AS (
+	SELECT min(id) AS min_id,
+		   max(id) AS max_id,
+		   count(*) AS cnt
+	FROM $relname, params
+	WHERE split_part(trim(both '()' from ctid::text), ',', 1)::bigint >=
+		  $segno * seg_blocks
+	  AND split_part(trim(both '()' from ctid::text), ',', 1)::bigint <
+		  ($segno + 1) * seg_blocks
+)
+SELECT COALESCE(min_id::text, '') || '|' ||
+	   COALESCE(max_id::text, '') || '|' ||
+	   cnt::text
+FROM seg;
+};
+	my ($min_id, $max_id, $cnt) =
+	  split(/\|/, $node->safe_psql('postgres', $sql));
+
+	return ($min_id, $max_id, $cnt);
+}
+
+my $node = PostgreSQL::Test::Cluster->new('master');
+$node->init(allows_streaming => 1);
+$node->append_conf(
+	'postgresql.conf', qq{
+autovacuum = off
+full_page_writes = on
+checkpoint_timeout = '1d'
+max_wal_size = '64GB'
+wal_keep_size = '1GB'
+log_min_messages = debug1
+map_superblocks = 50000
+map_compactor_enable = off
+map_compactor_extent_blocks = 131072
+map_compactor_low_live_percent = 60
+map_compactor_max_moves = 2000000
+mapcompactor_delay = 10ms
+mapcompactor_max_relations = 256
+mapcompactor_busy_alloc_threshold = 0
+});
+$node->start();
+
+my $map_mode = $node->safe_psql(
+	'postgres', q{
+CREATE TABLE umb_reclaim_mid_t(id int PRIMARY KEY, payload text);
+ALTER TABLE umb_reclaim_mid_t ALTER COLUMN payload SET STORAGE PLAIN;
+INSERT INTO umb_reclaim_mid_t
+SELECT g, repeat('x', 7000) FROM generate_series(1, 300000) g;
+SELECT COALESCE(encode(pg_read_binary_file(pg_relation_filepath('umb_reclaim_mid_t') || '_map', 0, 1, true), 'hex'), '') <> '';
+});
+
+my $main_path = $node->safe_psql(
+	'postgres',
+	q{SELECT pg_relation_filepath('umb_reclaim_mid_t');}
+);
+my $max_table_id = 300000;
+my $phase2_max_update_id = $max_table_id;
+my $target_middle_segno = 1;
+my ($phase1_min_id, $phase1_max_id, $phase1_cnt) =
+  ctid_id_band_for_segno($node, 'umb_reclaim_mid_t', $target_middle_segno);
+my $seg_prealloc_bytes = 4 * 1024 * 1024;
+
+is($node->safe_psql(
+		'postgres',
+		"SELECT COALESCE((pg_stat_file('$main_path', true)).size, -1) >= 0;"),
+	't',
+	'MAIN seg0 exists after load');
+is($node->safe_psql(
+		'postgres',
+		"SELECT COALESCE((pg_stat_file('$main_path.1', true)).size, -1) >= 0;"),
+	't',
+	'MAIN seg1 exists after load');
+cmp_ok($phase1_cnt, '>', 0, 'identified initial seg1 id band before frontier push');
+is($phase1_max_id - $phase1_min_id + 1,
+   $phase1_cnt,
+   'initial seg1 id band is contiguous');
+
+my $frontier_crossed_seg1 = 'f';
+my $seg2_size_after_load = $node->safe_psql(
+	'postgres',
+	"SELECT COALESCE((pg_stat_file('" . seg_path($main_path, 2) .
+	"', true)).size, -1);");
+my $update_plan = $node->safe_psql(
+	'postgres', qq{
+SET enable_seqscan = off;
+SET enable_bitmapscan = off;
+SET enable_indexscan = on;
+EXPLAIN (COSTS OFF)
+UPDATE umb_reclaim_mid_t
+SET payload = md5((id + 1000000)::text) || repeat('u', 6968)
+WHERE id BETWEEN $phase1_min_id AND $phase1_max_id;
+});
+
+like($update_plan, qr/Index Scan using umb_reclaim_mid_t_pkey/i,
+	'update path uses primary-key index scan');
+unlike($update_plan, qr/Seq Scan|Bitmap/i,
+	'update path avoids seqscan/bitmapscan');
+
+$frontier_crossed_seg1 = 't' if $seg2_size_after_load > $seg_prealloc_bytes;
+
+is($frontier_crossed_seg1, 't',
+   'initial load already pushed seg2 past the 4MB preallocation watermark');
+
+my ($target_min_id, $target_max_id, $target_cnt) =
+  ctid_id_band_for_segno($node, 'umb_reclaim_mid_t', $target_middle_segno);
+cmp_ok($target_cnt, '>', 0,
+	're-identified current seg1 id band after frontier crossed seg1');
+is($target_max_id - $target_min_id + 1,
+   $target_cnt,
+   'current seg1 id band is contiguous before backup');
+
+$node->backup('bkp_reclaim_mid');
+my $standby = PostgreSQL::Test::Cluster->new('standby');
+$standby->init_from_backup($node, 'bkp_reclaim_mid', has_streaming => 1);
+$standby->start;
+is($standby->safe_psql('postgres',
+		q{SELECT pg_relation_filepath('umb_reclaim_mid_t');}),
+	$main_path,
+	'primary and standby see same relpath');
+
+$node->safe_psql(
+	'postgres', q{
+ALTER SYSTEM SET map_compactor_enable = on;
+SELECT pg_reload_conf();
+SELECT pg_sleep(1.0);
+});
+
+my $enq_before = $node->safe_psql(
+	'postgres',
+	q{SELECT pg_stat_get_map_reclaim_enqueued();}
+);
+my $reclaim_enqueued = 'f';
+my ($phase2_min_id, $phase2_max_id, $phase2_cnt);
+my ($phase2_update_min_id, $phase2_update_max_id);
+my $phase2_floor_id = $phase1_min_id;
+
+# Phase 2: now that the frontier is in seg2+ and standby has copied the shaped
+# state, repeatedly cross checkpoint boundaries and rewrite the current seg1 id
+# band. Keep the lower bound anchored at the current seg1 band so we do not
+# punch seg0, but let the upper bound run to the last id of the 300000-row load
+# to keep pressure on tail allocations.
+for my $i (1 .. 12)
+{
+	my $seed = 10000000 + $i * 1000000;
+
+	$node->safe_psql('postgres', q{CHECKPOINT;});
+	($phase2_min_id, $phase2_max_id, $phase2_cnt) =
+	  ctid_id_band_for_segno($node, 'umb_reclaim_mid_t', $target_middle_segno);
+	last if $phase2_cnt <= 0;
+	$phase2_update_min_id =
+	  $phase2_min_id > $phase2_floor_id ? $phase2_min_id : $phase2_floor_id;
+	$phase2_update_max_id = $phase2_max_update_id;
+	$node->safe_psql(
+		'postgres',
+		"SET enable_seqscan = off; " .
+		"SET enable_bitmapscan = off; " .
+		"SET enable_indexscan = on; " .
+		"UPDATE umb_reclaim_mid_t " .
+		"SET payload = md5((id + $seed)::text) || repeat('w', 6968) " .
+		"WHERE id BETWEEN $phase2_update_min_id AND $phase2_update_max_id;");
+
+	$reclaim_enqueued = $node->safe_psql(
+		'postgres',
+		"SELECT pg_stat_get_map_reclaim_enqueued() > $enq_before;");
+	last if $reclaim_enqueued eq 't';
+}
+
+if ($reclaim_enqueued ne 't')
+{
+	for my $round (1 .. 24)
+	{
+		$node->safe_psql('postgres', q{CHECKPOINT; SELECT pg_sleep(0.5);});
+		$reclaim_enqueued = $node->safe_psql(
+			'postgres',
+			"SELECT pg_stat_get_map_reclaim_enqueued() > $enq_before;");
+		last if $reclaim_enqueued eq 't';
+	}
+}
+
+is($reclaim_enqueued, 't',
+	'compactor enqueued reclaim unlink after seg1 pages were remapped into tail space');
+
+my $middle_seg_removed_on_primary = 'f';
+for my $round (1 .. 16)
+{
+	$node->safe_psql('postgres', q{CHECKPOINT; SELECT pg_sleep(0.2);});
+
+	my $removed = $node->safe_psql(
+		'postgres',
+		"SELECT COALESCE((pg_stat_file('" .
+		seg_path($main_path, $target_middle_segno) .
+		"', true)).size, -1) = -1;");
+	$middle_seg_removed_on_primary = 't' if $removed eq 't';
+	last if $middle_seg_removed_on_primary eq 't';
+}
+is($middle_seg_removed_on_primary, 't',
+	'segment 1 is physically removed by internal reclaim after checkpoint rounds');
+ok($node->poll_query_until(
+		'postgres',
+		"SELECT COALESCE((pg_stat_file('$main_path', true)).size, -1) >= 0;"),
+	'MAIN seg0 file still exists after middle-segment reclaim');
+
+my $until_lsn = $node->safe_psql('postgres', q{SELECT pg_current_wal_lsn();});
+ok($standby->poll_query_until(
+		'postgres',
+		"SELECT '$until_lsn'::pg_lsn <= pg_last_wal_replay_lsn();"),
+	'standby caught up to reclaim WAL');
+ok($standby->poll_query_until(
+		'postgres',
+		"SELECT COALESCE((pg_stat_file('" .
+		seg_path($main_path, $target_middle_segno) .
+		"', true)).size, -1) = -1;"),
+	'standby segment 1 is physically removed after replay');
+ok($standby->poll_query_until(
+		'postgres',
+		"SELECT COALESCE((pg_stat_file('$main_path', true)).size, -1) >= 0;"),
+	'standby MAIN seg0 file still exists after replay');
+
+# Post-reclaim stability round:
+# run further writes including logical seg1 ranges, checkpoint, and verify
+# primary/standby remain consistent while seg1 stays absent.
+$node->safe_psql(
+	'postgres', qq{
+UPDATE umb_reclaim_mid_t
+SET payload = md5((id + 7000000)::text) || repeat('r', 6968)
+WHERE id BETWEEN 1 AND 5000;
+
+UPDATE umb_reclaim_mid_t
+SET payload = md5((id + 9000000)::text) || repeat('s', 6968)
+WHERE id BETWEEN $target_min_id AND $target_max_id;
+
+DELETE FROM umb_reclaim_mid_t WHERE id BETWEEN 20001 AND 21000;
+INSERT INTO umb_reclaim_mid_t
+SELECT g, md5((g + 12000000)::text) || repeat('n', 6968)
+FROM generate_series(20001, 21000) g;
+
+CHECKPOINT;
+});
+
+my $stability_lsn = $node->safe_psql('postgres', q{SELECT pg_current_wal_lsn();});
+ok($standby->poll_query_until(
+		'postgres',
+		"SELECT '$stability_lsn'::pg_lsn <= pg_last_wal_replay_lsn();"),
+	'standby caught up after post-reclaim write round');
+
+ok($node->poll_query_until(
+		'postgres',
+		"SELECT COALESCE((pg_stat_file('" .
+		seg_path($main_path, $target_middle_segno) .
+		"', true)).size, -1) = -1;"),
+	'primary segment 1 stays absent after post-reclaim writes');
+ok($standby->poll_query_until(
+		'postgres',
+		"SELECT COALESCE((pg_stat_file('" .
+		seg_path($main_path, $target_middle_segno) .
+		"', true)).size, -1) = -1;"),
+	'standby segment 1 stays absent after replay of post-reclaim writes');
+
+my $primary_fp = $node->safe_psql(
+	'postgres', q{
+SELECT format('%s|%s|%s|%s',
+			  count(*),
+			  min(id),
+			  max(id),
+			  sum((id::bigint * 3 + length(payload))::bigint))
+FROM umb_reclaim_mid_t;
+});
+is($standby->safe_psql(
+		'postgres', q{
+SELECT format('%s|%s|%s|%s',
+			  count(*),
+			  min(id),
+			  max(id),
+			  sum((id::bigint * 3 + length(payload))::bigint))
+FROM umb_reclaim_mid_t;
+}),
+	$primary_fp,
+	'primary/standby aggregate fingerprint matches after post-reclaim writes');
+
+my $primary_sample = $node->safe_psql(
+	'postgres', qq{
+SELECT string_agg(id::text || ':' || left(payload, 12), ',' ORDER BY id)
+FROM umb_reclaim_mid_t
+WHERE id IN (1, 5000, 20001, 21000, $target_min_id, $target_max_id);
+});
+is($standby->safe_psql(
+		'postgres', qq{
+SELECT string_agg(id::text || ':' || left(payload, 12), ',' ORDER BY id)
+FROM umb_reclaim_mid_t
+WHERE id IN (1, 5000, 20001, 21000, $target_min_id, $target_max_id);
+}),
+	$primary_sample,
+	'primary/standby sample rows match after post-reclaim writes');
+
+is($node->safe_psql('postgres', q{SELECT count(*) FROM umb_reclaim_mid_t;}),
+	$max_table_id,
+	'relation remains readable after middle-segment reclaim');
+is($standby->safe_psql('postgres', q{SELECT count(*) FROM umb_reclaim_mid_t;}),
+	$max_table_id,
+	'standby relation remains readable after replay');
+
+done_testing();
-- 
2.50.1 (Apple Git-155)
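
For readers following the plpgsql helper umb_count_mapped_in_seg() in
t/064 above, here is a minimal standalone C sketch of the same
lblk -> MAP page arithmetic. The constants (2048 entries per MAP page,
groups of 8192 MAP pages laid out with a stride of 8194 pages, a 3-page
offset at the start of the fork, 4-byte little-endian entries, and
0xFFFFFFFF meaning "unmapped") are copied from that test helper and
assume an 8kB block size. The macro, type, and function names are
hypothetical and are not taken from the Umbra headers; this only
illustrates the calculation the test performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the t/064 helper; assumed, not from Umbra headers. */
#define MAP_ENTRIES_PER_PAGE	2048u	/* 8192-byte page / 4-byte entry */
#define MAP_PAGES_PER_GROUP		8192u	/* MAP pages addressed per group */
#define MAP_GROUP_STRIDE		8194u	/* pages one group occupies in the fork */
#define MAP_FIRST_PAGE			3u		/* first MAP page of group 0 */
#define MAP_INVALID_PBLK		UINT32_MAX	/* entry holds no mapping */

/* Hypothetical type: where the entry for one logical block lives. */
typedef struct MapEntryAddr
{
	uint64_t	page_no;	/* page index within the _map file */
	uint32_t	offset;		/* byte offset of the entry within that page */
} MapEntryAddr;

/* Locate the MAP entry for a logical block number. */
static MapEntryAddr
map_entry_addr(uint32_t lblk)
{
	uint64_t	fork_page_idx = lblk / MAP_ENTRIES_PER_PAGE;
	uint64_t	group_no = fork_page_idx / MAP_PAGES_PER_GROUP;
	MapEntryAddr addr;

	addr.page_no = MAP_FIRST_PAGE + group_no * MAP_GROUP_STRIDE +
		fork_page_idx % MAP_PAGES_PER_GROUP;
	addr.offset = (lblk % MAP_ENTRIES_PER_PAGE) * 4u;
	return addr;
}

/* Decode one 4-byte little-endian physical block number. */
static uint32_t
map_decode_pblk(const unsigned char *entry)
{
	return (uint32_t) entry[0] |
		((uint32_t) entry[1] << 8) |
		((uint32_t) entry[2] << 16) |
		((uint32_t) entry[3] << 24);
}

/* Does a mapped physical block fall inside the given 1GB segment? */
static bool
pblk_in_segment(uint32_t pblk, uint32_t segno, uint32_t block_size)
{
	uint64_t	seg_blocks;
	uint64_t	seg_lo;

	assert(block_size > 0);
	seg_blocks = (UINT64_C(1024) * 1024 * 1024) / block_size;
	seg_lo = (uint64_t) segno * seg_blocks;

	return pblk != MAP_INVALID_PBLK &&
		pblk >= seg_lo && pblk < seg_lo + seg_blocks;
}
```

Under these assumptions, logical block 2048 resolves to MAP page 4 at
byte offset 0, and the first block of the second group (lblk 16777216)
resolves to page 8197. Counting the entries for which pblk_in_segment()
is true over all logical blocks gives the "live mappings in segment N"
figure that the test uses.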

#13

Tom Lane

tgl@sss.pgh.pa.us

22 days ago

In reply to: Mingwei Jia (#12)

Re: [RFC PATCH v2 RESEND 10/10] umbra: add patch 9 compactor framework and non-interference policy

This is still not working :-(.

One patch per email is not very useful: our cfbot is not smart enough
to realize that this is meant as a set of patches rather than a series
of updated versions of a single patch.

Also, I'm not quite sure what you are doing but it is not ending up as
an attached file. It really needs to be an attachment, not inline
text, so that it can be conveniently saved and won't get mangled by
anyone's mail reader.

You could make it work by attaching several patch files to one email.
But for a patch set as large as this, it's probably a better idea
to put the patch files into a tar.gz file and send that as a single
attachment.

regards, tom lane

#14

Mingwei Jia

i@nayishan.top

22 days ago

In reply to: Tom Lane (#13)

Re: [RFC PATCH v2 RESEND 0/10] Umbra: a remap-aware smgr prototype

Hi Tom,

Thanks for the clarification, and sorry for the noise.

Attached is a tar.gz archive containing the 10 patch files as a single
attachment, as suggested.

The archive contains:

umbra-rfc-v2-patches/0001-...
...
umbra-rfc-v2-patches/0010-...

This is the same RFC v2 patch set as before; only the delivery format is
changed.

SHA256:

2b47ef8e77d1860a2f652cc86a5d030f0bcbc466d5f0fbbf515b20b55835c4b7

Regards,
Mingwei Jia

#15

Bruce Momjian

bruce@momjian.us

22 days ago

In reply to: Mingwei Jia (#14)

Re: [RFC PATCH v2 RESEND 0/10] Umbra: a remap-aware smgr prototype

On Tue, Jun 2, 2026 at 08:05:00AM +0800, Mingwei Jia wrote:

> Hi Tom,
>
> Thanks for the clarification, and sorry for the noise.
>
> Attached is a tar.gz archive containing the 10 patch files as a single
> attachment, as suggested.
>
> The archive contains:
>
> umbra-rfc-v2-patches/0001-...
> ...
> umbra-rfc-v2-patches/0010-...
>
> This is the same RFC v2 patch set as before; only the delivery format is
> changed.
>
> SHA256:
>
> 2b47ef8e77d1860a2f652cc86a5d030f0bcbc466d5f0fbbf515b20b55835c4b7

Yes, I think that worked; the email had two parts, and the second part
was labeled correctly:

I 1 <no description> [text/plain, 8bit, utf-8, 0.4K]
A 2 umbra-rfc-v2-patches.tar.gz [applica/x-gzip, base64, 316K]

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Do not let urgent matters crowd out time for investment in the future.

#16

Robert Haas

robertmhaas@gmail.com

21 days ago

In reply to: Tom Lane (#13)

Re: [RFC PATCH v2 RESEND 10/10] umbra: add patch 9 compactor framework and non-interference policy

On Mon, Jun 1, 2026 at 7:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> This is still not working :-(.
>
> One patch per email is not very useful: our cfbot is not smart enough
> to realize that this is meant as a set of patches rather than a series
> of updated versions of a single patch.
>
> Also, I'm not quite sure what you are doing but it is not ending up as
> an attached file. It really needs to be an attachment, not inline
> text, so that it can be conveniently saved and won't get mangled by
> anyone's mail reader.
>
> You could make it work by attaching several patch files to one email.
> But for a patch set as large as this, it's probably a better idea
> to put the patch files into a tar.gz file and send that as a single
> attachment.

Also, each patch will need a real commit message explaining its
purpose in detail. Like our actual commit messages.

--
Robert Haas
EDB: http://www.enterprisedb.com

#17

Alvaro Herrera

alvherre@2ndquadrant.com

21 days ago

In reply to: Mingwei Jia (#6)

Re: [RFC PATCH v2 RESEND 04/10] umbra: add patch 3 metadata disk format and identity mapping bootstrap

On 2026-Jun-02, Mingwei Jia wrote:

> diff --git a/src/backend/storage/map/map.c b/src/backend/storage/map/map.c
> new file mode 100644
> index 0000000000..563f38b21a
> --- /dev/null
> +++ b/src/backend/storage/map/map.c
> @@ -0,0 +1,162 @@
> +/*-------------------------------------------------------------------------
> + *
> + * map.c
> + *	  Umbra metadata-fork disk layout helpers.
> + *
> + * This file contains address-translation and in-page access routines for the
> + * metadata fork disk layout.
> + *
> + * src/backend/storage/map/map.c
> + *
> + *-------------------------------------------------------------------------
> + */

I find this pretty difficult to understand, and I think it's because the
fork system is the wrong abstraction. The current system is too
obviously centered around heapam and nbtree, to the detriment of
everything else. I would like a system whereby each (index, table) AM
can determine which forks exist and how to deal with each. For
instance, BRIN indexes have a "revmap" at the start of the main fork;
when it needs one more "revmap" page but it's already used by regular
index tuples, it needs to move all those existing index tuples to
another page so that the page can be used as a revmap page (I think this
is called "evacuate"). That's absurd and overcomplicated. If the
revmap had its own fork, a bunch of code and locking considerations
would simply go away.

I think that would also apply to this map thingy you want to create
(although I can't claim to have really understood what you're trying to
achieve).

Anyway, if I'm pointing in roughly the right direction, then I don't
like the idea of this new smgr creating yet another layer to paper over
that failed abstraction. Let's fix that instead.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/

#18

Mingwei Jia

i@nayishan.top

20 days ago

In reply to: Alvaro Herrera (#17)

Re: [RFC PATCH v2 RESEND 04/10] umbra: add patch 3 metadata disk format and identity mapping bootstrap

Hi Álvaro, all,

Thanks for the comments and help so far.

First, thanks to Tom, Bruce, and Robert for helping me get the
submission format into better shape. I will keep using the tar.gz
attachment format for future versions. Robert, I agree that the current
per-patch subjects are not enough. In v3 I will at least add real commit
messages to each patch, so that each patch explains what it adds and why
it is split that way.

Álvaro, I think your fork-abstraction point is a good question. It made
me think more about what the common code should provide and what the
owner of relation-local storage should provide.

Looking at the patch again, I think part of the problem is that I mixed
too many things into the smgr interface. There are lifecycle hooks,
runtime mapping calls, WAL/redo calls, background maintenance, and a few
statistics counters. That probably makes the code harder to understand
than it needs to be.

smgrisinternalfork() is one example where the boundary is not clean
enough. I am also not sure that the map statistics counters should be in
the smgr-facing API.

So before trying to solve the much larger owner-defined fork problem, I
think I should first clean up the Umbra smgr interface. The smgr may
still be the right owner for the lblk-to-pblk metadata, because code
above smgr should not know physical block numbers. But the way the
prototype exposes that metadata today needs work.

I also do not see the current smgr placement only as a shortcut around
the fork abstraction problem. The lblk-to-pblk map is part of the
physical placement policy of the storage manager, not table-AM or
index-AM contents. Even if PostgreSQL eventually has a more general
owner-defined fork facility, I think this particular kind of metadata may
still naturally be owned by the smgr implementation.

One complication is buffering. The metadata fork adds a separate shared
cache for MAP pages, outside PostgreSQL's normal buffer manager for
relation pages. That follows from making the remapped smgr own the
mapping metadata: the MAP cache stores translation metadata, while the
normal buffer manager continues to cache relation pages by logical block
identity. This is another reason why the owner-defined fork question
looks larger than just allowing an AM to declare more forks.

The reason Umbra needs this metadata is to make the logical/physical
split durable and redoable. For some ordinary updates after checkpoint,
PostgreSQL normally needs a full-page image in WAL because redo needs a
safe page image to start from. Umbra tries to replace that inline WAL
page image, in eligible cases, with a preserved old physical page.

In that model, old_pblk is the content baseline and new_pblk is the
WAL-owned physical target. WAL records the old physical block, the new
physical block, and the resulting mapping state. During redo, the old
physical page can be used as the baseline, the WAL delta is applied, and
the reconstructed page is written to the new physical location.
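
To make the redo flow above concrete, here is a minimal toy sketch in C.
It is not the Umbra code and not its WAL format: ToyStore,
ToyRemapRecord, toy_redo_remap(), the byte-range delta, and the 16-byte
page size are all hypothetical names and simplifications chosen only
for illustration. It assumes the old physical page is still intact when
redo runs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 16   /* toy value; PostgreSQL's default BLCKSZ is 8192 */
#define TOY_NUM_PBLK   8   /* physical pages in the toy file */
#define TOY_NUM_LBLK   4   /* logical blocks in the toy relation */

typedef uint32_t BlockNo;

/* Toy storage: physical pages plus an lblk -> pblk map (hypothetical). */
typedef struct ToyStore
{
    uint8_t pages[TOY_NUM_PBLK][TOY_PAGE_SIZE];
    BlockNo map[TOY_NUM_LBLK];
} ToyStore;

/* Hypothetical remap record: a byte-range delta instead of a full-page image. */
typedef struct ToyRemapRecord
{
    BlockNo  lblk;       /* logical block being modified */
    BlockNo  old_pblk;   /* content baseline, preserved on disk */
    BlockNo  new_pblk;   /* physical target that redo writes */
    uint16_t offset;     /* start of the changed byte range */
    uint16_t len;        /* length of the changed byte range */
    uint8_t  data[TOY_PAGE_SIZE];
} ToyRemapRecord;

/* Start with an identity mapping: lblk i lives at pblk i. */
static void
toy_store_init(ToyStore *s)
{
    memset(s, 0, sizeof(*s));
    for (BlockNo i = 0; i < TOY_NUM_LBLK; i++)
        s->map[i] = i;
}

/* Callers above the map only ever name logical blocks. */
static const uint8_t *
toy_read_lblk(const ToyStore *s, BlockNo lblk)
{
    assert(lblk < TOY_NUM_LBLK);
    return s->pages[s->map[lblk]];
}

/* Redo one record. Returns 0 on success, -1 if the record is invalid. */
static int
toy_redo_remap(ToyStore *s, const ToyRemapRecord *r)
{
    uint8_t page[TOY_PAGE_SIZE];

    if (r->lblk >= TOY_NUM_LBLK ||
        r->old_pblk >= TOY_NUM_PBLK || r->new_pblk >= TOY_NUM_PBLK)
        return -1;
    if (r->old_pblk == r->new_pblk)
        return -1;              /* the baseline must not be overwritten */
    if (r->offset > TOY_PAGE_SIZE || r->len > TOY_PAGE_SIZE - r->offset)
        return -1;

    memcpy(page, s->pages[r->old_pblk], TOY_PAGE_SIZE);  /* 1. baseline */
    memcpy(page + r->offset, r->data, r->len);           /* 2. apply delta */
    memcpy(s->pages[r->new_pblk], page, TOY_PAGE_SIZE);  /* 3. write target */
    s->map[r->lblk] = r->new_pblk;                       /* 4. publish map */
    return 0;
}
```

In this toy model, replaying the same record twice gives the same
result, because redo only reads old_pblk and never writes it. That is
the property the preserved old page is meant to provide in place of an
inline full-page image.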

So the idea is not to disable full_page_writes globally. It is to move
the recovery baseline from an inline full-page image in WAL to an old
physical page preserved by the remap and reclaim machinery.

For v3, the immediate change I will make is to add real commit messages
to every patch. I will also make the cover letter point more directly to
the relevant design notes, especially the smgr-private metadata boundary
and the intended review scope.

I will also take the comments about the smgr-facing interfaces being hard
to understand into account. Larger changes to the interface shape, patch
boundaries, or code structure will need separate follow-up work, and I
expect to improve those parts incrementally. I will try to make the next
version easier to review.

Thanks again for the help so far. I think this direction is worth
exploring, and I would be very happy to keep working through the open
questions with others and see whether this approach can be made to work.

Regards,
Mingwei

[PoC] Umbra: a remap-aware smgr prototype on PostgreSQL master

Attachments: