RFC: PostgreSQL Storage I/O Transformation Hooks
Infrastructure for a Technical Protocol Between RDBMS Core and Data Security Experts
*Author:* Henson Choi assam258@gmail.com
*Date:* 2025-12-28
*PostgreSQL Version:* master (Development)
------------------------------
1. Summary & Motivation
This RFC proposes the introduction of minimal hooks into the PostgreSQL
storage layer and the addition of a *Transformation ID* field to the
PageHeader.
A Diplomatic Protocol Between Expert Groups
The core motivation of this proposal is *“Separation of Concerns and Mutual
Respect.”*
Historically, discussions around Transparent Data Encryption (TDE) have
often felt like putting security experts on trial in a foreign
court—specifically, the “Court of RDBMS.” It is time to treat them not as
defendants to be judged by database-specific rules, but as an *equal
neighboring community* with their own specialized sovereignty.
*The issue has never been a failure of technology, but rather a
misplacement of the focal point.* While previous discussions were mired in
the technicalities of “how to hardcode encryption into the core,” this
proposal shifts the debate toward an architectural solution: “what
interface the core should provide to external experts.”
- *RDBMS Experts* provide a trusted pipeline responsible for data I/O
paths and consistency.
- *Security Experts* take responsibility for the specialized domain of
encryption algorithms and key management.
This hook system functions as a *Technical Protocol*—a high-level agreement
that allows these two expert groups to exchange data securely without
encroaching on each other’s territory.
------------------------------
2. Design Principles
1. *Delegation of Authority:* The core remains independent of specific
encryption standards, providing a “free territory” where security experts
can respond to an ever-changing security landscape.
2. *Diplomatic Convention:* The Transformation ID acts as a
communication protocol between the engine and the extension. The engine
uses this ID to identify the state of the data and hands over control to
the appropriate expert (the extension).
3. *Minimal Interference:* Overhead is kept near zero when hooks are not
in use, ensuring the native performance of the PostgreSQL engine.
------------------------------
3. Proposal Specifications
3.1 The Interface (Hook Points)
We allow intervention by security experts through five contact points along
the I/O path:
- *Read/Write Hooks:* mdread_post, mdwrite_pre, mdextend_pre
(Transformation of the data area)
- *WAL Hooks:* xlog_insert_pre, xlog_decode_pre (Transformation of
transaction logs)
3.2 The Protocol Identifier (PageHeader Transformation ID)
We allocate 5 bits of pd_flags to define the “Security State” of a page.
This serves as a *Status Message* sent by the security expert to the
engine, utilized for key versioning and as a migration marker.
------------------------------
4. Reference Implementation: contrib/test_tde
A Standard Code of Conduct for Security Experts
This reference implementation exists not as a commercial product, but to
define the *Standards of the Diplomatic Protocol* that
encryption/decryption experts must follow when entering the PostgreSQL
domain.
1. *Deterministic IV Derivation:* Demonstrates how to achieve
cryptographic safety by trusting unique values provided by the engine
(e.g., LSN).
2. *Critical Section Safety:* Defines memory management regulations that
security logic must follow within “Critical Sections” to maintain system
stability.
3. *Hook Chaining:* Demonstrates a cooperative structure that allows
peaceful coexistence with other expert tools (e.g., compression, auditing).
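Deterministic IV derivation from engine-provided values (item 1) can be sketched as follows. This is a standalone illustration, not code from test_tde.c: it assumes a 16-byte IV packed big-endian as page LSN, relation file number and block number (the three inputs the test_tde commit message names), and the helper names `put_be32` and `derive_page_iv` are hypothetical.

```c
#include <stdint.h>

#define TDE_IV_LEN 16

/* Store a 32-bit value at p in big-endian byte order. */
static void
put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t) (v >> 24);
    p[1] = (uint8_t) (v >> 16);
    p[2] = (uint8_t) (v >> 8);
    p[3] = (uint8_t) v;
}

/*
 * Hypothetical layout: bytes 0-7 page LSN, 8-11 relation file number,
 * 12-15 block number. The same inputs always yield the same IV, so no IV
 * has to be stored alongside the page.
 *
 * Caveat: in CTR mode the cipher also increments the low-order bytes once
 * per 16-byte block, so a real layout must keep the counters of neighboring
 * pages from overlapping; this sketch does not address that.
 */
static void
derive_page_iv(uint64_t page_lsn, uint32_t rel_number, uint32_t block_number,
               uint8_t iv[TDE_IV_LEN])
{
    put_be32(iv, (uint32_t) (page_lsn >> 32));
    put_be32(iv + 4, (uint32_t) page_lsn);
    put_be32(iv + 8, rel_number);
    put_be32(iv + 12, block_number);
}
```

The safety argument rests entirely on the engine never producing the same (LSN, relation, block) triple for two different page images under one key; the extension trusts that property rather than generating and storing random IVs.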
------------------------------
5. Scope
- *In-Scope:* Backend hook infrastructure, Transformation ID field, and
reference code demonstrating diplomatic protocol compliance.
- *Out-of-Scope:* Specific Key Management Systems (KMS), selection of
specific cryptographic algorithms, and integration with external tools.
This proposal represents a strategic diplomatic choice: rather than the
PostgreSQL core assuming all security responsibilities, it grants security
experts a *sovereign territory through extensions* where they can perform
at their best.
Hello,
Following up on the RFC, I am submitting the initial patch set for the
proposed infrastructure. These patches introduce a minimal hook-based
protocol to allow extensions to handle data transformation, such as TDE,
while keeping the PostgreSQL core independent of specific cryptographic
implementations.
Implementation Details:
Hook Points in Storage I/O Path
The patch introduces five strategic hook points:
- mdread_post_hook: Called after blocks are read from disk. The extension can
reverse-transform data in place.
- mdwrite_pre_hook & mdextend_pre_hook: Called before writing or extending
blocks. These hooks return a pointer to transformed buffers.
- xlog_insert_pre_hook & xlog_decode_pre_hook: Handle transformation for WAL
records during insertion and replay.
Data Integrity and Checksum Protocol
To ensure robust error detection, the hooks follow a specific verification
protocol:
On Write: The extension transforms the page, sets the Transform ID, then
recalculates the checksum on the transformed data.
On Read: The extension verifies the on-disk checksum of the transformed
data first. After reverse-transformation, it clears the Transform ID and
recalculates the checksum for the plaintext data. This ensures corruption
is detected regardless of the transformation state.
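The ordering of this protocol can be modeled with a toy page. Everything here is a stand-in chosen for illustration: a 64-byte page whose bytes 0-1 hold the flags and 2-3 the checksum, a position-weighted byte sum in place of PostgreSQL's page checksum, and an XOR of the payload in place of encryption. Only the Transform ID mask and shift come from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy page: bytes 0-1 flags, bytes 2-3 checksum, the rest payload. */
#define TOY_PAGE_SIZE   64
#define TOY_HDR_SIZE    4
#define TRANSFORM_MASK  0x00F8          /* bits 3-7, as in the proposal */
#define TRANSFORM_SHIFT 3

static uint16_t
get16(const uint8_t *p)
{
    return (uint16_t) (p[0] | (p[1] << 8));
}

static void
put16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t) v;
    p[1] = (uint8_t) (v >> 8);
}

/* Toy checksum: position-weighted byte sum, skipping the checksum field. */
static uint16_t
toy_checksum(const uint8_t *page)
{
    uint32_t    sum = 0;

    for (size_t i = 0; i < TOY_PAGE_SIZE; i++)
    {
        if (i == 2 || i == 3)
            continue;
        sum += (uint32_t) (i + 1) * page[i];
    }
    return (uint16_t) sum;
}

/* Toy reversible transform standing in for encryption. */
static void
toy_xor_payload(uint8_t *page, uint8_t key)
{
    for (size_t i = TOY_HDR_SIZE; i < TOY_PAGE_SIZE; i++)
        page[i] ^= key;
}

/* Write: transform, set the Transform ID, then checksum the on-disk bytes. */
static void
on_write(uint8_t *page, uint8_t key, uint8_t transform_id)
{
    uint16_t    flags = get16(page);

    toy_xor_payload(page, key);
    flags = (uint16_t) ((flags & ~TRANSFORM_MASK) |
                        ((transform_id << TRANSFORM_SHIFT) & TRANSFORM_MASK));
    put16(page, flags);
    put16(page + 2, toy_checksum(page));
}

/*
 * Read: verify the on-disk checksum first, reverse the transform, clear the
 * Transform ID, then checksum the plaintext so the core's own check passes.
 * Returns false if the transformed bytes were corrupted on disk.
 */
static bool
on_read(uint8_t *page, uint8_t key)
{
    if (get16(page + 2) != toy_checksum(page))
        return false;
    toy_xor_payload(page, key);
    put16(page, (uint16_t) (get16(page) & ~TRANSFORM_MASK));
    put16(page + 2, toy_checksum(page));
    return true;
}
```

Because the checksum is recomputed at both stages, a flipped byte in the transformed image is caught before decryption, and the plaintext page handed to the core still carries a checksum that matches its contents.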
WAL Safety via XLR_BLOCK_ID_TRANSFORMED (251)
For WAL records, I have introduced a specific block ID (251) to mark
transformed data. If the decryption extension is not loaded, the WAL reader
will encounter this unknown block ID and fail-fast, preventing the system
from incorrectly interpreting encrypted data as valid WAL records.
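The fail-fast behavior can be illustrated with a small model. It assumes the transformed payload is wrapped behind the same 1-byte id + 4-byte length header that XLogRecordDataHeaderLong uses (as the xlogrecord.h comment in the patch states); the function names are hypothetical, and `reader_accepts_block_id` is a simplified model of the block-ID check in DecodeXLogRecord(), not its actual code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Block IDs from xlogrecord.h, plus the proposed marker. */
#define XLR_MAX_BLOCK_ID            32
#define XLR_BLOCK_ID_DATA_SHORT     255
#define XLR_BLOCK_ID_DATA_LONG      254
#define XLR_BLOCK_ID_ORIGIN         253
#define XLR_BLOCK_ID_TOPLEVEL_XID   252
#define XLR_BLOCK_ID_TRANSFORMED    251

/* 1-byte id + 4-byte length, the same shape as XLogRecordDataHeaderLong. */
#define SIZE_OF_LONG_HEADER (sizeof(uint8_t) + sizeof(uint32_t))

/* Wrap an already-transformed payload; returns bytes written, 0 if out is too small. */
static size_t
wrap_transformed(const uint8_t *payload, uint32_t len, uint8_t *out, size_t outsz)
{
    if (outsz < SIZE_OF_LONG_HEADER + len)
        return 0;
    out[0] = XLR_BLOCK_ID_TRANSFORMED;
    memcpy(out + 1, &len, sizeof(len));     /* unaligned, native byte order */
    memcpy(out + SIZE_OF_LONG_HEADER, payload, len);
    return SIZE_OF_LONG_HEADER + len;
}

/* A reader without the extension knows only these IDs; 251 is rejected. */
static bool
reader_accepts_block_id(uint8_t block_id)
{
    return block_id <= XLR_MAX_BLOCK_ID ||
        block_id == XLR_BLOCK_ID_DATA_SHORT ||
        block_id == XLR_BLOCK_ID_DATA_LONG ||
        block_id == XLR_BLOCK_ID_ORIGIN ||
        block_id == XLR_BLOCK_ID_TOPLEVEL_XID;
}
```

Reusing an otherwise unassigned block ID means no new parsing path is needed for the failure case: the existing "unknown block ID" error is what stops replay.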
PageHeader Transform ID (5-bit)
I have allocated bits 3-7 of pd_flags in the PageHeader for a Transform ID.
This allows the engine and extensions to identify the transformation state
of a page (e.g., key versioning or algorithm type) without attempting
decryption. It ensures backward compatibility: pages with Transform ID 0
are treated as standard untransformed pages.
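The bit arithmetic can be checked in isolation. This sketch reuses the PD_TRANSFORM_ID_MASK/SHIFT values from the bufpage.h hunk but operates on a bare `uint16_t` rather than a PageHeader, so `flags_get_transform_id` and `flags_set_transform_id` are hypothetical stand-ins for PageGetTransformId()/PageSetTransformId().

```c
#include <stdint.h>

/* Values from the proposed bufpage.h changes. */
#define PD_ALL_VISIBLE          0x0004
#define PD_TRANSFORM_ID_MASK    0x00F8  /* bits 3-7 */
#define PD_TRANSFORM_ID_SHIFT   3
#define PD_TRANSFORM_NONE       0

/* Extract the 5-bit Transform ID from a pd_flags value. */
static inline uint8_t
flags_get_transform_id(uint16_t pd_flags)
{
    return (uint8_t) ((pd_flags & PD_TRANSFORM_ID_MASK) >> PD_TRANSFORM_ID_SHIFT);
}

/* Store a Transform ID, leaving bits 0-2 and 8-15 untouched. */
static inline uint16_t
flags_set_transform_id(uint16_t pd_flags, uint8_t id)
{
    return (uint16_t) ((pd_flags & ~PD_TRANSFORM_ID_MASK) |
                       (((unsigned) id << PD_TRANSFORM_ID_SHIFT) & PD_TRANSFORM_ID_MASK));
}
```

Setting ID 7 on a page with only PD_ALL_VISIBLE set yields pd_flags 0x003C, and clearing it restores 0x0004. Values above 31 are silently truncated by the mask, so an extension should range-check its IDs before storing them.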
Memory and Critical Section Safety
As demonstrated in the contrib/test_tde reference implementation, cipher
contexts are pre-allocated in _PG_init to avoid memory allocation during
critical sections. For WAL transformation,
MemoryContextAllowInCriticalSection() is used to allow buffer reallocation
within critical sections; if OOM occurs during buffer growth, it results in
a controlled PANIC.
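The pre-allocation rule can be modeled outside the server. In this sketch `crit_section_count`, `ENTER_CRIT`/`LEAVE_CRIT` and `checked_alloc` are hypothetical stand-ins for CritSectionCount, START_CRIT_SECTION()/END_CRIT_SECTION() and palloc()'s critical-section assertion, and an XOR loop stands in for the cipher; only the pattern (allocate once at load time, reuse inside critical sections) reflects the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SCRATCH_SIZE 8192               /* one 8 kB page */

/* Stand-ins for CritSectionCount and START/END_CRIT_SECTION(). */
static int crit_section_count = 0;
#define ENTER_CRIT() (crit_section_count++)
#define LEAVE_CRIT() (crit_section_count--)

/* Allocation is refused inside a critical section, mirroring palloc()'s assertion. */
static void *
checked_alloc(size_t n)
{
    assert(crit_section_count == 0);
    return malloc(n);
}

/* Per-backend state allocated once, analogous to setup done in _PG_init(). */
static uint8_t *scratch_page = NULL;

static bool
transform_init(void)
{
    if (scratch_page == NULL)
        scratch_page = checked_alloc(SCRATCH_SIZE);
    return scratch_page != NULL;
}

/* Safe inside a critical section: touches only pre-allocated memory. */
static const uint8_t *
transform_page(const uint8_t *src, size_t len, uint8_t key)
{
    assert(scratch_page != NULL && len <= SCRATCH_SIZE);
    for (size_t i = 0; i < len; i++)
        scratch_page[i] = src[i] ^ key;     /* toy transform, not encryption */
    return scratch_page;
}
```

Returning a pointer to extension-owned static storage also matches the mdwrite_pre_hook/mdextend_pre_hook contract in md.h, where the hook manages the memory of the buffers it returns.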
Performance Considerations
When hooks are not set (default), the overhead is limited to a single NULL
pointer comparison per I/O operation. This is architecturally consistent
with existing PostgreSQL hooks and is designed to have a negligible impact
on performance.
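The NULL-check call site and the usual PostgreSQL hook-chaining idiom (save the previous hook in the extension's init function, then call it from your own) look roughly like this. The single-pointer signature `page_pre_hook_type` is a simplified, hypothetical stand-in for mdextend_pre_hook_type, and the `my_*` names are illustrative.

```c
#include <stddef.h>

/* Simplified stand-in for mdextend_pre_hook_type. */
typedef const void *(*page_pre_hook_type) (const void *buffer);

/* Core side: the hook variable defaults to NULL. */
static page_pre_hook_type page_pre_hook = NULL;

/* Core call site: with no extension loaded, the cost is one pointer test. */
static const void *
core_write_path(const void *buffer)
{
    if (page_pre_hook)
        buffer = page_pre_hook(buffer);
    return buffer;              /* what would be written to disk */
}

/* Extension side: remember the previously installed hook. */
static page_pre_hook_type prev_page_pre_hook = NULL;
static int my_hook_calls = 0;

static const void *
my_page_pre_hook(const void *buffer)
{
    /* Run the earlier hook first (e.g. compression before encryption). */
    if (prev_page_pre_hook)
        buffer = prev_page_pre_hook(buffer);
    my_hook_calls++;
    /* ... transform buffer into extension-owned storage here ... */
    return buffer;
}

/* Equivalent of _PG_init(): install without clobbering other extensions. */
static void
my_extension_init(void)
{
    prev_page_pre_hook = page_pre_hook;
    page_pre_hook = my_page_pre_hook;
}
```

Whether an extension calls the previous hook before or after its own transformation is a choice each extension makes; for write-side hooks the order determines whether, for example, data is compressed before or after it is encrypted.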
Attached Patches:
v20251228-0001-Add-Storage-I-O-Transform-Hooks-for-PostgreSQL.patch: Core
infrastructure.
v20251228-0002-Add-test_tde-extension-for-TDE-testing.patch: Reference
implementation using AES-256-CTR.
I look forward to your comments and feedback.
Regards,
Henson Choi
On Sun, Dec 28, 2025 at 4:49 PM Henson Choi <assam258@gmail.com> wrote:
Attachments:
v20251228-0001-Add-Storage-I-O-Transform-Hooks-for-PostgreSQL.patch (application/x-patch)
From 39d19fc7127124e007ce6bede487209afba6d827 Mon Sep 17 00:00:00 2001
From: Henson Choi <assam258@gmail.com>
Date: Tue, 2 Dec 2025 21:50:12 +0900
Subject: [PATCH] Add Storage I/O Transform Hooks for PostgreSQL
This patch introduces a set of hook points that allow extensions to
intercept and transform data during storage I/O operations. The hooks
are designed to support transparent data encryption (TDE) and similar
use cases that require data transformation at the storage layer.
The following hooks are added:
- mdread_post_hook, mdwrite_pre_hook and mdextend_pre_hook in md.c for
low-level storage manager I/O transformation; for AIO reads,
mdread_post_hook is invoked from buffer_readv_complete_one() in bufmgr.c
- xlog_insert_pre_hook in xloginsert.c for WAL record transformation
after assembly and before insertion
- xlog_decode_pre_hook in xlogreader.c for WAL record transformation
before validation and decoding
Each hook is optional and defaults to NULL, so the only overhead when no
extension is loaded is a NULL pointer check at each call site.
Author: Henson Choi <assam258@gmail.com>
---
src/backend/access/transam/xloginsert.c | 10 ++++
src/backend/access/transam/xlogreader.c | 21 ++++++++
src/backend/storage/buffer/bufmgr.c | 9 ++++
src/backend/storage/smgr/md.c | 20 ++++++++
src/include/access/xloginsert.h | 20 ++++++++
src/include/access/xlogreader.h | 20 ++++++++
src/include/access/xlogrecord.h | 5 ++
src/include/storage/bufpage.h | 25 +++++++++-
src/include/storage/md.h | 65 +++++++++++++++++++++++++
9 files changed, 194 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index a56d5a55282..f518ef3f16f 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -136,6 +136,12 @@ static bool begininsert_called = false;
/* Memory context to hold the registered buffer and data references. */
static MemoryContext xloginsert_cxt;
+/*
+ * Hook variable for WAL insert transformation (e.g., encryption).
+ * Extensions can set this hook to transform WAL data after assembly, before insertion.
+ */
+xlog_insert_pre_hook_type xlog_insert_pre_hook = NULL;
+
static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr RedoRecPtr, bool doPageWrites,
XLogRecPtr *fpw_lsn, int *num_fpi,
@@ -526,6 +532,10 @@ XLogInsert(RmgrId rmid, uint8 info)
&fpw_lsn, &num_fpi, &fpi_bytes,
&topxid_included);
+ /* Pre-insert hook for transformation (e.g., encryption) */
+ if (xlog_insert_pre_hook)
+ rdt = xlog_insert_pre_hook(rdt);
+
EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags, num_fpi,
fpi_bytes, topxid_included);
} while (!XLogRecPtrIsValid(EndPos));
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5e5001b2101..169f2b06fc5 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -40,6 +40,13 @@
#include "common/logging.h"
#endif
+/*
+ * Hook variable for WAL record transformation (e.g., decryption).
+ * Extensions can set this hook to transform raw WAL data before decoding.
+ * Frontend tools can also set this hook at startup.
+ */
+xlog_decode_pre_hook_type xlog_decode_pre_hook = NULL;
+
static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
pg_attribute_printf(2, 3);
static void allocate_recordbuf(XLogReaderState *state, uint32 reclength);
@@ -843,6 +850,11 @@ restart:
Assert(gotheader);
record = (XLogRecord *) state->readRecordBuf;
+
+ /* Pre-validation hook for transformation (e.g., decryption) */
+ if (xlog_decode_pre_hook)
+ record = xlog_decode_pre_hook(state, record, RecPtr, true);
+
if (!ValidXLogRecord(state, record, RecPtr))
goto err;
@@ -862,6 +874,15 @@ restart:
goto err;
/* Record does not cross a page boundary */
+
+ /*
+ * Pre-validation hook for transformation (e.g., decryption).
+ * inplace_allowed is false because record points to readBuf, which
+ * may be copied back to WAL files (e.g., FinishWalRecovery).
+ */
+ if (xlog_decode_pre_hook)
+ record = xlog_decode_pre_hook(state, record, RecPtr, false);
+
if (!ValidXLogRecord(state, record, RecPtr))
goto err;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index eb55102b0d7..eb13a17fa94 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -57,6 +57,7 @@
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/md.h"
#include "storage/proc.h"
#include "storage/read_stream.h"
#include "storage/smgr.h"
@@ -7401,6 +7402,14 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8 buf_off, Buffer buffer,
VALGRIND_MAKE_MEM_DEFINED(bufdata, BLCKSZ);
#endif
+ /* Decrypt block before checksum verification */
+ if (mdread_post_hook)
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&tag);
+
+ mdread_post_hook(&rlocator, tag.forkNum, tag.blockNum, &bufdata, 1);
+ }
+
if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags,
failed_checksum))
{
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 71bcdeb6601..5416128d2cc 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -96,6 +96,14 @@ typedef struct _MdfdVec
static MemoryContext MdCxt; /* context for all MdfdVec objects */
+/*
+ * Hook variables for I/O transformation (e.g., encryption/decryption).
+ * Extensions can set these hooks to transform data during storage I/O.
+ */
+mdread_post_hook_type mdread_post_hook = NULL;
+mdwrite_pre_hook_type mdwrite_pre_hook = NULL;
+mdextend_pre_hook_type mdextend_pre_hook = NULL;
+
/* Populate a file tag describing an md.c segment file. */
#define INIT_MD_FILETAG(a,xx_rlocator,xx_forknum,xx_segno) \
@@ -513,6 +521,10 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
relpath(reln->smgr_rlocator, forknum).str,
InvalidBlockNumber)));
+ /* Pre-extend hook for transformation (e.g., encryption) */
+ if (mdextend_pre_hook)
+ buffer = mdextend_pre_hook(&reln->smgr_rlocator.locator, forknum, blocknum, buffer);
+
v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -972,6 +984,10 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
iovcnt = compute_remaining_iovec(iov, iov, iovcnt, nbytes);
}
+ /* Post-read hook for transformation (e.g., decryption) */
+ if (mdread_post_hook)
+ mdread_post_hook(&reln->smgr_rlocator.locator, forknum, blocknum, buffers, nblocks_this_segment);
+
nblocks -= nblocks_this_segment;
buffers += nblocks_this_segment;
blocknum += nblocks_this_segment;
@@ -1064,6 +1080,10 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert((uint64) blocknum + (uint64) nblocks <= (uint64) mdnblocks(reln, forknum));
#endif
+ /* Pre-write hook for transformation (e.g., encryption) */
+ if (mdwrite_pre_hook)
+ buffers = mdwrite_pre_hook(&reln->smgr_rlocator.locator, forknum, blocknum, buffers, nblocks);
+
while (nblocks > 0)
{
struct iovec iov[PG_IOV_MAX];
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index d6a71415d4f..cc54459ad33 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -19,6 +19,26 @@
#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+/* Forward declaration for XLogRecData */
+struct XLogRecData;
+
+/*
+ * Hook function type for WAL insert transformation (e.g., encryption).
+ * Called after XLogRecordAssemble() but before XLogInsertRecord().
+ * Extension can transform the assembled WAL record data for encryption.
+ * Returns the (possibly modified) XLogRecData chain to be inserted.
+ *
+ * The first node's data points to XLogRecord header, which contains
+ * xl_rmid and xl_info if needed by the hook.
+ *
+ * On failure, the hook should either PANIC or return the original rdata
+ * as fallback.
+ */
+typedef struct XLogRecData *(*xlog_insert_pre_hook_type) (struct XLogRecData *rdata);
+
+/* Hook variable for WAL insert transformation */
+extern PGDLLIMPORT xlog_insert_pre_hook_type xlog_insert_pre_hook;
+
/*
* The minimum size of the WAL construction working area. If you need to
* register more than XLR_NORMAL_MAX_BLOCK_ID block references or have more
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index dfabbbd57d4..898d52a1013 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -400,6 +400,26 @@ extern bool DecodeXLogRecord(XLogReaderState *state,
XLogRecPtr lsn,
char **errormsg);
+/*
+ * Hook function type for WAL record transformation (e.g., decryption).
+ * Called before ValidXLogRecord() and DecodeXLogRecord().
+ * Extension can decrypt or transform the raw record data.
+ * Returns the (possibly modified) XLogRecord to be validated and decoded.
+ *
+ * If inplace_allowed is true, the hook may modify the record in place.
+ * If false, the hook must allocate a new buffer and return it.
+ *
+ * On failure, the hook should either PANIC or return the original record
+ * as fallback.
+ */
+typedef XLogRecord *(*xlog_decode_pre_hook_type) (XLogReaderState *state,
+ XLogRecord *record,
+ XLogRecPtr lsn,
+ bool inplace_allowed);
+
+/* Hook variable for WAL record transformation */
+extern PGDLLIMPORT xlog_decode_pre_hook_type xlog_decode_pre_hook;
+
/*
* Macros that provide access to parts of the record most recently returned by
* XLogReadRecord() or XLogNextRecord().
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index a06833ce0a3..9cfb2aff5ae 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -244,5 +244,10 @@ typedef struct XLogRecordDataHeaderLong
#define XLR_BLOCK_ID_DATA_LONG 254
#define XLR_BLOCK_ID_ORIGIN 253
#define XLR_BLOCK_ID_TOPLEVEL_XID 252
+/*
+ * I/O transform hook marker. Uses same header format as XLogRecordDataHeaderLong
+ * (1 byte id + 4 bytes length). Use SizeOfXLogRecordDataHeaderLong for size.
+ */
+#define XLR_BLOCK_ID_TRANSFORMED 251
#endif /* XLOGRECORD_H */
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index abc2cf2a020..f18f77d3d22 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -189,7 +189,17 @@ typedef PageHeaderData *PageHeader;
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+/*
+ * Transform ID field (5 bits: values 0-31) for I/O transform extensions.
+ * Value 0 means the page is not transformed (backward compatible).
+ * Values 1-31 are available for extensions to define their own meanings
+ * (e.g., encryption key versions, algorithm identifiers, migration markers).
+ */
+#define PD_TRANSFORM_ID_MASK 0x00F8 /* bits 3-7 */
+#define PD_TRANSFORM_ID_SHIFT 3
+#define PD_TRANSFORM_NONE 0 /* not transformed (core reserved) */
+
+#define PD_VALID_FLAG_BITS 0x00FF /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -441,6 +451,19 @@ PageClearAllVisible(Page page)
((PageHeader) page)->pd_flags &= ~PD_ALL_VISIBLE;
}
+static inline uint8
+PageGetTransformId(const PageData *page)
+{
+ return (((const PageHeaderData *) page)->pd_flags & PD_TRANSFORM_ID_MASK) >> PD_TRANSFORM_ID_SHIFT;
+}
+static inline void
+PageSetTransformId(Page page, uint8 id)
+{
+ ((PageHeader) page)->pd_flags =
+ (((PageHeader) page)->pd_flags & ~PD_TRANSFORM_ID_MASK) |
+ ((id << PD_TRANSFORM_ID_SHIFT) & PD_TRANSFORM_ID_MASK);
+}
+
/*
* These two require "access/transam.h", so left as macros.
*/
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index b563c27abf0..0a766a2b61f 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -22,6 +22,71 @@
extern PGDLLIMPORT const PgAioHandleCallbacks aio_md_readv_cb;
+/*
+ * Hook function types for I/O transformation (e.g., encryption/decryption).
+ * These hooks allow extensions to transform data during storage I/O operations.
+ */
+
+/*
+ * Called after blocks are read from disk, before PostgreSQL's checksum verification.
+ * Extension can reverse-transform (e.g., decrypt) the data in place.
+ *
+ * For synchronous reads, called from mdreadv() after read completes.
+ * For AIO reads, called from buffer_readv_complete_one() before PageIsVerified().
+ *
+ * Note: The hook is responsible for verifying on-disk checksum before reverse
+ * transformation and recalculating checksum after transformation. This ensures
+ * data integrity is verified at both stages and PostgreSQL's checksum verification
+ * passes.
+ *
+ * On failure, the hook should raise an ERROR (or PANIC for critical errors).
+ */
+typedef void (*mdread_post_hook_type) (RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers,
+ BlockNumber nblocks);
+
+/*
+ * Called before mdwritev() writes blocks to disk.
+ * Extension can transform (e.g., encrypt) data.
+ * Returns pointer to transformed buffers array (hook manages the memory,
+ * typically using static local storage).
+ *
+ * Note: The hook should recalculate checksum on transformed data after
+ * transformation. This on-disk checksum will be verified on read before
+ * reverse transformation, ensuring disk-level data integrity.
+ *
+ * On failure, the hook should raise an ERROR (or PANIC for critical errors),
+ * or return the original buffers with a WARNING as fallback.
+ */
+typedef const void **(*mdwrite_pre_hook_type) (RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers,
+ BlockNumber nblocks);
+
+/*
+ * Called before mdextend() extends a relation with new blocks.
+ * Returns pointer to transformed buffer (hook manages the memory,
+ * typically using static local storage).
+ *
+ * Note: Same as write hook - the hook should recalculate checksum on
+ * transformed data after transformation.
+ *
+ * On failure, the hook should raise an ERROR (or PANIC for critical errors),
+ * or return the original buffer with a WARNING as fallback.
+ */
+typedef const void *(*mdextend_pre_hook_type) (RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ const void *buffer);
+
+/* Hook variables for I/O transformation */
+extern PGDLLIMPORT mdread_post_hook_type mdread_post_hook;
+extern PGDLLIMPORT mdwrite_pre_hook_type mdwrite_pre_hook;
+extern PGDLLIMPORT mdextend_pre_hook_type mdextend_pre_hook;
+
/* md storage manager functionality */
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
--
2.50.1 (Apple Git-155)
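The `inplace_allowed` contract documented for xlog_decode_pre_hook_type (modify in place when true, return a separate buffer when false) can be modeled as below. `ToyRecord`, `decode_copy` and `toy_decode_pre_hook` are hypothetical: a length-prefixed struct stands in for XLogRecord and an XOR stands in for decryption.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy record standing in for XLogRecord: total length, then body bytes. */
typedef struct ToyRecord
{
    uint32_t    tot_len;        /* header included */
    uint8_t     body[60];
} ToyRecord;

/* Extension-owned copy used when the caller's buffer must stay untouched. */
static ToyRecord decode_copy;

static ToyRecord *
toy_decode_pre_hook(ToyRecord *record, bool inplace_allowed, uint8_t key)
{
    ToyRecord  *out = record;
    uint32_t    body_len;

    /* Malformed length: hand the record back unchanged so validation fails. */
    if (record->tot_len < sizeof(record->tot_len) ||
        record->tot_len - sizeof(record->tot_len) > sizeof(record->body))
        return record;
    body_len = record->tot_len - (uint32_t) sizeof(record->tot_len);

    if (!inplace_allowed)
    {
        /* e.g. record points into readBuf, which may be written back to WAL */
        memcpy(&decode_copy, record, sizeof(ToyRecord));
        out = &decode_copy;
    }
    for (uint32_t i = 0; i < body_len; i++)
        out->body[i] ^= key;    /* toy reverse transform */
    return out;
}
```

With `inplace_allowed` false the caller's bytes stay encrypted, which is what the xlogreader.c comment requires for records read directly from readBuf.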
v20251228-0002-Add-test_tde-extension-for-TDE-testing.patch (application/x-patch)
From f7837456638f37c0555f821822f7a5d113a68cce Mon Sep 17 00:00:00 2001
From: Henson Choi <assam258@gmail.com>
Date: Tue, 2 Dec 2025 21:51:13 +0900
Subject: [PATCH] Add test_tde extension for TDE testing
This extension provides a reference implementation for validating the
Storage I/O Transform Hooks introduced in the previous commit. It uses
AES-256-CTR encryption with IV derived from page metadata (LSN, block
number, relation file number) to ensure uniqueness.
The extension registers hooks for:
- Buffer page read/write transformation (mdread/mdwrite/mdextend)
- WAL record insert and replay transformation
Key features:
- Encryption key configured via test_tde.key GUC (256-bit hex)
- System catalogs and pg_global tablespace excluded from encryption
- Pre-allocated cipher context to avoid allocation in critical sections
- WAL records marked with block ID 251 for encrypted record detection
This is intended for development and testing purposes only, not for
production use. The implementation lacks key rotation, proper key
management, and security auditing.
Author: Henson Choi <assam258@gmail.com>
---
contrib/Makefile | 4 +-
contrib/test_tde/.gitignore | 3 +
contrib/test_tde/Makefile | 27 +
contrib/test_tde/expected/basic.out | 177 +++++
contrib/test_tde/sql/basic.sql | 146 ++++
contrib/test_tde/test_tde.c | 1131 +++++++++++++++++++++++++++
contrib/test_tde/test_tde.conf | 2 +
7 files changed, 1488 insertions(+), 2 deletions(-)
create mode 100644 contrib/test_tde/.gitignore
create mode 100644 contrib/test_tde/Makefile
create mode 100644 contrib/test_tde/expected/basic.out
create mode 100644 contrib/test_tde/sql/basic.sql
create mode 100644 contrib/test_tde/test_tde.c
create mode 100644 contrib/test_tde/test_tde.conf
diff --git a/contrib/Makefile b/contrib/Makefile
index 2f0a88d3f77..151eb823850 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -54,9 +54,9 @@ SUBDIRS = \
vacuumlo
ifeq ($(with_ssl),openssl)
-SUBDIRS += pgcrypto sslinfo
+SUBDIRS += pgcrypto sslinfo test_tde
else
-ALWAYS_SUBDIRS += pgcrypto sslinfo
+ALWAYS_SUBDIRS += pgcrypto sslinfo test_tde
endif
ifneq ($(with_uuid),no)
diff --git a/contrib/test_tde/.gitignore b/contrib/test_tde/.gitignore
new file mode 100644
index 00000000000..2ea3752951a
--- /dev/null
+++ b/contrib/test_tde/.gitignore
@@ -0,0 +1,3 @@
+log
+results
+tmp_check
diff --git a/contrib/test_tde/Makefile b/contrib/test_tde/Makefile
new file mode 100644
index 00000000000..b2455d3831e
--- /dev/null
+++ b/contrib/test_tde/Makefile
@@ -0,0 +1,27 @@
+# contrib/test_tde/Makefile
+
+MODULE_big = test_tde
+OBJS = \
+ $(WIN32RES) \
+ test_tde.o
+
+PGFILEDESC = "test_tde - reference implementation for I/O transform hooks"
+
+REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_tde/test_tde.conf
+REGRESS = basic
+# Disabled because these tests require "shared_preload_libraries=test_tde"
+NO_INSTALLCHECK = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_tde
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# OpenSSL is required for encryption
+SHLIB_LINK += $(filter -lcrypto, $(LIBS))
diff --git a/contrib/test_tde/expected/basic.out b/contrib/test_tde/expected/basic.out
new file mode 100644
index 00000000000..9932cf43614
--- /dev/null
+++ b/contrib/test_tde/expected/basic.out
@@ -0,0 +1,177 @@
+-- Basic test for test_tde extension
+-- Verify that encryption/decryption works correctly
+-- Show current settings
+SHOW test_tde.key;
+ test_tde.key
+------------------------------------------------------------------
+ 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
+(1 row)
+
+-- Create a test table
+CREATE TABLE test_encrypt (
+ id serial PRIMARY KEY,
+ secret_data text,
+ secret_number integer
+);
+-- Insert some data
+INSERT INTO test_encrypt (secret_data, secret_number) VALUES
+ ('This is secret data', 12345),
+ ('Another secret message', 67890),
+ ('PostgreSQL TDE test', 11111);
+-- Force a checkpoint to ensure data is written to disk
+CHECKPOINT;
+-- Read data back - should be decrypted correctly
+SELECT * FROM test_encrypt ORDER BY id;
+ id | secret_data | secret_number
+----+------------------------+---------------
+ 1 | This is secret data | 12345
+ 2 | Another secret message | 67890
+ 3 | PostgreSQL TDE test | 11111
+(3 rows)
+
+-- Update some data
+UPDATE test_encrypt SET secret_data = 'Updated secret' WHERE id = 1;
+-- Verify update worked
+SELECT * FROM test_encrypt WHERE id = 1;
+ id | secret_data | secret_number
+----+----------------+---------------
+ 1 | Updated secret | 12345
+(1 row)
+
+-- Test with larger data
+INSERT INTO test_encrypt (secret_data, secret_number)
+SELECT
+ repeat('Large data block ', 100),
+ generate_series
+FROM generate_series(1, 10);
+-- Count rows
+SELECT COUNT(*) FROM test_encrypt;
+ count
+-------
+ 13
+(1 row)
+
+-- Test with NULL values
+INSERT INTO test_encrypt (secret_data, secret_number) VALUES (NULL, NULL);
+SELECT * FROM test_encrypt WHERE secret_data IS NULL;
+ id | secret_data | secret_number
+----+-------------+---------------
+ 14 | |
+(1 row)
+
+-- Test index creation (index pages should also be encrypted)
+CREATE INDEX ON test_encrypt (secret_number);
+-- Use the index
+SELECT secret_data FROM test_encrypt WHERE secret_number = 12345;
+ secret_data
+----------------
+ Updated secret
+(1 row)
+
+-- Clean up
+DROP TABLE test_encrypt;
+-- =============================================================================
+-- DDL Tests: Operations that change RelFileNumber
+-- These operations create new files and write records through storage hooks,
+-- so encryption/decryption works correctly.
+-- =============================================================================
+-- -----------------------------------------------------------------------------
+-- Test 1: TRUNCATE (creates new file, writes through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_truncate (id int, data text);
+INSERT INTO test_truncate VALUES (1, 'before truncate');
+SELECT * FROM test_truncate;
+ id | data
+----+-----------------
+ 1 | before truncate
+(1 row)
+
+TRUNCATE test_truncate;
+-- Insert new data after truncate - works fine (new file, new encryption through hooks)
+INSERT INTO test_truncate VALUES (2, 'after truncate');
+SELECT * FROM test_truncate;
+ id | data
+----+----------------
+ 2 | after truncate
+(1 row)
+
+DROP TABLE test_truncate;
+-- -----------------------------------------------------------------------------
+-- Test 2: CLUSTER (rewrites table through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_cluster (id int PRIMARY KEY, data text);
+INSERT INTO test_cluster SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+CHECKPOINT;
+CLUSTER test_cluster USING test_cluster_pkey;
+-- Works fine - data rewritten through storage hooks
+SELECT COUNT(*) FROM test_cluster;
+ count
+-------
+ 100
+(1 row)
+
+SELECT * FROM test_cluster WHERE id = 50;
+ id | data
+----+---------
+ 50 | data 50
+(1 row)
+
+DROP TABLE test_cluster;
+-- -----------------------------------------------------------------------------
+-- Test 3: VACUUM FULL (rewrites table through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_vacuum_full (id int, data text);
+INSERT INTO test_vacuum_full SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+DELETE FROM test_vacuum_full WHERE id > 50;
+CHECKPOINT;
+VACUUM FULL test_vacuum_full;
+-- Works fine - data rewritten through storage hooks
+SELECT COUNT(*) FROM test_vacuum_full;
+ count
+-------
+ 50
+(1 row)
+
+DROP TABLE test_vacuum_full;
+-- -----------------------------------------------------------------------------
+-- Test 4: REINDEX (rebuilds index through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_reindex (id int PRIMARY KEY, data text);
+INSERT INTO test_reindex SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+CHECKPOINT;
+REINDEX INDEX test_reindex_pkey;
+-- Works fine - index rebuilt through storage hooks
+SET enable_seqscan = off;
+SELECT * FROM test_reindex WHERE id = 50;
+ id | data
+----+---------
+ 50 | data 50
+(1 row)
+
+RESET enable_seqscan;
+DROP TABLE test_reindex;
+-- =============================================================================
+-- Additional DDL Tests: Operations that change RelFileNumber or copy files
+-- These also go through storage hooks, so encryption/decryption works correctly.
+-- =============================================================================
+-- -----------------------------------------------------------------------------
+-- Test 5: ALTER TABLE SET TABLESPACE
+-- RelFileNumber changes, but data is copied through storage hooks
+-- -----------------------------------------------------------------------------
+\! mkdir -p /tmp/test_tde_tablespace
+CREATE TABLESPACE test_tde_tblspc LOCATION '/tmp/test_tde_tablespace';
+CREATE TABLE test_set_tablespace (id int, data text);
+INSERT INTO test_set_tablespace SELECT g, 'data ' || g FROM generate_series(1, 50) g;
+CHECKPOINT;
+-- Move to different tablespace - data copied through storage hooks
+ALTER TABLE test_set_tablespace SET TABLESPACE test_tde_tblspc;
+-- Works fine - data was re-encrypted with new RelFileNumber
+SELECT COUNT(*) FROM test_set_tablespace;
+ count
+-------
+ 50
+(1 row)
+
+DROP TABLE test_set_tablespace;
+DROP TABLESPACE test_tde_tblspc;
+\! rm -rf /tmp/test_tde_tablespace
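The DDL tests above (TRUNCATE, CLUSTER, VACUUM FULL, REINDEX, SET TABLESPACE) pass because each rewrite path writes pages through the storage hooks, which derive a fresh IV for the new RelFileNumber. The sketch below is a minimal, standalone illustration of the IV byte layout documented for `derive_iv()` in test_tde.c. The names `sketch_derive_iv` and `sketch_same_iv`, and the plain integer parameters used in place of `RelFileLocator` and `XLogRecPtr`, are hypothetical simplifications, and no cipher is involved. It shows why a file copied to a new RelFileNumber without passing through the hooks could not be decrypted: IV bytes 9-11 and 13 change, so the AES-CTR keystream changes too.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for test_tde's derive_iv(): plain integer types
 * replace RelFileLocator and XLogRecPtr so the sketch compiles on its own.
 */
static void
sketch_derive_iv(uint8_t iv[16], uint32_t rel_number,
                 uint32_t blocknum, uint64_t lsn)
{
    /* [0-3] LSN low 32 bits, [4-5] LSN bits 32-47 */
    for (int i = 0; i < 6; i++)
        iv[i] = (uint8_t) (lsn >> (8 * i));
    /* [6-8] BlockNumber low 24 bits */
    for (int i = 0; i < 3; i++)
        iv[6 + i] = (uint8_t) (blocknum >> (8 * i));
    /* [9-11] RelFileNumber low 24 bits */
    for (int i = 0; i < 3; i++)
        iv[9 + i] = (uint8_t) (rel_number >> (8 * i));
    iv[12] = (uint8_t) (blocknum >> 24);    /* BlockNumber high 8 bits */
    iv[13] = (uint8_t) (rel_number >> 24);  /* RelFileNumber high 8 bits */
    iv[14] = (uint8_t) (lsn >> 48);         /* LSN bits 48-55 */
    iv[15] = (uint8_t) (lsn >> 56);         /* LSN bits 56-63 */
}

/* Returns nonzero when two RelFileNumbers yield the same IV for one block/LSN. */
static int
sketch_same_iv(uint32_t rel_a, uint32_t rel_b, uint32_t blk, uint64_t lsn)
{
    uint8_t a[16];
    uint8_t b[16];

    sketch_derive_iv(a, rel_a, blk, lsn);
    sketch_derive_iv(b, rel_b, blk, lsn);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

For example, `sketch_same_iv(16384, 16385, 7, lsn)` returns 0 for any `lsn`, because only the RelFileNumber bytes differ.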
diff --git a/contrib/test_tde/sql/basic.sql b/contrib/test_tde/sql/basic.sql
new file mode 100644
index 00000000000..9b2651afee8
--- /dev/null
+++ b/contrib/test_tde/sql/basic.sql
@@ -0,0 +1,146 @@
+-- Basic test for test_tde extension
+-- Verify that encryption/decryption works correctly
+
+-- Show current settings
+SHOW test_tde.key;
+
+-- Create a test table
+CREATE TABLE test_encrypt (
+ id serial PRIMARY KEY,
+ secret_data text,
+ secret_number integer
+);
+
+-- Insert some data
+INSERT INTO test_encrypt (secret_data, secret_number) VALUES
+ ('This is secret data', 12345),
+ ('Another secret message', 67890),
+ ('PostgreSQL TDE test', 11111);
+
+-- Force a checkpoint to ensure data is written to disk
+CHECKPOINT;
+
+-- Read data back - should be decrypted correctly
+SELECT * FROM test_encrypt ORDER BY id;
+
+-- Update some data
+UPDATE test_encrypt SET secret_data = 'Updated secret' WHERE id = 1;
+
+-- Verify update worked
+SELECT * FROM test_encrypt WHERE id = 1;
+
+-- Test with larger data
+INSERT INTO test_encrypt (secret_data, secret_number)
+SELECT
+ repeat('Large data block ', 100),
+ generate_series
+FROM generate_series(1, 10);
+
+-- Count rows
+SELECT COUNT(*) FROM test_encrypt;
+
+-- Test with NULL values
+INSERT INTO test_encrypt (secret_data, secret_number) VALUES (NULL, NULL);
+SELECT * FROM test_encrypt WHERE secret_data IS NULL;
+
+-- Test index creation (index pages should also be encrypted)
+CREATE INDEX ON test_encrypt (secret_number);
+
+-- Use the index
+SELECT secret_data FROM test_encrypt WHERE secret_number = 12345;
+
+-- Clean up
+DROP TABLE test_encrypt;
+
+-- =============================================================================
+-- DDL Tests: Operations that change RelFileNumber
+-- These operations create new files and write records through storage hooks,
+-- so encryption/decryption works correctly.
+-- =============================================================================
+
+-- -----------------------------------------------------------------------------
+-- Test 1: TRUNCATE (creates new file, writes through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_truncate (id int, data text);
+INSERT INTO test_truncate VALUES (1, 'before truncate');
+SELECT * FROM test_truncate;
+
+TRUNCATE test_truncate;
+
+-- Insert new data after truncate - works fine (new file, new encryption through hooks)
+INSERT INTO test_truncate VALUES (2, 'after truncate');
+SELECT * FROM test_truncate;
+
+DROP TABLE test_truncate;
+
+-- -----------------------------------------------------------------------------
+-- Test 2: CLUSTER (rewrites table through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_cluster (id int PRIMARY KEY, data text);
+INSERT INTO test_cluster SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+CHECKPOINT;
+
+CLUSTER test_cluster USING test_cluster_pkey;
+
+-- Works fine - data rewritten through storage hooks
+SELECT COUNT(*) FROM test_cluster;
+SELECT * FROM test_cluster WHERE id = 50;
+
+DROP TABLE test_cluster;
+
+-- -----------------------------------------------------------------------------
+-- Test 3: VACUUM FULL (rewrites table through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_vacuum_full (id int, data text);
+INSERT INTO test_vacuum_full SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+DELETE FROM test_vacuum_full WHERE id > 50;
+CHECKPOINT;
+
+VACUUM FULL test_vacuum_full;
+
+-- Works fine - data rewritten through storage hooks
+SELECT COUNT(*) FROM test_vacuum_full;
+
+DROP TABLE test_vacuum_full;
+
+-- -----------------------------------------------------------------------------
+-- Test 4: REINDEX (rebuilds index through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_reindex (id int PRIMARY KEY, data text);
+INSERT INTO test_reindex SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+CHECKPOINT;
+
+REINDEX INDEX test_reindex_pkey;
+
+-- Works fine - index rebuilt through storage hooks
+SET enable_seqscan = off;
+SELECT * FROM test_reindex WHERE id = 50;
+RESET enable_seqscan;
+
+DROP TABLE test_reindex;
+
+-- =============================================================================
+-- Additional DDL Tests: Operations that change RelFileNumber or copy files
+-- These also go through storage hooks, so encryption/decryption works correctly.
+-- =============================================================================
+
+-- -----------------------------------------------------------------------------
+-- Test 5: ALTER TABLE SET TABLESPACE
+-- RelFileNumber changes, but data is copied through storage hooks
+-- -----------------------------------------------------------------------------
+\! mkdir -p /tmp/test_tde_tablespace
+CREATE TABLESPACE test_tde_tblspc LOCATION '/tmp/test_tde_tablespace';
+
+CREATE TABLE test_set_tablespace (id int, data text);
+INSERT INTO test_set_tablespace SELECT g, 'data ' || g FROM generate_series(1, 50) g;
+CHECKPOINT;
+
+-- Move to different tablespace - data copied through storage hooks
+ALTER TABLE test_set_tablespace SET TABLESPACE test_tde_tblspc;
+
+-- Works fine - data was re-encrypted with new RelFileNumber
+SELECT COUNT(*) FROM test_set_tablespace;
+
+DROP TABLE test_set_tablespace;
+DROP TABLESPACE test_tde_tblspc;
+\! rm -rf /tmp/test_tde_tablespace
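The test_tde.c module that follows wraps each encrypted WAL payload in a transform header: a 1-byte `XLR_BLOCK_ID_TRANSFORMED` marker (251), a 4-byte little-endian `payload_length` covering the IV plus ciphertext, and a 16-byte IV whose first 4 bytes hold the plaintext CRC. Marker plus length is `SizeOfXLogRecordDataHeaderLong` (5 bytes), which together with the IV gives the 21-byte overhead. The following is a minimal sketch of building the IV and packing/parsing that header with plain C types, assuming the layout described in the patch comments. The `sketch_*` functions and `SKETCH_*` constants are hypothetical and do not correspond to any PostgreSQL API, and the payload itself is not encrypted here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLOCK_ID_TRANSFORMED 251                  /* value per the patch comment */
#define SKETCH_IV_SIZE 16
#define SKETCH_OVERHEAD (1 + 4 + SKETCH_IV_SIZE)         /* marker + length + IV = 21 */

/* IV = [plaintext CRC, little-endian (4B)][random (12B)] */
static void
sketch_build_iv(uint8_t iv[SKETCH_IV_SIZE], uint32_t plaintext_crc,
                const uint8_t random12[SKETCH_IV_SIZE - 4])
{
    for (int i = 0; i < 4; i++)
        iv[i] = (uint8_t) (plaintext_crc >> (8 * i));
    memcpy(iv + 4, random12, SKETCH_IV_SIZE - 4);
}

/*
 * Write [marker][payload_length LE][IV] into out; returns bytes written.
 * payload_length counts the IV plus the (encrypted) payload.
 */
static size_t
sketch_pack_header(uint8_t *out, const uint8_t iv[SKETCH_IV_SIZE],
                   uint32_t orig_payload_len)
{
    uint32_t len = SKETCH_IV_SIZE + orig_payload_len;

    out[0] = SKETCH_BLOCK_ID_TRANSFORMED;
    for (int i = 0; i < 4; i++)
        out[1 + i] = (uint8_t) (len >> (8 * i));
    memcpy(out + 5, iv, SKETCH_IV_SIZE);
    return SKETCH_OVERHEAD;
}

/*
 * Parse the header; returns the original payload length, or -1 if the
 * marker is missing or the length field is too small to hold an IV.
 */
static int64_t
sketch_parse_header(const uint8_t *in, uint8_t iv_out[SKETCH_IV_SIZE])
{
    uint32_t len = 0;

    if (in[0] != SKETCH_BLOCK_ID_TRANSFORMED)
        return -1;
    for (int i = 0; i < 4; i++)
        len |= (uint32_t) in[1 + i] << (8 * i);
    if (len < SKETCH_IV_SIZE)
        return -1;
    memcpy(iv_out, in + 5, SKETCH_IV_SIZE);
    return (int64_t) (len - SKETCH_IV_SIZE);
}
```

Rejecting a missing marker mirrors the patch's note that a record which is not decrypted fails to parse because of the unknown block ID.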
diff --git a/contrib/test_tde/test_tde.c b/contrib/test_tde/test_tde.c
new file mode 100644
index 00000000000..f70359f1c26
--- /dev/null
+++ b/contrib/test_tde/test_tde.c
@@ -0,0 +1,1131 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_tde.c
+ * Reference implementation for Storage I/O Transform Hooks
+ *
+ * WARNING: This is for TESTING ONLY. Do not use in production.
+ * - Key stored in plaintext GUC
+ * - No key rotation
+ * - Minimal error handling
+ * - Not audited for security
+ *
+ * For production TDE, use a dedicated extension project.
+ *
+ * This extension demonstrates how to use the storage I/O transform hooks
+ * for transparent data encryption. It uses AES-256-CTR for encryption
+ * with IV derived from page metadata and block location.
+ *
+ * Author: Henson Choi <assam258@gmail.com>
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/test_tde/test_tde.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <openssl/err.h>
+#include <openssl/evp.h>
+#include <string.h>
+
+#include "access/transam.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogreader.h"
+#include "access/xlogrecord.h"
+#include "catalog/pg_tablespace_d.h"
+#include "fmgr.h"
+#include "port/pg_crc32c.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
+#include "storage/md.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC_EXT(
+    .name = "test_tde",
+    .version = PG_VERSION
+);
+
+/* ----------
+ * GUC variables
+ * ----------
+ */
+static char *test_tde_key_hex = NULL; /* 64 hex chars = 256 bits */
+
+/* ----------
+ * Module state
+ * ----------
+ */
+
+/*
+ * Memory context for encryption buffers.
+ * Allows allocation in critical sections (for WAL encryption).
+ */
+static MemoryContext test_tde_cxt = NULL;
+
+/*
+ * Transform ID for this extension.
+ * Value 1 means page is encrypted with test_tde.
+ * Value 0 means page is not transformed (plaintext).
+ */
+#define TEST_TDE_TRANSFORM_ID 1
+
+/*
+ * Dynamic buffers for encrypted pages.
+ * Grows as needed, freed in _PG_fini.
+ */
+static char *encrypt_buffer = NULL;
+static const void **encrypt_buffer_ptrs = NULL;
+static BlockNumber encrypt_buffer_nblocks = 0;
+
+/*
+ * WAL encryption buffer - allocated from test_tde_cxt which allows
+ * allocation in critical sections via MemoryContextAllowInCriticalSection().
+ */
+static char *wal_encrypt_buffer = NULL;
+static Size wal_encrypt_buffer_size = 0;
+
+/*
+ * WAL decryption buffer - static, only needed for records within a single page.
+ * When inplace_allowed=false, the record doesn't cross a page boundary, so its
+ * maximum size is XLOG_BLCKSZ.
+ */
+static char wal_decrypt_buffer[XLOG_BLCKSZ];
+
+/*
+ * Pre-allocated OpenSSL cipher context.
+ * Created in _PG_init() and reused for all encrypt/decrypt operations.
+ * This avoids memory allocation in critical sections.
+ */
+static EVP_CIPHER_CTX *cipher_ctx = NULL;
+
+/*
+ * Transformed WAL record structure (using XLR_BLOCK_ID_TRANSFORMED from xlogrecord.h):
+ * [XLogRecord header]
+ * [block_id=251 (1B)]
+ * [payload_length (4B)]
+ * [IV (16B)]
+ * [encrypted payload]
+ *
+ * The block ID 251 marks this record as transformed. After decryption,
+ * the marker, length, and IV are removed, restoring the original structure.
+ * If decryption is not performed, the unknown block ID causes parse failure.
+ *
+ * Note: The 21-byte overhead may temporarily cause xl_tot_len to exceed
+ * XLogRecordMaxSize after encryption. This is safe because:
+ * - XLogRecordMaxSize is only checked in XLogRecordAssemble() before our hook
+ * - XLogInsertRecord() does not re-validate the size
+ * - The decode hook removes the overhead before WAL parsing, restoring the
+ * original size which was already validated
+ */
+#define WAL_ENCRYPT_IV_SIZE 16
+#define WAL_ENCRYPT_OVERHEAD (SizeOfXLogRecordDataHeaderLong + WAL_ENCRYPT_IV_SIZE)
+#define WAL_CRC_SIZE sizeof(pg_crc32c) /* 4 bytes */
+#define WAL_IV_RANDOM_SIZE (WAL_ENCRYPT_IV_SIZE - WAL_CRC_SIZE) /* 12 bytes */
+
+/* Static XLogRecData for returning encrypted WAL */
+static XLogRecData wal_rdata_head;
+
+/* Previous hook values (for chaining) */
+static mdread_post_hook_type prev_mdread_post_hook = NULL;
+static mdwrite_pre_hook_type prev_mdwrite_pre_hook = NULL;
+static mdextend_pre_hook_type prev_mdextend_pre_hook = NULL;
+static xlog_insert_pre_hook_type prev_xlog_insert_pre_hook = NULL;
+static xlog_decode_pre_hook_type prev_xlog_decode_pre_hook = NULL;
+
+/* ----------
+ * Function declarations
+ * ----------
+ */
+
+/* Module entry points */
+void _PG_init(void);
+void _PG_fini(void);
+
+/* GUC callbacks */
+static bool check_test_tde_key(char **newval, void **extra, GucSource source);
+
+/* Hook functions */
+static void test_tde_mdread_post(RelFileLocator *rlocator, ForkNumber forknum,
+                                 BlockNumber blocknum, void **buffers,
+                                 BlockNumber nblocks);
+static const void **test_tde_mdwrite_pre(RelFileLocator *rlocator,
+                                         ForkNumber forknum,
+                                         BlockNumber blocknum,
+                                         const void **buffers,
+                                         BlockNumber nblocks);
+static const void *test_tde_mdextend_pre(RelFileLocator *rlocator,
+                                         ForkNumber forknum,
+                                         BlockNumber blocknum,
+                                         const void *buffer);
+static struct XLogRecData *test_tde_xlog_insert_pre(struct XLogRecData *rdata);
+static XLogRecord *test_tde_xlog_decode_pre(XLogReaderState *state,
+                                            XLogRecord *record,
+                                            XLogRecPtr lsn,
+                                            bool inplace_allowed);
+
+/* Internal helper functions */
+static void ensure_encrypt_buffer(BlockNumber nblocks);
+static bool parse_hex_key(const char *hex, unsigned char *out, int outlen);
+static void derive_iv(unsigned char *iv, RelFileLocator *rlocator,
+                      BlockNumber blocknum, XLogRecPtr lsn);
+static void transform_data(const unsigned char *in, unsigned char *out,
+                           int len, const unsigned char *iv);
+static bool should_transform(RelFileLocator *rlocator, ForkNumber forknum);
+
+
+/* ----------
+ * Internal helper functions
+ * ----------
+ */
+
+/*
+ * Parse hex string to bytes
+ */
+static bool
+parse_hex_key(const char *hex, unsigned char *out, int outlen)
+{
+    int i;
+    int hexlen;
+
+    if (hex == NULL)
+        return false;
+
+    hexlen = strlen(hex);
+    if (hexlen != outlen * 2)
+        return false;
+
+    for (i = 0; i < outlen; i++)
+    {
+        int hi,
+            lo;
+        char c;
+
+        c = hex[i * 2];
+        if (c >= '0' && c <= '9')
+            hi = c - '0';
+        else if (c >= 'a' && c <= 'f')
+            hi = c - 'a' + 10;
+        else if (c >= 'A' && c <= 'F')
+            hi = c - 'A' + 10;
+        else
+            return false;
+
+        c = hex[i * 2 + 1];
+        if (c >= '0' && c <= '9')
+            lo = c - '0';
+        else if (c >= 'a' && c <= 'f')
+            lo = c - 'a' + 10;
+        else if (c >= 'A' && c <= 'F')
+            lo = c - 'A' + 10;
+        else
+            return false;
+
+        out[i] = (hi << 4) | lo;
+    }
+
+    return true;
+}
+
+/*
+ * Ensure encrypt buffer can hold 'nblocks' pages.
+ * Grows by 2x when needed. Uses test_tde_cxt for persistence.
+ */
+static void
+ensure_encrypt_buffer(BlockNumber nblocks)
+{
+    if (encrypt_buffer_nblocks >= nblocks)
+        return;
+
+    if (encrypt_buffer == NULL)
+    {
+        BlockNumber initial = Max(8, nblocks);
+        Size size = (Size) initial * BLCKSZ;
+
+        encrypt_buffer = MemoryContextAllocAligned(test_tde_cxt, size,
+                                                   PG_IO_ALIGN_SIZE, 0);
+        encrypt_buffer_ptrs = MemoryContextAlloc(test_tde_cxt,
+                                                 initial * sizeof(void *));
+        encrypt_buffer_nblocks = initial;
+    }
+    else
+    {
+        BlockNumber new_nblocks = encrypt_buffer_nblocks;
+        Size new_size;
+
+        while (new_nblocks < nblocks)
+            new_nblocks *= 2;
+
+        new_size = (Size) new_nblocks * BLCKSZ;
+
+        /* repalloc doesn't preserve alignment, so allocate new and copy */
+        {
+            char *new_buffer = MemoryContextAllocAligned(test_tde_cxt,
+                                                         new_size,
+                                                         PG_IO_ALIGN_SIZE, 0);
+
+            memcpy(new_buffer, encrypt_buffer,
+                   (Size) encrypt_buffer_nblocks * BLCKSZ);
+            pfree(encrypt_buffer);
+            encrypt_buffer = new_buffer;
+        }
+
+        encrypt_buffer_ptrs = repalloc(encrypt_buffer_ptrs,
+                                       new_nblocks * sizeof(void *));
+        encrypt_buffer_nblocks = new_nblocks;
+    }
+
+    /* Update pointers array */
+    for (BlockNumber i = 0; i < encrypt_buffer_nblocks; i++)
+        encrypt_buffer_ptrs[i] = encrypt_buffer + (Size) i * BLCKSZ;
+}
+
+
+/*
+ * Derive IV from page location and header
+ *
+ * IV structure (16 bytes) - simple, deterministic layout:
+ *
+ * AES-CTR mode only requires IV uniqueness, not randomness.
+ * The combination of LSN + RelFileNumber + BlockNumber guarantees uniqueness:
+ * - LSN: Globally unique across entire WAL stream
+ * - RelFileNumber: Unique within database
+ * - BlockNumber: Unique within relation
+ *
+ * Even when a single WAL record modifies multiple pages (e.g., B-tree split),
+ * the BlockNumber distinguishes each page.
+ *
+ * Layout (high entropy bytes first, low entropy bytes last for CTR counter space):
+ * [0-3] LSN low 32 bits - changes frequently (high entropy)
+ * [4-5] LSN bits 32-47 - mid entropy
+ * [6-8] BlockNumber low 24 bits
+ * [9-11] RelFileNumber low 24 bits
+ * [12] BlockNumber high 8 bits - usually 0 for small tables
+ * [13] RelFileNumber high 8 bits - usually 0
+ * [14-15] LSN bits 48-63 - usually 0, counter space for CTR
+ *
+ * CTR counter space analysis:
+ * - Page size: 8KB, encrypted area: 8168 bytes (excluding 24-byte header)
+ * - AES block size: 16 bytes
+ * - Keystream blocks per page: ceil(8168/16) = 511 (0x1FF)
+ * - Counter offset is at most 510 (0x1FE) < 0x10000, so it affects only IV[14-15]
+ * - Bytes 12-15 provide 2^32 counter space, far exceeding 511 needed
+ * - Collision requires same IV[0-11], which means same LSN+BlockNum+RelNum
+ *
+ * Note: spcOid, dbOid not used - RelFileNumber is sufficient for uniqueness.
+ *
+ * Known limitation: Operations that copy/move files while changing
+ * RelFileNumber without going through storage hooks cause decryption failure.
+ */
+static void
+derive_iv(unsigned char *iv, RelFileLocator *rlocator,
+          BlockNumber blocknum, XLogRecPtr lsn)
+{
+
+    /*
+     * Layout: High entropy first, low entropy (usually 0) last.
+     * [LSN low 4B][LSN mid 2B][BlockNum low 3B][RelNum low 3B]
+     * [BlockNum high 1B][RelNum high 1B][LSN high 2B]
+     */
+
+    /* LSN low 32 bits - bytes 0-3 (high entropy, changes frequently) */
+    iv[0] = (uint8) ((lsn >> 0) & 0xFF);
+    iv[1] = (uint8) ((lsn >> 8) & 0xFF);
+    iv[2] = (uint8) ((lsn >> 16) & 0xFF);
+    iv[3] = (uint8) ((lsn >> 24) & 0xFF);
+
+    /* LSN bits 32-47 - bytes 4-5 (mid entropy) */
+    iv[4] = (uint8) ((lsn >> 32) & 0xFF);
+    iv[5] = (uint8) ((lsn >> 40) & 0xFF);
+
+    /* BlockNumber low 24 bits - bytes 6-8 */
+    iv[6] = (uint8) ((blocknum >> 0) & 0xFF);
+    iv[7] = (uint8) ((blocknum >> 8) & 0xFF);
+    iv[8] = (uint8) ((blocknum >> 16) & 0xFF);
+
+    /* RelFileNumber low 24 bits - bytes 9-11 */
+    iv[9] = (uint8) ((rlocator->relNumber >> 0) & 0xFF);
+    iv[10] = (uint8) ((rlocator->relNumber >> 8) & 0xFF);
+    iv[11] = (uint8) ((rlocator->relNumber >> 16) & 0xFF);
+
+    /* BlockNumber high 8 bits - byte 12 (usually 0 for small tables) */
+    iv[12] = (uint8) ((blocknum >> 24) & 0xFF);
+
+    /* RelFileNumber high 8 bits - byte 13 (usually 0) */
+    iv[13] = (uint8) ((rlocator->relNumber >> 24) & 0xFF);
+
+    /* LSN bits 48-63 - bytes 14-15 (usually 0, counter space for CTR) */
+    iv[14] = (uint8) ((lsn >> 48) & 0xFF);
+    iv[15] = (uint8) ((lsn >> 56) & 0xFF);
+}
+
+/*
+ * Encrypt or decrypt data using AES-256-CTR
+ *
+ * AES-CTR is symmetric: encrypt and decrypt use the same operation.
+ */
+static void
+transform_data(const unsigned char *in, unsigned char *out, int len,
+               const unsigned char *iv)
+{
+    int outlen,
+        tmplen;
+
+    if (len <= 0)
+        return;
+
+    /*
+     * cipher_ctx is pre-allocated and initialized with cipher/key in _PG_init().
+     * Here we only set IV (cipher=NULL, key=NULL), which avoids internal
+     * memory allocation. This is critical for WAL encryption which runs
+     * inside critical sections. We use PANIC for all errors.
+     */
+    if (cipher_ctx == NULL)
+        ereport(PANIC,
+                (errcode(ERRCODE_INTERNAL_ERROR),
+                 errmsg("test_tde: cipher context not initialized")));
+
+    if (EVP_EncryptInit_ex(cipher_ctx, NULL, NULL, NULL, iv) != 1)
+        ereport(PANIC,
+                (errcode(ERRCODE_INTERNAL_ERROR),
+                 errmsg("test_tde: EVP_EncryptInit_ex failed: %s",
+                        ERR_error_string(ERR_get_error(), NULL))));
+
+    if (EVP_EncryptUpdate(cipher_ctx, out, &outlen, in, len) != 1)
+        ereport(PANIC,
+                (errcode(ERRCODE_INTERNAL_ERROR),
+                 errmsg("test_tde: EVP_EncryptUpdate failed: %s",
+                        ERR_error_string(ERR_get_error(), NULL))));
+
+    if (EVP_EncryptFinal_ex(cipher_ctx, out + outlen, &tmplen) != 1)
+        ereport(PANIC,
+                (errcode(ERRCODE_INTERNAL_ERROR),
+                 errmsg("test_tde: EVP_EncryptFinal_ex failed: %s",
+                        ERR_error_string(ERR_get_error(), NULL))));
+}
+
+/*
+ * Check if we should encrypt this relation (reads use the page transform ID)
+ *
+ * For this test implementation, we encrypt only user-created relations.
+ * A production implementation would check encryption policies.
+ */
+static bool
+should_transform(RelFileLocator *rlocator, ForkNumber forknum)
+{
+    /* Skip if cipher not initialized (key not configured) */
+    if (cipher_ctx == NULL)
+        return false;
+
+    /* Skip the global tablespace (pg_global), which holds shared catalogs */
+    if (rlocator->spcOid == GLOBALTABLESPACE_OID)
+        return false;
+
+    /*
+     * Skip RelFileNumbers below FirstNormalObjectId (system catalogs created
+     * by initdb). This ensures we don't try to encrypt/decrypt pre-existing
+     * system catalog pages that were created without encryption.
+     */
+    if (rlocator->relNumber < FirstNormalObjectId)
+        return false;
+
+    (void) forknum;             /* all forks are encrypted for user tables */
+
+    return true;
+}
+
+
+/* ----------
+ * Hook functions - Page I/O
+ * ----------
+ */
+
+/*
+ * Post-read hook: decrypt blocks after reading from disk
+ */
+static void
+test_tde_mdread_post(RelFileLocator *rlocator, ForkNumber forknum,
+                     BlockNumber blocknum, void **buffers,
+                     BlockNumber nblocks)
+{
+    BlockNumber i;
+    unsigned char iv[16];
+
+    /* Chain to previous hook if any */
+    if (prev_mdread_post_hook)
+        prev_mdread_post_hook(rlocator, forknum, blocknum, buffers, nblocks);
+
+    for (i = 0; i < nblocks; i++)
+    {
+        PageHeader phdr = (PageHeader) buffers[i];
+        uint16 checksum;
+        uint8 transform_id;
+
+        /* Skip empty/new pages */
+        if (PageIsNew((Page) buffers[i]))
+            continue;
+
+        /* Skip if page doesn't look valid */
+        if (phdr->pd_lower < SizeOfPageHeaderData ||
+            phdr->pd_lower > phdr->pd_upper ||
+            phdr->pd_upper > phdr->pd_special ||
+            phdr->pd_special > BLCKSZ)
+            continue;
+
+        /* Check transform ID - skip if page is not encrypted by us */
+        transform_id = PageGetTransformId((Page) buffers[i]);
+        if (transform_id == PD_TRANSFORM_NONE)
+            continue;           /* Page is not encrypted */
+
+        if (transform_id != TEST_TDE_TRANSFORM_ID)
+        {
+            elog(DEBUG1, "test_tde: skipping block %u with transform ID %u (not ours)",
+                 blocknum + i, transform_id);
+            continue;
+        }
+
+        /* Page is encrypted but cipher not initialized - fatal error */
+        if (cipher_ctx == NULL)
+            ereport(PANIC,
+                    (errcode(ERRCODE_INTERNAL_ERROR),
+                     errmsg("test_tde: encrypted page found but encryption key not configured"),
+                     errdetail("Block %u of relation %u/%u/%u fork %d has transform ID %u.",
+                               blocknum + i, rlocator->spcOid, rlocator->dbOid,
+                               rlocator->relNumber, forknum, transform_id)));
+
+        /* Verify checksum on encrypted data before decryption */
+        if (DataChecksumsEnabled())
+        {
+            checksum = pg_checksum_page((char *) buffers[i], blocknum + i);
+            if (checksum != phdr->pd_checksum)
+            {
+                ereport(WARNING,
+                        (errcode(ERRCODE_DATA_CORRUPTED),
+                         errmsg("page verification failed, calculated checksum %u but expected %u",
+                                checksum, phdr->pd_checksum)));
+            }
+        }
+
+        /* Derive IV using LSN from page header */
+        derive_iv(iv, rlocator, blocknum + i, PageGetLSN((Page) buffers[i]));
+
+        /* Decrypt data area in place (header stays unchanged) */
+        transform_data((unsigned char *) buffers[i] + SizeOfPageHeaderData,
+                       (unsigned char *) buffers[i] + SizeOfPageHeaderData,
+                       BLCKSZ - SizeOfPageHeaderData, iv);
+
+        /* Clear transform ID and recalculate checksum for plaintext data */
+        PageSetTransformId((Page) buffers[i], PD_TRANSFORM_NONE);
+        PageSetChecksumInplace((Page) buffers[i], blocknum + i);
+    }
+}
+
+/*
+ * Helper: encrypt a single page into the encrypt_buffer at given offset.
+ * Returns pointer to encrypted page, or original buffer if page was skipped.
+ */
+static const void *
+encrypt_page(RelFileLocator *rlocator, BlockNumber blocknum,
+             const void *buffer, Size buffer_offset)
+{
+    unsigned char iv[16];
+    PageHeader phdr = (PageHeader) buffer;
+    char *dest = encrypt_buffer + buffer_offset;
+
+    /* Skip empty/new pages */
+    if (PageIsNew((Page) buffer))
+        return buffer;
+
+    /* Skip if page doesn't look valid */
+    if (phdr->pd_lower < SizeOfPageHeaderData ||
+        phdr->pd_lower > phdr->pd_upper ||
+        phdr->pd_upper > phdr->pd_special ||
+        phdr->pd_special > BLCKSZ)
+        return buffer;
+
+    /* Derive IV using LSN from page header */
+    derive_iv(iv, rlocator, blocknum, PageGetLSN((Page) buffer));
+
+    /* Copy header, encrypt data area */
+    memcpy(dest, buffer, SizeOfPageHeaderData);
+    transform_data((unsigned char *) buffer + SizeOfPageHeaderData,
+                   (unsigned char *) dest + SizeOfPageHeaderData,
+                   BLCKSZ - SizeOfPageHeaderData, iv);
+
+    /* Set transform ID to mark page as encrypted */
+    PageSetTransformId((Page) dest, TEST_TDE_TRANSFORM_ID);
+
+    /* Recalculate checksum for encrypted data */
+    PageSetChecksumInplace((Page) dest, blocknum);
+
+    return dest;
+}
+
+/*
+ * Pre-write hook: encrypt blocks before writing to disk
+ */
+static const void **
+test_tde_mdwrite_pre(RelFileLocator *rlocator, ForkNumber forknum,
+                     BlockNumber blocknum, const void **buffers,
+                     BlockNumber nblocks)
+{
+    BlockNumber i;
+
+    /* Chain to previous hook if any */
+    if (prev_mdwrite_pre_hook)
+        buffers = prev_mdwrite_pre_hook(rlocator, forknum, blocknum, buffers, nblocks);
+
+    if (!should_transform(rlocator, forknum))
+        return buffers;
+
+    /* Ensure buffer is large enough */
+    ensure_encrypt_buffer(nblocks);
+
+    for (i = 0; i < nblocks; i++)
+        encrypt_buffer_ptrs[i] = encrypt_page(rlocator, blocknum + i,
+                                              buffers[i], (Size) i * BLCKSZ);
+
+    return encrypt_buffer_ptrs;
+}
+
+/*
+ * Pre-extend hook: encrypt block before extending relation
+ */
+static const void *
+test_tde_mdextend_pre(RelFileLocator *rlocator, ForkNumber forknum,
+                      BlockNumber blocknum, const void *buffer)
+{
+    /* Chain to previous hook if any */
+    if (prev_mdextend_pre_hook)
+        buffer = prev_mdextend_pre_hook(rlocator, forknum, blocknum, buffer);
+
+    if (!should_transform(rlocator, forknum))
+        return buffer;
+
+    /* Ensure buffer is large enough for at least 1 block */
+    ensure_encrypt_buffer(1);
+
+    return encrypt_page(rlocator, blocknum, buffer, 0);
+}
+
+
+/* ----------
+ * Hook functions - WAL I/O
+ * ----------
+ */
+
+/*
+ * Ensure WAL encryption buffer is large enough.
+ * Uses test_tde_cxt which allows allocation in critical sections.
+ */
+static void
+ensure_wal_encrypt_buffer(Size needed)
+{
+    if (wal_encrypt_buffer_size >= needed)
+        return;
+
+    if (wal_encrypt_buffer == NULL)
+        wal_encrypt_buffer = MemoryContextAlloc(test_tde_cxt, needed);
+    else
+        wal_encrypt_buffer = repalloc(wal_encrypt_buffer, needed);
+    wal_encrypt_buffer_size = needed;
+}
+
+/*
+ * WAL insert pre-hook: encrypt WAL record data
+ *
+ * Strategy:
+ * 1. Copy XLogRecord header and payload
+ * 2. Save plaintext CRC from header (xl_crc contains payload CRC at this point)
+ * 3. Build IV: [plaintext CRC (4B)] [random (12B)]
+ * 4. Insert transformation header (block ID 251 + payload_length) and IV
+ * 5. Encrypt original payload with the IV
+ * 6. Update xl_tot_len and recalculate CRC for encrypted payload
+ *
+ * Resulting record structure:
+ * [XLogRecord header]
+ * [block_id=251 (1B)]
+ * [payload_length (4B)]
+ * [IV 16B]
+ * [encrypted payload]
+ *
+ * The block ID 251 marks this record as encrypted. After decryption,
+ * the marker, length, and IV are removed, restoring the original structure.
+ * If decryption is not performed, the unknown block ID causes parse failure.
+ */
+static struct XLogRecData *
+test_tde_xlog_insert_pre(struct XLogRecData *rdata)
+{
+    XLogRecData *node;
+    XLogRecord *rechdr;
+    char *bufptr;
+    char *new_payload_start;
+    uint32 orig_total_len;
+    uint32 orig_payload_len;
+    uint32 new_total_len;
+    uint32 transform_payload_len;
+    unsigned char iv[WAL_ENCRYPT_IV_SIZE];
+    pg_crc32c plaintext_crc;
+
+    /* Chain to previous hook if any */
+    if (prev_xlog_insert_pre_hook)
+        rdata = prev_xlog_insert_pre_hook(rdata);
+
+    /* Skip if cipher not initialized (key not configured) */
+    if (cipher_ctx == NULL)
+        return rdata;
+
+    /* First node must contain XLogRecord header */
+    if (rdata == NULL || rdata->data == NULL || rdata->len < SizeOfXLogRecord)
+        return rdata;
+
+    rechdr = (XLogRecord *) rdata->data;
+    orig_total_len = rechdr->xl_tot_len;
+    orig_payload_len = orig_total_len - SizeOfXLogRecord;
+
+    /* Sanity check */
+    if (orig_total_len < SizeOfXLogRecord)
+        return rdata;
+
+    /*
+     * Skip records with no payload (e.g., XLOG_SWITCH). These are header-only
+     * records where adding encryption overhead would break size assertions.
+     */
+    if (orig_payload_len == 0)
+        return rdata;
+
+    new_total_len = orig_total_len + WAL_ENCRYPT_OVERHEAD;
+
+    /*
+     * Save plaintext CRC before we modify anything.
+     * At this point, xl_crc contains the CRC of the payload only
+     * (header CRC is added later by XLogInsertRecord).
+     */
+    plaintext_crc = rechdr->xl_crc;
+
+    /*
+     * Ensure buffer is large enough. test_tde_cxt allows allocation in
+     * critical sections, so this is safe even during WAL insertion.
+     * OOM here will cause PANIC, which is acceptable for critical sections.
+     */
+    ensure_wal_encrypt_buffer(new_total_len);
+
+    /*
+     * Build IV: [plaintext CRC (4B)] [random (12B)]
+     * Store CRC directly in IV[0..3] (little-endian).
+     */
+    iv[0] = ((uint32) plaintext_crc >> 0) & 0xFF;
+    iv[1] = ((uint32) plaintext_crc >> 8) & 0xFF;
+    iv[2] = ((uint32) plaintext_crc >> 16) & 0xFF;
+    iv[3] = ((uint32) plaintext_crc >> 24) & 0xFF;
+
+    /* Generate random bytes for IV[4..15] (12 bytes) for uniqueness */
+    if (!pg_strong_random(iv + WAL_CRC_SIZE, WAL_IV_RANDOM_SIZE))
+    {
+        ereport(WARNING,
+                (errmsg("test_tde: failed to generate random IV for WAL")));
+        return rdata;
+    }
+
+    /*
+     * Build encrypted record in buffer:
+     * [header][block_id][payload_length][IV][encrypted_payload]
+     */
+    bufptr = wal_encrypt_buffer;
+
+    /* 1. Copy header from first rdata node */
+    memcpy(bufptr, rdata->data, SizeOfXLogRecord);
+    bufptr += SizeOfXLogRecord;
+
+    /* 2. Insert transformation header (block ID 251 + payload_length) */
+    new_payload_start = bufptr;
+    *bufptr = (char) XLR_BLOCK_ID_TRANSFORMED;
+    bufptr += sizeof(uint8);
+
+    /* Calculate payload_length: IV + encrypted payload */
+    transform_payload_len = WAL_ENCRYPT_IV_SIZE + orig_payload_len;
+
+    /* Store payload_length (4 bytes, unaligned, little-endian) */
+    bufptr[0] = (char) ((transform_payload_len >> 0) & 0xFF);
+    bufptr[1] = (char) ((transform_payload_len >> 8) & 0xFF);
+    bufptr[2] = (char) ((transform_payload_len >> 16) & 0xFF);
+    bufptr[3] = (char) ((transform_payload_len >> 24) & 0xFF);
+    bufptr += sizeof(uint32);
+
+    /* 3. Insert IV (CRC in first 4 bytes, random in remaining 12) */
+    memcpy(bufptr, iv, WAL_ENCRYPT_IV_SIZE);
+    bufptr += WAL_ENCRYPT_IV_SIZE;
+
+    /* 4. Copy payload to buffer, then encrypt in-place */
+    if (orig_payload_len > 0)
+    {
+        Size first_node_payload;
+        char *encrypt_start = bufptr;
+
+        /* First node: skip header, copy remaining payload */
+        first_node_payload = rdata->len - SizeOfXLogRecord;
+        if (first_node_payload > 0)
+        {
+            memcpy(bufptr, (char *) rdata->data + SizeOfXLogRecord, first_node_payload);
+            bufptr += first_node_payload;
+        }
+
+        /* Remaining nodes: copy all data */
+        for (node = rdata->next; node != NULL; node = node->next)
+        {
+            if (node->len > 0 && node->data != NULL)
+            {
+                memcpy(bufptr, node->data, node->len);
+                bufptr += node->len;
+            }
+        }
+
+        /* Encrypt payload in-place */
+        transform_data((unsigned char *) encrypt_start,
+                       (unsigned char *) encrypt_start,
+                       orig_payload_len, iv);
+    }
+
+    /* Update header with new total length */
+    rechdr = (XLogRecord *) wal_encrypt_buffer;
+    rechdr->xl_tot_len = new_total_len;
+
+    /*
+     * Recalculate CRC for the new payload (marker + length + IV + encrypted data).
+     * The header CRC will be added by XLogInsertRecord later.
+     */
+    {
+        pg_crc32c crc;
+
+        INIT_CRC32C(crc);
+        COMP_CRC32C(crc, new_payload_start, new_total_len - SizeOfXLogRecord);
+        rechdr->xl_crc = crc;
+    }
+
+    /* Return single XLogRecData pointing to our encrypted buffer */
+    wal_rdata_head.next = NULL;
+ wal_rdata_head.data = wal_encrypt_buffer;
+ wal_rdata_head.len = new_total_len;
+
+ return &wal_rdata_head;
+}
+
+/*
+ * WAL decode pre-hook: decrypt WAL record data
+ *
+ * This reverses the encryption done in xlog_insert_pre_hook.
+ * Checks for block ID 251 marker to identify encrypted records.
+ *
+ * Input: [header] [block_id=251 (1B)] [payload_length (4B)] [IV 16B] [encrypted payload]
+ * Output: [header] [original payload] (shorter by 21 bytes)
+ *
+ * Recovery process:
+ * 1. Check for encryption marker (block ID 251)
+ * 2. Read payload_length from transform header
+ * 3. Extract IV for decryption
+ * 4. Decrypt payload using IV
+ * 5. Extract plaintext payload CRC from IV[0..3]
+ * 6. Restore original record structure
+ *
+ * If the marker is not found, record is not encrypted (pass through).
+ * If inplace_allowed, decrypts in place. Otherwise, copies to static buffer.
+ */
+static XLogRecord *
+test_tde_xlog_decode_pre(XLogReaderState *state, XLogRecord *record,
+ XLogRecPtr lsn, bool inplace_allowed)
+{
+ uint32 total_len;
+ uint32 transform_payload_len;
+ uint32 encrypted_payload_len;
+ unsigned char iv[WAL_ENCRYPT_IV_SIZE];
+ char *payload_start;
+ char *len_ptr;
+ XLogRecord *work_record;
+
+ /* Chain to previous hook if any */
+ if (prev_xlog_decode_pre_hook)
+ record = prev_xlog_decode_pre_hook(state, record, lsn, inplace_allowed);
+
+ if (record == NULL)
+ return record;
+
+ total_len = record->xl_tot_len;
+
+ /* Must have at least header + transform header + IV */
+ if (total_len < SizeOfXLogRecord + WAL_ENCRYPT_OVERHEAD)
+ return record;
+
+ /* Check for transformation marker (block ID 251) */
+ payload_start = (char *) record + SizeOfXLogRecord;
+ if ((unsigned char) *payload_start != XLR_BLOCK_ID_TRANSFORMED)
+ return record; /* Not transformed, pass through */
+
+ /* WAL is encrypted but cipher not initialized - fatal error */
+ if (cipher_ctx == NULL)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: encrypted WAL record found but encryption key not configured"),
+ errdetail("WAL record at LSN %X/%X has transformation marker.",
+ LSN_FORMAT_ARGS(lsn))));
+
+ /*
+ * If inplace modification allowed, work directly on record. Otherwise,
+ * copy to static buffer (record fits in single page).
+ */
+ if (inplace_allowed)
+ {
+ work_record = record;
+ }
+ else
+ {
+ /* Record within single page, must fit in XLOG_BLCKSZ */
+ if (total_len > XLOG_BLCKSZ)
+ {
+ ereport(WARNING,
+ (errmsg("test_tde: WAL record too large for decryption buffer")));
+ return record;
+ }
+ memcpy(wal_decrypt_buffer, record, total_len);
+ work_record = (XLogRecord *) wal_decrypt_buffer;
+ }
+
+ /* Recalculate payload_start for work_record */
+ payload_start = (char *) work_record + SizeOfXLogRecord;
+
+ /* Read payload_length from transform header (4 bytes, unaligned, little-endian) */
+ len_ptr = payload_start + sizeof(uint8);
+ transform_payload_len = ((uint32) (unsigned char) len_ptr[0] << 0) |
+ ((uint32) (unsigned char) len_ptr[1] << 8) |
+ ((uint32) (unsigned char) len_ptr[2] << 16) |
+ ((uint32) (unsigned char) len_ptr[3] << 24);
+
+ /* Validate payload_length */
+ if (transform_payload_len < WAL_ENCRYPT_IV_SIZE ||
+ transform_payload_len > total_len - SizeOfXLogRecord - SizeOfXLogRecordDataHeaderLong)
+ {
+ ereport(WARNING,
+ (errmsg("test_tde: invalid transform payload length %u at LSN %X/%X",
+ transform_payload_len, LSN_FORMAT_ARGS(lsn))));
+ return record;
+ }
+
+ /* Extract IV (after transform header) */
+ memcpy(iv, payload_start + SizeOfXLogRecordDataHeaderLong, WAL_ENCRYPT_IV_SIZE);
+
+ /* Encrypted payload length = transform_payload_len - IV */
+ encrypted_payload_len = transform_payload_len - WAL_ENCRYPT_IV_SIZE;
+
+ /*
+ * Decrypt in place after the transform header + IV, then shift the plaintext
+ * down to payload_start. OpenSSL does not support partially overlapping buffers.
+ */
+ if (encrypted_payload_len > 0)
+ {
+ transform_data((unsigned char *) (payload_start + WAL_ENCRYPT_OVERHEAD),
+ (unsigned char *) (payload_start + WAL_ENCRYPT_OVERHEAD),
+ encrypted_payload_len, iv);
+ memmove(payload_start, payload_start + WAL_ENCRYPT_OVERHEAD, encrypted_payload_len);
+ }
+
+ /* Update header with original length (transform header and IV removed) */
+ work_record->xl_tot_len = SizeOfXLogRecord + encrypted_payload_len;
+
+ /*
+ * Recover plaintext payload CRC from IV[0..3] (little-endian).
+ */
+ {
+ pg_crc32c recovered_payload_crc;
+ pg_crc32c full_crc;
+
+ /* Extract CRC directly from IV[0..3] */
+ recovered_payload_crc = (pg_crc32c) (((uint32) iv[0] << 0) |
+ ((uint32) iv[1] << 8) |
+ ((uint32) iv[2] << 16) |
+ ((uint32) iv[3] << 24));
+
+ /*
+ * For ValidXLogRecord(), we need CRC of: payload + header (up to xl_crc)
+ * The recovered CRC is payload-only, so add header portion.
+ */
+ full_crc = recovered_payload_crc;
+ COMP_CRC32C(full_crc, (char *) work_record, offsetof(XLogRecord, xl_crc));
+ FIN_CRC32C(full_crc);
+ work_record->xl_crc = full_crc;
+ }
+
+ return work_record;
+}
+
+
+/* ----------
+ * GUC callbacks
+ * ----------
+ */
+
+/*
+ * GUC check hook for key
+ */
+static bool
+check_test_tde_key(char **newval, void **extra, GucSource source)
+{
+ if (*newval == NULL || strlen(*newval) == 0)
+ return true;
+
+ if (strlen(*newval) != 64)
+ {
+ GUC_check_errdetail("Key must be exactly 64 hex characters (256 bits).");
+ return false;
+ }
+
+ /* Validate hex characters */
+ for (int i = 0; i < 64; i++)
+ {
+ char c = (*newval)[i];
+
+ if (!((c >= '0' && c <= '9') ||
+ (c >= 'a' && c <= 'f') ||
+ (c >= 'A' && c <= 'F')))
+ {
+ GUC_check_errdetail("Key must contain only hex characters (0-9, a-f, A-F).");
+ return false;
+ }
+ }
+
+ return true;
+}
+
+/* ----------
+ * Module entry points
+ * ----------
+ */
+
+/*
+ * Module initialization
+ */
+void
+_PG_init(void)
+{
+ unsigned char key[32];
+
+ /*
+ * Create memory context for encryption buffers and allow allocation
+ * in critical sections. This is necessary because WAL encryption runs
+ * inside critical sections, and OOM there will cause PANIC anyway.
+ */
+ test_tde_cxt = AllocSetContextCreate(TopMemoryContext,
+ "test_tde",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(test_tde_cxt, true);
+
+ /*
+ * Define GUC for encryption key.
+ *
+ * PGC_POSTMASTER: Key can only be set at server start to prevent
+ * accidental runtime changes.
+ *
+ * WARNING: Once data is encrypted with a key, that same key MUST be used
+ * for the lifetime of the data. Changing the key (even across restarts)
+ * will cause decryption failures and data corruption. This reference
+ * implementation does not support key rotation.
+ */
+ DefineCustomStringVariable("test_tde.key",
+ "Encryption key in hex format (64 characters = 256 bits).",
+ "WARNING: Key must never change once data is encrypted!",
+ &test_tde_key_hex,
+ "",
+ PGC_POSTMASTER,
+ GUC_SUPERUSER_ONLY,
+ check_test_tde_key,
+ NULL,
+ NULL);
+
+ MarkGUCPrefixReserved("test_tde");
+
+ /*
+ * Parse key and initialize cipher context if key is configured.
+ * cipher_ctx remains NULL if no key is set, disabling encryption.
+ */
+ if (test_tde_key_hex != NULL && strlen(test_tde_key_hex) == 64)
+ {
+ if (!parse_hex_key(test_tde_key_hex, key, 32))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("test_tde: failed to parse encryption key")));
+
+ cipher_ctx = EVP_CIPHER_CTX_new();
+ if (!cipher_ctx)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("test_tde: failed to create cipher context")));
+
+ if (EVP_EncryptInit_ex(cipher_ctx, EVP_aes_256_ctr(), NULL, key, NULL) != 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: failed to initialize cipher context")));
+
+ /* Clear key from stack */
+ explicit_bzero(key, sizeof(key));
+ }
+
+ /* Install hooks (save previous values for chaining) */
+ prev_mdread_post_hook = mdread_post_hook;
+ mdread_post_hook = test_tde_mdread_post;
+
+ prev_mdwrite_pre_hook = mdwrite_pre_hook;
+ mdwrite_pre_hook = test_tde_mdwrite_pre;
+
+ prev_mdextend_pre_hook = mdextend_pre_hook;
+ mdextend_pre_hook = test_tde_mdextend_pre;
+
+ prev_xlog_insert_pre_hook = xlog_insert_pre_hook;
+ xlog_insert_pre_hook = test_tde_xlog_insert_pre;
+
+ prev_xlog_decode_pre_hook = xlog_decode_pre_hook;
+ xlog_decode_pre_hook = test_tde_xlog_decode_pre;
+
+ ereport(LOG,
+ (errmsg("test_tde: initialized (WARNING: for testing only!)")));
+}
+
+/*
+ * Module finalization (note: PostgreSQL never unloads libraries, so this is not called)
+ */
+void
+_PG_fini(void)
+{
+ /* Restore previous hooks */
+ xlog_decode_pre_hook = prev_xlog_decode_pre_hook;
+ xlog_insert_pre_hook = prev_xlog_insert_pre_hook;
+ mdextend_pre_hook = prev_mdextend_pre_hook;
+ mdwrite_pre_hook = prev_mdwrite_pre_hook;
+ mdread_post_hook = prev_mdread_post_hook;
+
+ /* Free OpenSSL cipher context (also clears key material) */
+ if (cipher_ctx != NULL)
+ {
+ EVP_CIPHER_CTX_free(cipher_ctx);
+ cipher_ctx = NULL;
+ }
+
+ /*
+ * Delete memory context - this frees all buffers allocated from it
+ * (encrypt_buffer, encrypt_buffer_ptrs, wal_encrypt_buffer).
+ */
+ if (test_tde_cxt != NULL)
+ {
+ MemoryContextDelete(test_tde_cxt);
+ test_tde_cxt = NULL;
+ }
+
+ /* Reset buffer pointers */
+ encrypt_buffer = NULL;
+ encrypt_buffer_ptrs = NULL;
+ encrypt_buffer_nblocks = 0;
+ wal_encrypt_buffer = NULL;
+ wal_encrypt_buffer_size = 0;
+}
diff --git a/contrib/test_tde/test_tde.conf b/contrib/test_tde/test_tde.conf
new file mode 100644
index 00000000000..0b00366474c
--- /dev/null
+++ b/contrib/test_tde/test_tde.conf
@@ -0,0 +1,2 @@
+shared_preload_libraries = 'test_tde'
+test_tde.key = '0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef'
--
2.50.1 (Apple Git-155)
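The WAL hooks in test_tde above depend on CRC-32C being resumable. After XLogRecordAssemble(), xl_crc holds an unfinalized CRC of the payload only, and XLogInsertRecord() later continues it over the header bytes up to xl_crc and finalizes it. The following is a minimal standalone sketch, assuming a plain bitwise CRC-32C. crc32c_init(), crc32c_update(), crc32c_final(), iv_store_crc() and rebuild_record_crc() are hypothetical stand-ins for INIT_CRC32C, COMP_CRC32C, FIN_CRC32C and the IV handling in the patch, not PostgreSQL APIs. It illustrates why carrying the partial payload CRC in IV[0..3] is enough for the decode hook to rebuild the record CRC that ValidXLogRecord() expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78).
 * Hypothetical stand-ins for INIT_CRC32C / COMP_CRC32C / FIN_CRC32C.
 */
static uint32_t crc32c_init(void) { return 0xFFFFFFFFu; }

static uint32_t crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

static uint32_t crc32c_final(uint32_t crc) { return crc ^ 0xFFFFFFFFu; }

/* Insert side: stash the unfinalized payload CRC in IV[0..3], little-endian */
static void iv_store_crc(unsigned char *iv, size_t iv_len, uint32_t partial)
{
    assert(iv_len >= 4);
    for (int i = 0; i < 4; i++)
        iv[i] = (unsigned char) (partial >> (8 * i));
}

/* Decode side: resume the stored CRC over the header prefix, then finalize */
static uint32_t rebuild_record_crc(const unsigned char *iv,
                                   const void *header, size_t header_len)
{
    uint32_t partial = (uint32_t) iv[0] | ((uint32_t) iv[1] << 8) |
                       ((uint32_t) iv[2] << 16) | ((uint32_t) iv[3] << 24);

    return crc32c_final(crc32c_update(partial, header, header_len));
}
```

Resuming from the stored value gives the same result as a single pass over the plaintext payload followed by the header prefix, which is the order in which PostgreSQL accumulates the record CRC; the ciphertext never enters the final CRC.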
Updated patches with meson build support:
v2:
- Added meson.build for test_tde extension
- Added test_tde to contrib/meson.build
Regards,
Henson Choi
On Sun, Dec 28, 2025 at 6:47 PM, Henson Choi <assam258@gmail.com> wrote:
Hello,
Following up on the RFC, I am submitting the initial patch set for the
proposed infrastructure. These patches introduce a minimal hook-based
protocol to allow extensions to handle data transformation, such as TDE,
while keeping the PostgreSQL core independent of specific cryptographic
implementations.
Implementation Details:
Hook Points in Storage I/O Path
The patch introduces five strategic hook points:
- mdread_post_hook: Called after blocks are read from disk. The extension
can reverse-transform data in place.
- mdwrite_pre_hook & mdextend_pre_hook: Called before writing or extending
blocks. These hooks return a pointer to transformed buffers.
- xlog_insert_pre_hook & xlog_decode_pre_hook: Handle transformation for WAL
records during insertion and replay.
Data Integrity and Checksum Protocol
To ensure robust error detection, the hooks follow a specific verification
protocol:
- On Write: The extension transforms the page, sets the Transform ID, then
recalculates the checksum on the transformed data.
- On Read: The extension verifies the on-disk checksum of the transformed
data first. After reverse-transformation, it clears the Transform ID and
recalculates the checksum for the plaintext data. This ensures corruption
is detected regardless of the transformation state.
WAL Safety via XLR_BLOCK_ID_TRANSFORMED (251)
For WAL records, I have introduced a specific block ID (251) to mark
transformed data. If the decryption extension is not loaded, the WAL reader
will encounter this unknown block ID and fail fast, preventing the system
from incorrectly interpreting encrypted data as valid WAL records.
PageHeader Transform ID (5-bit)
I have allocated bits 3-7 of pd_flags in the PageHeader for a Transform
ID. This allows the engine and extensions to identify the transformation
state of a page (e.g., key versioning or algorithm type) without attempting
decryption. It ensures backward compatibility: pages with Transform ID 0
are treated as standard untransformed pages.
Memory and Critical Section Safety
As demonstrated in the contrib/test_tde reference implementation, cipher
contexts are pre-allocated in _PG_init to avoid memory allocation during
critical sections. For WAL transformation,
MemoryContextAllowInCriticalSection() is used to allow buffer reallocation
within critical sections; if OOM occurs during buffer growth, it results in
a controlled PANIC.
Performance Considerations
When hooks are not set (default), the overhead is limited to a single NULL
pointer comparison per I/O operation. This is architecturally consistent
with existing PostgreSQL hooks and is designed to have a negligible impact
on performance.
Attached Patches:
- v20251228-0001-Add-Storage-I-O-Transform-Hooks-for-PostgreSQL.patch: Core
infrastructure.
- v20251228-0002-Add-test_tde-extension-for-TDE-testing.patch: Reference
implementation using AES-256-CTR.
I look forward to your comments and feedback.
Regards,
Henson Choi
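The WAL transform header described above (block ID 251, then a 4-byte little-endian length covering the IV plus ciphertext, then a 16-byte IV, 21 bytes in total) can be sketched in isolation. This is a minimal sketch: write_transform_header(), parse_transform_header() and the macro names are hypothetical, the 16-byte IV size is taken from the test_tde comments, and the parser's input is assumed to be the bytes that follow the fixed XLogRecord header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_ID_TRANSFORMED 251    /* XLR_BLOCK_ID_TRANSFORMED in the patch */
#define IV_SIZE 16                  /* assumed WAL_ENCRYPT_IV_SIZE */
#define OVERHEAD (1 + 4 + IV_SIZE)  /* block id + length + IV = 21 bytes */

/* Write [251][payload_length LE32][IV]; returns the number of bytes written. */
static size_t write_transform_header(unsigned char *out, size_t out_len,
                                     const unsigned char iv[IV_SIZE],
                                     uint32_t encrypted_len)
{
    uint32_t payload_len = IV_SIZE + encrypted_len;

    assert(out_len >= OVERHEAD);
    out[0] = BLOCK_ID_TRANSFORMED;
    for (int i = 0; i < 4; i++)
        out[1 + i] = (unsigned char) (payload_len >> (8 * i));
    memcpy(out + 5, iv, IV_SIZE);
    return OVERHEAD;
}

/*
 * Parse the header from the bytes after the XLogRecord header. Returns the
 * ciphertext length, or -1 if the marker is missing or the stored length
 * does not match the bytes available.
 */
static long parse_transform_header(const unsigned char *in, size_t avail,
                                   unsigned char iv[IV_SIZE])
{
    uint32_t payload_len;

    if (avail < OVERHEAD || in[0] != BLOCK_ID_TRANSFORMED)
        return -1;
    payload_len = (uint32_t) in[1] | ((uint32_t) in[2] << 8) |
                  ((uint32_t) in[3] << 16) | ((uint32_t) in[4] << 24);
    if (payload_len < IV_SIZE || payload_len != avail - 5)
        return -1;
    memcpy(iv, in + 5, IV_SIZE);
    return (long) (payload_len - IV_SIZE);
}
```

Unlike test_tde_xlog_decode_pre(), which only checks an upper bound on the stored length, this sketch requires an exact match, since the insert hook always grows xl_tot_len by exactly the 21-byte overhead.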
Attachments:
v20251228-v2-0001-Add-Storage-I-O-Transform-Hooks-for-PostgreSQL.patch
From 39d19fc7127124e007ce6bede487209afba6d827 Mon Sep 17 00:00:00 2001
From: Henson Choi <assam258@gmail.com>
Date: Tue, 2 Dec 2025 21:50:12 +0900
Subject: [PATCH] Add Storage I/O Transform Hooks for PostgreSQL
This patch introduces a set of hook points that allow extensions to
intercept and transform data during storage I/O operations. The hooks
are designed to support transparent data encryption (TDE) and similar
use cases that require data transformation at the storage layer.
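The following hooks are added:
- mdread_post_hook, called from mdreadv() in md.c and from
buffer_readv_complete_one() in bufmgr.c (AIO reads), to reverse-transform
blocks after they are read and before checksum verification
- mdwrite_pre_hook / mdextend_pre_hook in md.c to transform blocks
before they are written or extended
- xlog_insert_pre_hook in xloginsert.c to transform the assembled WAL
record before XLogInsertRecord()
- xlog_decode_pre_hook in xlogreader.c to transform raw WAL records
before ValidXLogRecord() and decoding
The patch also reserves WAL block ID 251 (XLR_BLOCK_ID_TRANSFORMED) to
mark transformed records, and a 5-bit Transform ID in bits 3-7 of the
page header's pd_flags.
Each hook is optional and defaults to NULL, so the only cost when no
extension is loaded is a NULL pointer check at each call site.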
Author: Henson Choi <assam258@gmail.com>
---
src/backend/access/transam/xloginsert.c | 10 ++++
src/backend/access/transam/xlogreader.c | 21 ++++++++
src/backend/storage/buffer/bufmgr.c | 9 ++++
src/backend/storage/smgr/md.c | 20 ++++++++
src/include/access/xloginsert.h | 20 ++++++++
src/include/access/xlogreader.h | 20 ++++++++
src/include/access/xlogrecord.h | 5 ++
src/include/storage/bufpage.h | 25 +++++++++-
src/include/storage/md.h | 65 +++++++++++++++++++++++++
9 files changed, 194 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index a56d5a55282..f518ef3f16f 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -136,6 +136,12 @@ static bool begininsert_called = false;
/* Memory context to hold the registered buffer and data references. */
static MemoryContext xloginsert_cxt;
+/*
+ * Hook variable for WAL insert transformation (e.g., encryption).
+ * Extensions can set this hook to transform WAL data after assembly, before insertion.
+ */
+xlog_insert_pre_hook_type xlog_insert_pre_hook = NULL;
+
static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr RedoRecPtr, bool doPageWrites,
XLogRecPtr *fpw_lsn, int *num_fpi,
@@ -526,6 +532,10 @@ XLogInsert(RmgrId rmid, uint8 info)
&fpw_lsn, &num_fpi, &fpi_bytes,
&topxid_included);
+ /* Pre-insert hook for transformation (e.g., encryption) */
+ if (xlog_insert_pre_hook)
+ rdt = xlog_insert_pre_hook(rdt);
+
EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags, num_fpi,
fpi_bytes, topxid_included);
} while (!XLogRecPtrIsValid(EndPos));
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5e5001b2101..169f2b06fc5 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -40,6 +40,13 @@
#include "common/logging.h"
#endif
+/*
+ * Hook variable for WAL record transformation (e.g., decryption).
+ * Extensions can set this hook to transform raw WAL data before decoding.
+ * Frontend tools can also set this hook at startup.
+ */
+xlog_decode_pre_hook_type xlog_decode_pre_hook = NULL;
+
static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
pg_attribute_printf(2, 3);
static void allocate_recordbuf(XLogReaderState *state, uint32 reclength);
@@ -843,6 +850,11 @@ restart:
Assert(gotheader);
record = (XLogRecord *) state->readRecordBuf;
+
+ /* Pre-validation hook for transformation (e.g., decryption) */
+ if (xlog_decode_pre_hook)
+ record = xlog_decode_pre_hook(state, record, RecPtr, true);
+
if (!ValidXLogRecord(state, record, RecPtr))
goto err;
@@ -862,6 +874,15 @@ restart:
goto err;
/* Record does not cross a page boundary */
+
+ /*
+ * Pre-validation hook for transformation (e.g., decryption).
+ * inplace_allowed is false because record points to readBuf, which
+ * may be copied back to WAL files (e.g., FinishWalRecovery).
+ */
+ if (xlog_decode_pre_hook)
+ record = xlog_decode_pre_hook(state, record, RecPtr, false);
+
if (!ValidXLogRecord(state, record, RecPtr))
goto err;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index eb55102b0d7..eb13a17fa94 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -57,6 +57,7 @@
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/md.h"
#include "storage/proc.h"
#include "storage/read_stream.h"
#include "storage/smgr.h"
@@ -7401,6 +7402,14 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8 buf_off, Buffer buffer,
VALGRIND_MAKE_MEM_DEFINED(bufdata, BLCKSZ);
#endif
+ /* Decrypt block before checksum verification */
+ if (mdread_post_hook)
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&tag);
+
+ mdread_post_hook(&rlocator, tag.forkNum, tag.blockNum, &bufdata, 1);
+ }
+
if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags,
failed_checksum))
{
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 71bcdeb6601..5416128d2cc 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -96,6 +96,14 @@ typedef struct _MdfdVec
static MemoryContext MdCxt; /* context for all MdfdVec objects */
+/*
+ * Hook variables for I/O transformation (e.g., encryption/decryption).
+ * Extensions can set these hooks to transform data during storage I/O.
+ */
+mdread_post_hook_type mdread_post_hook = NULL;
+mdwrite_pre_hook_type mdwrite_pre_hook = NULL;
+mdextend_pre_hook_type mdextend_pre_hook = NULL;
+
/* Populate a file tag describing an md.c segment file. */
#define INIT_MD_FILETAG(a,xx_rlocator,xx_forknum,xx_segno) \
@@ -513,6 +521,10 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
relpath(reln->smgr_rlocator, forknum).str,
InvalidBlockNumber)));
+ /* Pre-extend hook for transformation (e.g., encryption) */
+ if (mdextend_pre_hook)
+ buffer = mdextend_pre_hook(&reln->smgr_rlocator.locator, forknum, blocknum, buffer);
+
v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -972,6 +984,10 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
iovcnt = compute_remaining_iovec(iov, iov, iovcnt, nbytes);
}
+ /* Post-read hook for transformation (e.g., decryption) */
+ if (mdread_post_hook)
+ mdread_post_hook(&reln->smgr_rlocator.locator, forknum, blocknum, buffers, nblocks_this_segment);
+
nblocks -= nblocks_this_segment;
buffers += nblocks_this_segment;
blocknum += nblocks_this_segment;
@@ -1064,6 +1080,10 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert((uint64) blocknum + (uint64) nblocks <= (uint64) mdnblocks(reln, forknum));
#endif
+ /* Pre-write hook for transformation (e.g., encryption) */
+ if (mdwrite_pre_hook)
+ buffers = mdwrite_pre_hook(&reln->smgr_rlocator.locator, forknum, blocknum, buffers, nblocks);
+
while (nblocks > 0)
{
struct iovec iov[PG_IOV_MAX];
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index d6a71415d4f..cc54459ad33 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -19,6 +19,26 @@
#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+/* Forward declaration for XLogRecData */
+struct XLogRecData;
+
+/*
+ * Hook function type for WAL insert transformation (e.g., encryption).
+ * Called after XLogRecordAssemble() but before XLogInsertRecord().
+ * Extension can transform the assembled WAL record data for encryption.
+ * Returns the (possibly modified) XLogRecData chain to be inserted.
+ *
+ * The first node's data points to XLogRecord header, which contains
+ * xl_rmid and xl_info if needed by the hook.
+ *
+ * On failure, the hook should either PANIC or return the original rdata
+ * as fallback.
+ */
+typedef struct XLogRecData *(*xlog_insert_pre_hook_type) (struct XLogRecData *rdata);
+
+/* Hook variable for WAL insert transformation */
+extern PGDLLIMPORT xlog_insert_pre_hook_type xlog_insert_pre_hook;
+
/*
* The minimum size of the WAL construction working area. If you need to
* register more than XLR_NORMAL_MAX_BLOCK_ID block references or have more
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index dfabbbd57d4..898d52a1013 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -400,6 +400,26 @@ extern bool DecodeXLogRecord(XLogReaderState *state,
XLogRecPtr lsn,
char **errormsg);
+/*
+ * Hook function type for WAL record transformation (e.g., decryption).
+ * Called before ValidXLogRecord() and DecodeXLogRecord().
+ * Extension can decrypt or transform the raw record data.
+ * Returns the (possibly modified) XLogRecord to be validated and decoded.
+ *
+ * If inplace_allowed is true, the hook may modify the record in place.
+ * If false, the hook must not modify record and must return a separate buffer.
+ *
+ * On failure, the hook should either PANIC or return the original record
+ * as fallback.
+ */
+typedef XLogRecord *(*xlog_decode_pre_hook_type) (XLogReaderState *state,
+ XLogRecord *record,
+ XLogRecPtr lsn,
+ bool inplace_allowed);
+
+/* Hook variable for WAL record transformation */
+extern PGDLLIMPORT xlog_decode_pre_hook_type xlog_decode_pre_hook;
+
/*
* Macros that provide access to parts of the record most recently returned by
* XLogReadRecord() or XLogNextRecord().
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index a06833ce0a3..9cfb2aff5ae 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -244,5 +244,10 @@ typedef struct XLogRecordDataHeaderLong
#define XLR_BLOCK_ID_DATA_LONG 254
#define XLR_BLOCK_ID_ORIGIN 253
#define XLR_BLOCK_ID_TOPLEVEL_XID 252
+/*
+ * I/O transform hook marker. Uses same header format as XLogRecordDataHeaderLong
+ * (1 byte id + 4 bytes length). Use SizeOfXLogRecordDataHeaderLong for size.
+ */
+#define XLR_BLOCK_ID_TRANSFORMED 251
#endif /* XLOGRECORD_H */
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index abc2cf2a020..f18f77d3d22 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -189,7 +189,17 @@ typedef PageHeaderData *PageHeader;
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+/*
+ * Transform ID field (5 bits: values 0-31) for I/O transform extensions.
+ * Value 0 means the page is not transformed (backward compatible).
+ * Values 1-31 are available for extensions to define their own meanings
+ * (e.g., encryption key versions, algorithm identifiers, migration markers).
+ */
+#define PD_TRANSFORM_ID_MASK 0x00F8 /* bits 3-7 */
+#define PD_TRANSFORM_ID_SHIFT 3
+#define PD_TRANSFORM_NONE 0 /* not transformed (core reserved) */
+
+#define PD_VALID_FLAG_BITS 0x00FF /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -441,6 +451,19 @@ PageClearAllVisible(Page page)
((PageHeader) page)->pd_flags &= ~PD_ALL_VISIBLE;
}
+static inline uint8
+PageGetTransformId(const PageData *page)
+{
+ return (((const PageHeaderData *) page)->pd_flags & PD_TRANSFORM_ID_MASK) >> PD_TRANSFORM_ID_SHIFT;
+}
+static inline void
+PageSetTransformId(Page page, uint8 id)
+{
+ ((PageHeader) page)->pd_flags =
+ (((PageHeader) page)->pd_flags & ~PD_TRANSFORM_ID_MASK) |
+ ((id << PD_TRANSFORM_ID_SHIFT) & PD_TRANSFORM_ID_MASK);
+}
+
/*
* These two require "access/transam.h", so left as macros.
*/
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index b563c27abf0..0a766a2b61f 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -22,6 +22,71 @@
extern PGDLLIMPORT const PgAioHandleCallbacks aio_md_readv_cb;
+/*
+ * Hook function types for I/O transformation (e.g., encryption/decryption).
+ * These hooks allow extensions to transform data during storage I/O operations.
+ */
+
+/*
+ * Called after blocks are read from disk, before PostgreSQL's checksum verification.
+ * Extension can reverse-transform (e.g., decrypt) the data in place.
+ *
+ * For synchronous reads, called from mdreadv() after read completes.
+ * For AIO reads, called from buffer_readv_complete_one() before PageIsVerified().
+ *
+ * Note: The hook is responsible for verifying on-disk checksum before reverse
+ * transformation and recalculating checksum after transformation. This ensures
+ * data integrity is verified at both stages and PostgreSQL's checksum verification
+ * passes.
+ *
+ * On failure, the hook should raise an ERROR (or PANIC for critical errors).
+ */
+typedef void (*mdread_post_hook_type) (RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers,
+ BlockNumber nblocks);
+
+/*
+ * Called before mdwritev() writes blocks to disk.
+ * The extension can transform (e.g., encrypt) the data.
+ * Returns a pointer to an array of transformed buffers (the hook manages the
+ * memory, typically using static local storage).
+ *
+ * Note: The hook should recalculate the checksum over the transformed data.
+ * This on-disk checksum is verified on read before the reverse
+ * transformation, ensuring disk-level data integrity.
+ *
+ * On failure, the hook should raise an ERROR (or PANIC for critical errors),
+ * or return the original buffers with a WARNING as fallback.
+ */
+typedef const void **(*mdwrite_pre_hook_type) (RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers,
+ BlockNumber nblocks);
+
+/*
+ * Called before mdextend() extends a relation with new blocks.
+ * Returns a pointer to the transformed buffer (the hook manages the memory,
+ * typically using static local storage).
+ *
+ * Note: As with the write hook, the hook should recalculate the checksum
+ * over the transformed data.
+ *
+ * On failure, the hook should raise an ERROR (or PANIC for critical errors),
+ * or return the original buffer with a WARNING as fallback.
+ */
+typedef const void *(*mdextend_pre_hook_type) (RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ const void *buffer);
+
+/* Hook variables for I/O transformation */
+extern PGDLLIMPORT mdread_post_hook_type mdread_post_hook;
+extern PGDLLIMPORT mdwrite_pre_hook_type mdwrite_pre_hook;
+extern PGDLLIMPORT mdextend_pre_hook_type mdextend_pre_hook;
+
/* md storage manager functionality */
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
--
2.50.1 (Apple Git-155)
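The md.h comments define the write-side contract. The hook transforms data into hook-owned memory, returns a pointer array, and recomputes the checksum over the transformed image. The sketch below illustrates this without PostgreSQL, under stated assumptions:

- demo_write_pre() mirrors the shape of mdwrite_pre_hook_type, minus the RelFileLocator and fork arguments.
- The XOR keystream is a toy stand-in for AES-256-CTR.
- demo_checksum() is a toy additive checksum standing in for pg_checksum_page().

All of these names are hypothetical, none come from the patch, and none are secure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ       8192
#define DEMO_HDR_SIZE     24	/* page header stays in plaintext */
#define DEMO_CHECKSUM_OFF 8		/* offset of pd_checksum in PageHeaderData */
#define DEMO_MAX_BLOCKS   4		/* hypothetical fixed cap for the static pool */

/* Toy checksum: byte sum plus block number, skipping the stored checksum. */
static uint16_t
demo_checksum(const unsigned char *page, uint32_t blkno)
{
	uint32_t	sum = blkno;

	for (size_t i = 0; i < DEMO_BLCKSZ; i++)
		if (i != DEMO_CHECKSUM_OFF && i != DEMO_CHECKSUM_OFF + 1)
			sum += page[i];
	return (uint16_t) sum;
}

static uint16_t
demo_get_checksum(const unsigned char *page)
{
	uint16_t	c;

	memcpy(&c, page + DEMO_CHECKSUM_OFF, sizeof(c));
	return c;
}

/* Toy symmetric keystream applied out of place; NOT encryption. */
static void
demo_xor_body(const unsigned char *in, unsigned char *out, uint32_t blkno)
{
	memcpy(out, in, DEMO_HDR_SIZE);
	for (size_t i = DEMO_HDR_SIZE; i < DEMO_BLCKSZ; i++)
		out[i] = in[i] ^ (unsigned char) (0xA5 ^ (blkno * 31u + i));
}

/* Hook-owned output: static storage, reused on every call. */
static unsigned char demo_pool[DEMO_MAX_BLOCKS][DEMO_BLCKSZ];
static const void *demo_pool_ptrs[DEMO_MAX_BLOCKS];

static const void **
demo_write_pre(uint32_t blocknum, const void **buffers, uint32_t nblocks)
{
	assert(nblocks <= DEMO_MAX_BLOCKS);

	for (uint32_t i = 0; i < nblocks; i++)
	{
		uint16_t	sum;

		/* Transform into our own buffer; the caller's page is left as is. */
		demo_xor_body(buffers[i], demo_pool[i], blocknum + i);
		/* Checksum the on-disk (transformed) image. */
		sum = demo_checksum(demo_pool[i], blocknum + i);
		memcpy(demo_pool[i] + DEMO_CHECKSUM_OFF, &sum, sizeof(sum));
		demo_pool_ptrs[i] = demo_pool[i];
	}
	return demo_pool_ptrs;
}
```

The fixed-size static pool keeps the sketch short. test_tde.c instead grows its buffers on demand in ensure_encrypt_buffer(). Leaving the input pages untouched matters because they are the in-memory plaintext copies that PostgreSQL keeps using after the write.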
v20251228-v2-0002-Add-test_tde-extension-for-TDE-testing.patch
From caf38b1f47fd7f33e47837cd544556bd53c833f9 Mon Sep 17 00:00:00 2001
From: Henson Choi <assam258@gmail.com>
Date: Tue, 2 Dec 2025 21:51:13 +0900
Subject: [PATCH] Add test_tde extension for TDE testing
This extension provides a reference implementation for validating the
Storage I/O Transform Hooks introduced in the previous commit. It uses
AES-256-CTR encryption with an IV derived from page metadata (LSN, block
number, relation file number) to ensure uniqueness.
The extension registers hooks for:
- Buffer page read/write transformation (mdread/mdwrite/mdextend)
- WAL record insert and replay transformation
Key features:
- Encryption key configured via test_tde.key GUC (256-bit hex)
- System catalogs and pg_global tablespace excluded from encryption
- Pre-allocated cipher context to avoid allocation in critical sections
- WAL records marked with block ID 251 for encrypted record detection
This is intended for development and testing purposes only, not for
production use. The implementation lacks key rotation, proper key
management, and security auditing.
Author: Henson Choi <assam258@gmail.com>
---
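test_tde.c documents the layout of a transformed WAL record: [block_id=251][payload_length (4B)][IV (16B)][encrypted payload]. This framing adds exactly 21 bytes. Below is a minimal sketch of framing and unframing under that layout, with these assumptions:

- demo_frame(), demo_unframe() and the XOR transform are hypothetical illustrations, not the patch's code.
- The length is stored in native byte order via memcpy(), as PostgreSQL does for XLogRecordDataHeaderLong.
- How the 16-byte IV is produced is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLOCK_ID_TRANSFORMED 251
#define DEMO_IV_SIZE              16
#define DEMO_OVERHEAD             (1 + sizeof(uint32_t) + DEMO_IV_SIZE)	/* 21 bytes */

/* Toy symmetric transform standing in for AES-256-CTR; NOT secure. */
static void
demo_xor(const unsigned char *in, unsigned char *out, size_t len,
		 const unsigned char *iv)
{
	for (size_t i = 0; i < len; i++)
		out[i] = in[i] ^ iv[i % DEMO_IV_SIZE];
}

/* out must have room for len + DEMO_OVERHEAD bytes; returns bytes written. */
static size_t
demo_frame(const unsigned char *payload, uint32_t len,
		   const unsigned char *iv, unsigned char *out)
{
	out[0] = DEMO_BLOCK_ID_TRANSFORMED;
	memcpy(out + 1, &len, sizeof(len));
	memcpy(out + 1 + sizeof(len), iv, DEMO_IV_SIZE);
	demo_xor(payload, out + DEMO_OVERHEAD, len, iv);
	return DEMO_OVERHEAD + len;
}

/* Strips marker, length and IV; returns payload length, or -1 if not ours. */
static long
demo_unframe(const unsigned char *in, size_t inlen, unsigned char *out)
{
	uint32_t	len;

	if (inlen < DEMO_OVERHEAD || in[0] != DEMO_BLOCK_ID_TRANSFORMED)
		return -1;
	memcpy(&len, in + 1, sizeof(len));
	if ((size_t) len != inlen - DEMO_OVERHEAD)
		return -1;				/* length field disagrees with record size */
	demo_xor(in + DEMO_OVERHEAD, out, len, in + 1 + sizeof(len));
	return (long) len;
}
```

The value 251 sits just below the block IDs the core already reserves (252 for XLR_BLOCK_ID_TOPLEVEL_XID up to 255 for XLR_BLOCK_ID_DATA_SHORT). As a result, a decoder that never strips the marker fails on the unknown block ID instead of misparsing the record.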
contrib/Makefile | 4 +-
contrib/meson.build | 1 +
contrib/test_tde/.gitignore | 3 +
contrib/test_tde/Makefile | 27 +
contrib/test_tde/expected/basic.out | 177 +++++
contrib/test_tde/meson.build | 37 +
contrib/test_tde/sql/basic.sql | 146 ++++
contrib/test_tde/test_tde.c | 1131 +++++++++++++++++++++++++++
contrib/test_tde/test_tde.conf | 2 +
9 files changed, 1526 insertions(+), 2 deletions(-)
create mode 100644 contrib/test_tde/.gitignore
create mode 100644 contrib/test_tde/Makefile
create mode 100644 contrib/test_tde/expected/basic.out
create mode 100644 contrib/test_tde/meson.build
create mode 100644 contrib/test_tde/sql/basic.sql
create mode 100644 contrib/test_tde/test_tde.c
create mode 100644 contrib/test_tde/test_tde.conf
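The commit message says the IV is derived from the LSN, block number and relation file number. derive_iv() in test_tde.c packs these into 16 bytes, with the rarely changing bytes at the end. The sketch below reproduces that byte layout using plain integers; demo_derive_iv() is a hypothetical stand-in that takes a relation number instead of a RelFileLocator. It models the counter as a 128-bit big-endian integer, which is how OpenSSL's CTR mode increments it. It confirms the arithmetic behind the derive_iv() comment: an 8168-byte page body needs 511 AES blocks, so the last counter block is IV + 510.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same byte layout as derive_iv() in test_tde.c, using plain integers. */
static void
demo_derive_iv(unsigned char iv[16], uint32_t relnumber, uint32_t blocknum,
			   uint64_t lsn)
{
	iv[0] = (unsigned char) (lsn >> 0);		/* LSN bits 0-31: high entropy */
	iv[1] = (unsigned char) (lsn >> 8);
	iv[2] = (unsigned char) (lsn >> 16);
	iv[3] = (unsigned char) (lsn >> 24);
	iv[4] = (unsigned char) (lsn >> 32);	/* LSN bits 32-47 */
	iv[5] = (unsigned char) (lsn >> 40);
	iv[6] = (unsigned char) (blocknum >> 0);	/* BlockNumber low 24 bits */
	iv[7] = (unsigned char) (blocknum >> 8);
	iv[8] = (unsigned char) (blocknum >> 16);
	iv[9] = (unsigned char) (relnumber >> 0);	/* RelFileNumber low 24 bits */
	iv[10] = (unsigned char) (relnumber >> 8);
	iv[11] = (unsigned char) (relnumber >> 16);
	iv[12] = (unsigned char) (blocknum >> 24);	/* usually 0 */
	iv[13] = (unsigned char) (relnumber >> 24);	/* usually 0 */
	iv[14] = (unsigned char) (lsn >> 48);	/* usually 0: counter space */
	iv[15] = (unsigned char) (lsn >> 56);
}

/* Add 1 to a 16-byte counter block treated as a big-endian 128-bit integer. */
static void
demo_ctr_increment(unsigned char ctr[16])
{
	for (int n = 15; n >= 0; n--)
		if (++ctr[n] != 0)
			break;				/* no carry into the more significant byte */
}

/* AES blocks needed to cover len bytes (16-byte AES block size). */
static unsigned
demo_aes_blocks(unsigned len)
{
	return (len + 15) / 16;
}

/* Counter block used for the last AES block when encrypting len bytes. */
static void
demo_last_counter(const unsigned char iv[16], unsigned len, unsigned char out[16])
{
	assert(len > 0);
	memcpy(out, iv, 16);
	for (unsigned i = 1; i < demo_aes_blocks(len); i++)
		demo_ctr_increment(out);
}
```

Calling demo_last_counter() on an IV derived from an LSN with bits 48-63 all set shows the edge of that argument. In that case IV[14-15] is 0xFFFF, so the increment carries into byte 13. The counter therefore stays within IV[14-15] only while those LSN bits leave room for 510 more steps.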
diff --git a/contrib/Makefile b/contrib/Makefile
index 2f0a88d3f77..151eb823850 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -54,9 +54,9 @@ SUBDIRS = \
vacuumlo
ifeq ($(with_ssl),openssl)
-SUBDIRS += pgcrypto sslinfo
+SUBDIRS += pgcrypto sslinfo test_tde
else
-ALWAYS_SUBDIRS += pgcrypto sslinfo
+ALWAYS_SUBDIRS += pgcrypto sslinfo test_tde
endif
ifneq ($(with_uuid),no)
diff --git a/contrib/meson.build b/contrib/meson.build
index ed30ee7d639..a592b947702 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -65,6 +65,7 @@ subdir('sslinfo')
subdir('tablefunc')
subdir('tcn')
subdir('test_decoding')
+subdir('test_tde')
subdir('tsm_system_rows')
subdir('tsm_system_time')
subdir('unaccent')
diff --git a/contrib/test_tde/.gitignore b/contrib/test_tde/.gitignore
new file mode 100644
index 00000000000..2ea3752951a
--- /dev/null
+++ b/contrib/test_tde/.gitignore
@@ -0,0 +1,3 @@
+log
+results
+tmp_check
diff --git a/contrib/test_tde/Makefile b/contrib/test_tde/Makefile
new file mode 100644
index 00000000000..b2455d3831e
--- /dev/null
+++ b/contrib/test_tde/Makefile
@@ -0,0 +1,27 @@
+# contrib/test_tde/Makefile
+
+MODULE_big = test_tde
+OBJS = \
+ $(WIN32RES) \
+ test_tde.o
+
+PGFILEDESC = "test_tde - reference implementation for I/O transform hooks"
+
+REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_tde/test_tde.conf
+REGRESS = basic
+# Disabled because these tests require "shared_preload_libraries=test_tde"
+NO_INSTALLCHECK = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_tde
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# OpenSSL is required for encryption
+SHLIB_LINK += $(filter -lcrypto, $(LIBS))
diff --git a/contrib/test_tde/expected/basic.out b/contrib/test_tde/expected/basic.out
new file mode 100644
index 00000000000..9932cf43614
--- /dev/null
+++ b/contrib/test_tde/expected/basic.out
@@ -0,0 +1,177 @@
+-- Basic test for test_tde extension
+-- Verify that encryption/decryption works correctly
+-- Show current settings
+SHOW test_tde.key;
+ test_tde.key
+------------------------------------------------------------------
+ 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
+(1 row)
+
+-- Create a test table
+CREATE TABLE test_encrypt (
+ id serial PRIMARY KEY,
+ secret_data text,
+ secret_number integer
+);
+-- Insert some data
+INSERT INTO test_encrypt (secret_data, secret_number) VALUES
+ ('This is secret data', 12345),
+ ('Another secret message', 67890),
+ ('PostgreSQL TDE test', 11111);
+-- Force a checkpoint to ensure data is written to disk
+CHECKPOINT;
+-- Read data back - should be decrypted correctly
+SELECT * FROM test_encrypt ORDER BY id;
+ id | secret_data | secret_number
+----+------------------------+---------------
+ 1 | This is secret data | 12345
+ 2 | Another secret message | 67890
+ 3 | PostgreSQL TDE test | 11111
+(3 rows)
+
+-- Update some data
+UPDATE test_encrypt SET secret_data = 'Updated secret' WHERE id = 1;
+-- Verify update worked
+SELECT * FROM test_encrypt WHERE id = 1;
+ id | secret_data | secret_number
+----+----------------+---------------
+ 1 | Updated secret | 12345
+(1 row)
+
+-- Test with larger data
+INSERT INTO test_encrypt (secret_data, secret_number)
+SELECT
+ repeat('Large data block ', 100),
+ generate_series
+FROM generate_series(1, 10);
+-- Count rows
+SELECT COUNT(*) FROM test_encrypt;
+ count
+-------
+ 13
+(1 row)
+
+-- Test with NULL values
+INSERT INTO test_encrypt (secret_data, secret_number) VALUES (NULL, NULL);
+SELECT * FROM test_encrypt WHERE secret_data IS NULL;
+ id | secret_data | secret_number
+----+-------------+---------------
+ 14 | |
+(1 row)
+
+-- Test index creation (index pages should also be encrypted)
+CREATE INDEX ON test_encrypt (secret_number);
+-- Use the index
+SELECT secret_data FROM test_encrypt WHERE secret_number = 12345;
+ secret_data
+----------------
+ Updated secret
+(1 row)
+
+-- Clean up
+DROP TABLE test_encrypt;
+-- =============================================================================
+-- DDL Tests: Operations that change RelFileNumber
+-- These operations create new files and write records through storage hooks,
+-- so encryption/decryption works correctly.
+-- =============================================================================
+-- -----------------------------------------------------------------------------
+-- Test 1: TRUNCATE (creates new file, writes through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_truncate (id int, data text);
+INSERT INTO test_truncate VALUES (1, 'before truncate');
+SELECT * FROM test_truncate;
+ id | data
+----+-----------------
+ 1 | before truncate
+(1 row)
+
+TRUNCATE test_truncate;
+-- Insert new data after truncate - works fine (new file, new encryption through hooks)
+INSERT INTO test_truncate VALUES (2, 'after truncate');
+SELECT * FROM test_truncate;
+ id | data
+----+----------------
+ 2 | after truncate
+(1 row)
+
+DROP TABLE test_truncate;
+-- -----------------------------------------------------------------------------
+-- Test 2: CLUSTER (rewrites table through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_cluster (id int PRIMARY KEY, data text);
+INSERT INTO test_cluster SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+CHECKPOINT;
+CLUSTER test_cluster USING test_cluster_pkey;
+-- Works fine - data rewritten through storage hooks
+SELECT COUNT(*) FROM test_cluster;
+ count
+-------
+ 100
+(1 row)
+
+SELECT * FROM test_cluster WHERE id = 50;
+ id | data
+----+---------
+ 50 | data 50
+(1 row)
+
+DROP TABLE test_cluster;
+-- -----------------------------------------------------------------------------
+-- Test 3: VACUUM FULL (rewrites table through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_vacuum_full (id int, data text);
+INSERT INTO test_vacuum_full SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+DELETE FROM test_vacuum_full WHERE id > 50;
+CHECKPOINT;
+VACUUM FULL test_vacuum_full;
+-- Works fine - data rewritten through storage hooks
+SELECT COUNT(*) FROM test_vacuum_full;
+ count
+-------
+ 50
+(1 row)
+
+DROP TABLE test_vacuum_full;
+-- -----------------------------------------------------------------------------
+-- Test 4: REINDEX (rebuilds index through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_reindex (id int PRIMARY KEY, data text);
+INSERT INTO test_reindex SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+CHECKPOINT;
+REINDEX INDEX test_reindex_pkey;
+-- Works fine - index rebuilt through storage hooks
+SET enable_seqscan = off;
+SELECT * FROM test_reindex WHERE id = 50;
+ id | data
+----+---------
+ 50 | data 50
+(1 row)
+
+RESET enable_seqscan;
+DROP TABLE test_reindex;
+-- =============================================================================
+-- Additional DDL Tests: Operations that change RelFileNumber or copy files
+-- These also go through storage hooks, so encryption/decryption works correctly.
+-- =============================================================================
+-- -----------------------------------------------------------------------------
+-- Test 5: ALTER TABLE SET TABLESPACE
+-- RelFileNumber changes, but data is copied through storage hooks
+-- -----------------------------------------------------------------------------
+\! mkdir -p /tmp/test_tde_tablespace
+CREATE TABLESPACE test_tde_tblspc LOCATION '/tmp/test_tde_tablespace';
+CREATE TABLE test_set_tablespace (id int, data text);
+INSERT INTO test_set_tablespace SELECT g, 'data ' || g FROM generate_series(1, 50) g;
+CHECKPOINT;
+-- Move to different tablespace - data copied through storage hooks
+ALTER TABLE test_set_tablespace SET TABLESPACE test_tde_tblspc;
+-- Works fine - data was re-encrypted with new RelFileNumber
+SELECT COUNT(*) FROM test_set_tablespace;
+ count
+-------
+ 50
+(1 row)
+
+DROP TABLE test_set_tablespace;
+DROP TABLESPACE test_tde_tblspc;
+\! rm -rf /tmp/test_tde_tablespace
diff --git a/contrib/test_tde/meson.build b/contrib/test_tde/meson.build
new file mode 100644
index 00000000000..329e1a4b8e2
--- /dev/null
+++ b/contrib/test_tde/meson.build
@@ -0,0 +1,37 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+if not ssl.found()
+ subdir_done()
+endif
+
+test_tde_sources = files(
+ 'test_tde.c',
+)
+
+if host_system == 'windows'
+ test_tde_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tde',
+ '--FILEDESC', 'test_tde - reference implementation for I/O transform hooks',])
+endif
+
+test_tde = shared_module('test_tde',
+ test_tde_sources,
+ kwargs: contrib_mod_args + {
+ 'dependencies': [ssl, contrib_mod_args['dependencies']]
+ },
+)
+contrib_targets += test_tde
+
+tests += {
+ 'name': 'test_tde',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'basic',
+ ],
+ 'regress_args': ['--temp-config', files('test_tde.conf')],
+ # Disabled because these tests require "shared_preload_libraries=test_tde"
+ 'runningcheck': false,
+ },
+}
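ensure_encrypt_buffer() in test_tde.c grows its page pool by doubling. Because repalloc() does not preserve the I/O alignment, it allocates a fresh aligned block and copies the old contents instead. Below is a standalone sketch of the same policy. It assumes C11 aligned_alloc() in place of MemoryContextAllocAligned() and a 4096-byte I/O alignment (the value PG_IO_ALIGN_SIZE has in current sources). demo_ensure() and its globals are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_BLCKSZ   8192
#define DEMO_IO_ALIGN 4096		/* assumed I/O alignment */

static char *demo_buf = NULL;
static const void **demo_ptrs = NULL;
static uint32_t demo_nblocks = 0;

/* Returns false on allocation failure (the real code relies on palloc's ERROR). */
static bool
demo_ensure(uint32_t nblocks)
{
	uint32_t	new_nblocks;
	char	   *new_buf;
	const void **new_ptrs;

	if (demo_nblocks >= nblocks)
		return true;

	/* Start at max(8, nblocks), then double until the request fits. */
	new_nblocks = (demo_nblocks == 0) ? (nblocks > 8 ? nblocks : 8) : demo_nblocks;
	while (new_nblocks < nblocks)
		new_nblocks *= 2;

	/* realloc() would not keep the alignment, so allocate fresh and copy. */
	new_buf = aligned_alloc(DEMO_IO_ALIGN, (size_t) new_nblocks * DEMO_BLCKSZ);
	if (new_buf == NULL)
		return false;
	new_ptrs = realloc(demo_ptrs, new_nblocks * sizeof(void *));
	if (new_ptrs == NULL)
	{
		free(new_buf);
		return false;
	}
	if (demo_buf != NULL)
	{
		memcpy(new_buf, demo_buf, (size_t) demo_nblocks * DEMO_BLCKSZ);
		free(demo_buf);
	}
	demo_buf = new_buf;
	demo_ptrs = new_ptrs;
	demo_nblocks = new_nblocks;

	/* Every slot must point into the new allocation. */
	for (uint32_t i = 0; i < demo_nblocks; i++)
		demo_ptrs[i] = demo_buf + (size_t) i * DEMO_BLCKSZ;
	return true;
}
```

Refreshing the whole pointer array after every growth is required, not just the new slots: the old allocation has been freed, so any stale pointer would dangle.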
diff --git a/contrib/test_tde/sql/basic.sql b/contrib/test_tde/sql/basic.sql
new file mode 100644
index 00000000000..9b2651afee8
--- /dev/null
+++ b/contrib/test_tde/sql/basic.sql
@@ -0,0 +1,146 @@
+-- Basic test for test_tde extension
+-- Verify that encryption/decryption works correctly
+
+-- Show current settings
+SHOW test_tde.key;
+
+-- Create a test table
+CREATE TABLE test_encrypt (
+ id serial PRIMARY KEY,
+ secret_data text,
+ secret_number integer
+);
+
+-- Insert some data
+INSERT INTO test_encrypt (secret_data, secret_number) VALUES
+ ('This is secret data', 12345),
+ ('Another secret message', 67890),
+ ('PostgreSQL TDE test', 11111);
+
+-- Force a checkpoint to ensure data is written to disk
+CHECKPOINT;
+
+-- Read data back - should be decrypted correctly
+SELECT * FROM test_encrypt ORDER BY id;
+
+-- Update some data
+UPDATE test_encrypt SET secret_data = 'Updated secret' WHERE id = 1;
+
+-- Verify update worked
+SELECT * FROM test_encrypt WHERE id = 1;
+
+-- Test with larger data
+INSERT INTO test_encrypt (secret_data, secret_number)
+SELECT
+ repeat('Large data block ', 100),
+ generate_series
+FROM generate_series(1, 10);
+
+-- Count rows
+SELECT COUNT(*) FROM test_encrypt;
+
+-- Test with NULL values
+INSERT INTO test_encrypt (secret_data, secret_number) VALUES (NULL, NULL);
+SELECT * FROM test_encrypt WHERE secret_data IS NULL;
+
+-- Test index creation (index pages should also be encrypted)
+CREATE INDEX ON test_encrypt (secret_number);
+
+-- Use the index
+SELECT secret_data FROM test_encrypt WHERE secret_number = 12345;
+
+-- Clean up
+DROP TABLE test_encrypt;
+
+-- =============================================================================
+-- DDL Tests: Operations that change RelFileNumber
+-- These operations create new files and write records through storage hooks,
+-- so encryption/decryption works correctly.
+-- =============================================================================
+
+-- -----------------------------------------------------------------------------
+-- Test 1: TRUNCATE (creates new file, writes through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_truncate (id int, data text);
+INSERT INTO test_truncate VALUES (1, 'before truncate');
+SELECT * FROM test_truncate;
+
+TRUNCATE test_truncate;
+
+-- Insert new data after truncate - works fine (new file, new encryption through hooks)
+INSERT INTO test_truncate VALUES (2, 'after truncate');
+SELECT * FROM test_truncate;
+
+DROP TABLE test_truncate;
+
+-- -----------------------------------------------------------------------------
+-- Test 2: CLUSTER (rewrites table through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_cluster (id int PRIMARY KEY, data text);
+INSERT INTO test_cluster SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+CHECKPOINT;
+
+CLUSTER test_cluster USING test_cluster_pkey;
+
+-- Works fine - data rewritten through storage hooks
+SELECT COUNT(*) FROM test_cluster;
+SELECT * FROM test_cluster WHERE id = 50;
+
+DROP TABLE test_cluster;
+
+-- -----------------------------------------------------------------------------
+-- Test 3: VACUUM FULL (rewrites table through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_vacuum_full (id int, data text);
+INSERT INTO test_vacuum_full SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+DELETE FROM test_vacuum_full WHERE id > 50;
+CHECKPOINT;
+
+VACUUM FULL test_vacuum_full;
+
+-- Works fine - data rewritten through storage hooks
+SELECT COUNT(*) FROM test_vacuum_full;
+
+DROP TABLE test_vacuum_full;
+
+-- -----------------------------------------------------------------------------
+-- Test 4: REINDEX (rebuilds index through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_reindex (id int PRIMARY KEY, data text);
+INSERT INTO test_reindex SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+CHECKPOINT;
+
+REINDEX INDEX test_reindex_pkey;
+
+-- Works fine - index rebuilt through storage hooks
+SET enable_seqscan = off;
+SELECT * FROM test_reindex WHERE id = 50;
+RESET enable_seqscan;
+
+DROP TABLE test_reindex;
+
+-- =============================================================================
+-- Additional DDL Tests: Operations that change RelFileNumber or copy files
+-- These also go through storage hooks, so encryption/decryption works correctly.
+-- =============================================================================
+
+-- -----------------------------------------------------------------------------
+-- Test 5: ALTER TABLE SET TABLESPACE
+-- RelFileNumber changes, but data is copied through storage hooks
+-- -----------------------------------------------------------------------------
+\! mkdir -p /tmp/test_tde_tablespace
+CREATE TABLESPACE test_tde_tblspc LOCATION '/tmp/test_tde_tablespace';
+
+CREATE TABLE test_set_tablespace (id int, data text);
+INSERT INTO test_set_tablespace SELECT g, 'data ' || g FROM generate_series(1, 50) g;
+CHECKPOINT;
+
+-- Move to different tablespace - data copied through storage hooks
+ALTER TABLE test_set_tablespace SET TABLESPACE test_tde_tblspc;
+
+-- Works fine - data was re-encrypted with new RelFileNumber
+SELECT COUNT(*) FROM test_set_tablespace;
+
+DROP TABLE test_set_tablespace;
+DROP TABLESPACE test_tde_tblspc;
+\! rm -rf /tmp/test_tde_tablespace
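test_tde_mdread_post() in test_tde.c follows the md.h read contract:

1. Skip pages whose Transformation ID is PD_TRANSFORM_NONE or belongs to another extension.
2. Verify the checksum on the on-disk image.
3. Decrypt.
4. Recompute the checksum.

The standalone sketch below walks through that ordering, with a matching write step so the round trip can be checked. Its assumptions:

- Header offsets follow PageHeaderData (pd_checksum at byte 8, pd_flags at byte 10).
- The XOR keystream and additive checksum are toy stand-ins for AES-256-CTR and pg_checksum_page().
- Clearing the Transformation ID after decryption is this sketch's own choice, not taken from test_tde.c.

All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ       8192
#define DEMO_HDR_SIZE     24	/* sizeof(PageHeaderData); kept in plaintext */
#define DEMO_CHECKSUM_OFF 8		/* offset of pd_checksum */
#define DEMO_FLAGS_OFF    10	/* offset of pd_flags */
#define DEMO_ID_MASK      0x00F8	/* PD_TRANSFORM_ID_MASK */
#define DEMO_ID_SHIFT     3		/* PD_TRANSFORM_ID_SHIFT */
#define DEMO_OUR_ID       1		/* like TEST_TDE_TRANSFORM_ID */

typedef enum
{
	DEMO_SKIPPED,				/* plaintext page or another extension's ID */
	DEMO_DECRYPTED,
	DEMO_BAD_CHECKSUM			/* the real hook would raise an ERROR */
} DemoReadResult;

static uint16_t
demo_get16(const unsigned char *page, size_t off)
{
	uint16_t	v;

	memcpy(&v, page + off, sizeof(v));
	return v;
}

static void
demo_put16(unsigned char *page, size_t off, uint16_t v)
{
	memcpy(page + off, &v, sizeof(v));
}

/* Toy checksum: byte sum plus block number, skipping the stored checksum. */
static uint16_t
demo_checksum(const unsigned char *page, uint32_t blkno)
{
	uint32_t	sum = blkno;

	for (size_t i = 0; i < DEMO_BLCKSZ; i++)
		if (i != DEMO_CHECKSUM_OFF && i != DEMO_CHECKSUM_OFF + 1)
			sum += page[i];
	return (uint16_t) sum;
}

/* Toy symmetric keystream over the page body, in place; NOT encryption. */
static void
demo_xor_body(unsigned char *page, uint32_t blkno)
{
	for (size_t i = DEMO_HDR_SIZE; i < DEMO_BLCKSZ; i++)
		page[i] ^= (unsigned char) (0xA5 ^ (blkno * 31u + i));
}

/* Write side, in place for brevity: transform, tag the ID, checksum the result. */
static void
demo_encrypt_for_disk(unsigned char *page, uint32_t blkno)
{
	uint16_t	flags = demo_get16(page, DEMO_FLAGS_OFF);

	assert((flags & DEMO_ID_MASK) == 0);	/* not already transformed */
	demo_xor_body(page, blkno);
	demo_put16(page, DEMO_FLAGS_OFF, (uint16_t) (flags | (DEMO_OUR_ID << DEMO_ID_SHIFT)));
	demo_put16(page, DEMO_CHECKSUM_OFF, demo_checksum(page, blkno));
}

/* Read side, following the ordering described in md.h. */
static DemoReadResult
demo_read_post(unsigned char *page, uint32_t blkno)
{
	uint16_t	flags = demo_get16(page, DEMO_FLAGS_OFF);
	unsigned	id = (flags & DEMO_ID_MASK) >> DEMO_ID_SHIFT;

	if (id != DEMO_OUR_ID)
		return DEMO_SKIPPED;
	/* 1. verify the checksum of the on-disk (transformed) image */
	if (demo_get16(page, DEMO_CHECKSUM_OFF) != demo_checksum(page, blkno))
		return DEMO_BAD_CHECKSUM;
	/* 2. reverse the transformation in place */
	demo_xor_body(page, blkno);
	/* 3. clear the ID (assumption) and recompute the checksum on plaintext */
	demo_put16(page, DEMO_FLAGS_OFF, (uint16_t) (flags & ~DEMO_ID_MASK));
	demo_put16(page, DEMO_CHECKSUM_OFF, demo_checksum(page, blkno));
	return DEMO_DECRYPTED;
}
```

Checking the checksum before decrypting means a torn or corrupted on-disk page is reported as such. Otherwise it would silently decrypt into garbage that only later fails PageIsVerified().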
diff --git a/contrib/test_tde/test_tde.c b/contrib/test_tde/test_tde.c
new file mode 100644
index 00000000000..f70359f1c26
--- /dev/null
+++ b/contrib/test_tde/test_tde.c
@@ -0,0 +1,1131 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_tde.c
+ * Reference implementation for Storage I/O Transform Hooks
+ *
+ * WARNING: This is for TESTING ONLY. Do not use in production.
+ * - Key stored in plaintext GUC
+ * - No key rotation
+ * - Minimal error handling
+ * - Not audited for security
+ *
+ * For production TDE, use a dedicated extension project.
+ *
+ * This extension demonstrates how to use the storage I/O transform hooks
+ * for transparent data encryption. It uses AES-256-CTR for encryption
+ * with IV derived from page metadata and block location.
+ *
+ * Author: Henson Choi <assam258@gmail.com>
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/test_tde/test_tde.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <openssl/err.h>
+#include <openssl/evp.h>
+#include <string.h>
+
+#include "access/transam.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogreader.h"
+#include "access/xlogrecord.h"
+#include "catalog/pg_tablespace_d.h"
+#include "fmgr.h"
+#include "port/pg_crc32c.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
+#include "storage/md.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC_EXT(
+ .name = "test_tde",
+ .version = PG_VERSION
+);
+
+/* ----------
+ * GUC variables
+ * ----------
+ */
+static char *test_tde_key_hex = NULL; /* 64 hex chars = 256 bits */
+
+/* ----------
+ * Module state
+ * ----------
+ */
+
+/*
+ * Memory context for encryption buffers.
+ * Allows allocation in critical sections (for WAL encryption).
+ */
+static MemoryContext test_tde_cxt = NULL;
+
+/*
+ * Transform ID for this extension.
+ * Value 1 means page is encrypted with test_tde.
+ * Value 0 means page is not transformed (plaintext).
+ */
+#define TEST_TDE_TRANSFORM_ID 1
+
+/*
+ * Dynamic buffers for encrypted pages.
+ * Grows as needed. Note that the server never calls _PG_fini.
+ */
+static char *encrypt_buffer = NULL;
+static const void **encrypt_buffer_ptrs = NULL;
+static BlockNumber encrypt_buffer_nblocks = 0;
+
+/*
+ * WAL encryption buffer - allocated from test_tde_cxt which allows
+ * allocation in critical sections via MemoryContextAllowInCriticalSection().
+ */
+static char *wal_encrypt_buffer = NULL;
+static Size wal_encrypt_buffer_size = 0;
+
+/*
+ * WAL decryption buffer - static, only needed for records within a single page.
+ * When inplace_allowed=false, the record does not cross a page boundary, so its
+ * maximum size is XLOG_BLCKSZ.
+ */
+static char wal_decrypt_buffer[XLOG_BLCKSZ];
+
+/*
+ * Pre-allocated OpenSSL cipher context.
+ * Created in _PG_init() and reused for all encrypt/decrypt operations.
+ * This avoids memory allocation in critical sections.
+ */
+static EVP_CIPHER_CTX *cipher_ctx = NULL;
+
+/*
+ * Transformed WAL record structure (using XLR_BLOCK_ID_TRANSFORMED from xlogrecord.h):
+ * [XLogRecord header]
+ * [block_id=251 (1B)]
+ * [payload_length (4B)]
+ * [IV (16B)]
+ * [encrypted payload]
+ *
+ * The block ID 251 marks this record as transformed. After decryption,
+ * the marker, length, and IV are removed, restoring the original structure.
+ * If decryption is not performed, the unknown block ID causes parse failure.
+ *
+ * Note: The 21-byte overhead may temporarily cause xl_tot_len to exceed
+ * XLogRecordMaxSize after encryption. This is safe because:
+ * - XLogRecordMaxSize is only checked in XLogRecordAssemble() before our hook
+ * - XLogInsertRecord() does not re-validate the size
+ * - The decode hook removes the overhead before WAL parsing, restoring the
+ * original size which was already validated
+ */
+#define WAL_ENCRYPT_IV_SIZE 16
+#define WAL_ENCRYPT_OVERHEAD (SizeOfXLogRecordDataHeaderLong + WAL_ENCRYPT_IV_SIZE)
+#define WAL_CRC_SIZE sizeof(pg_crc32c) /* 4 bytes */
+#define WAL_IV_RANDOM_SIZE (WAL_ENCRYPT_IV_SIZE - WAL_CRC_SIZE) /* 12 bytes */
+
+/* Static XLogRecData for returning encrypted WAL */
+static XLogRecData wal_rdata_head;
+
+/* Previous hook values (for chaining) */
+static mdread_post_hook_type prev_mdread_post_hook = NULL;
+static mdwrite_pre_hook_type prev_mdwrite_pre_hook = NULL;
+static mdextend_pre_hook_type prev_mdextend_pre_hook = NULL;
+static xlog_insert_pre_hook_type prev_xlog_insert_pre_hook = NULL;
+static xlog_decode_pre_hook_type prev_xlog_decode_pre_hook = NULL;
+
+/* ----------
+ * Function declarations
+ * ----------
+ */
+
+/* Module entry points */
+void _PG_init(void);
+void _PG_fini(void);
+
+/* GUC callbacks */
+static bool check_test_tde_key(char **newval, void **extra, GucSource source);
+
+/* Hook functions */
+static void test_tde_mdread_post(RelFileLocator *rlocator, ForkNumber forknum,
+ BlockNumber blocknum, void **buffers,
+ BlockNumber nblocks);
+static const void **test_tde_mdwrite_pre(RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers,
+ BlockNumber nblocks);
+static const void *test_tde_mdextend_pre(RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ const void *buffer);
+static struct XLogRecData *test_tde_xlog_insert_pre(struct XLogRecData *rdata);
+static XLogRecord *test_tde_xlog_decode_pre(XLogReaderState *state,
+ XLogRecord *record,
+ XLogRecPtr lsn,
+ bool inplace_allowed);
+
+/* Internal helper functions */
+static void ensure_encrypt_buffer(BlockNumber nblocks);
+static bool parse_hex_key(const char *hex, unsigned char *out, int outlen);
+static void derive_iv(unsigned char *iv, RelFileLocator *rlocator,
+ BlockNumber blocknum, XLogRecPtr lsn);
+static void transform_data(const unsigned char *in, unsigned char *out,
+ int len, const unsigned char *iv);
+static bool should_transform(RelFileLocator *rlocator, ForkNumber forknum);
+
+
+/* ----------
+ * Internal helper functions
+ * ----------
+ */
+
+/*
+ * Parse hex string to bytes
+ */
+static bool
+parse_hex_key(const char *hex, unsigned char *out, int outlen)
+{
+ int i;
+ int hexlen;
+
+ if (hex == NULL)
+ return false;
+
+ hexlen = strlen(hex);
+ if (hexlen != outlen * 2)
+ return false;
+
+ for (i = 0; i < outlen; i++)
+ {
+ int hi,
+ lo;
+ char c;
+
+ c = hex[i * 2];
+ if (c >= '0' && c <= '9')
+ hi = c - '0';
+ else if (c >= 'a' && c <= 'f')
+ hi = c - 'a' + 10;
+ else if (c >= 'A' && c <= 'F')
+ hi = c - 'A' + 10;
+ else
+ return false;
+
+ c = hex[i * 2 + 1];
+ if (c >= '0' && c <= '9')
+ lo = c - '0';
+ else if (c >= 'a' && c <= 'f')
+ lo = c - 'a' + 10;
+ else if (c >= 'A' && c <= 'F')
+ lo = c - 'A' + 10;
+ else
+ return false;
+
+ out[i] = (hi << 4) | lo;
+ }
+
+ return true;
+}
+
+/*
+ * Ensure encrypt buffer can hold 'nblocks' pages.
+ * Grows by 2x when needed. Uses test_tde_cxt for persistence.
+ */
+static void
+ensure_encrypt_buffer(BlockNumber nblocks)
+{
+ if (encrypt_buffer_nblocks >= nblocks)
+ return;
+
+ if (encrypt_buffer == NULL)
+ {
+ BlockNumber initial = Max(8, nblocks);
+ Size size = (Size) initial * BLCKSZ;
+
+ encrypt_buffer = MemoryContextAllocAligned(test_tde_cxt, size,
+ PG_IO_ALIGN_SIZE, 0);
+ encrypt_buffer_ptrs = MemoryContextAlloc(test_tde_cxt,
+ initial * sizeof(void *));
+ encrypt_buffer_nblocks = initial;
+ }
+ else
+ {
+ BlockNumber new_nblocks = encrypt_buffer_nblocks;
+ Size new_size;
+
+ while (new_nblocks < nblocks)
+ new_nblocks *= 2;
+
+ new_size = (Size) new_nblocks * BLCKSZ;
+
+ /* repalloc doesn't preserve alignment, so allocate new and copy */
+ {
+ char *new_buffer = MemoryContextAllocAligned(test_tde_cxt,
+ new_size,
+ PG_IO_ALIGN_SIZE, 0);
+
+ memcpy(new_buffer, encrypt_buffer,
+ (Size) encrypt_buffer_nblocks * BLCKSZ);
+ pfree(encrypt_buffer);
+ encrypt_buffer = new_buffer;
+ }
+
+ encrypt_buffer_ptrs = repalloc(encrypt_buffer_ptrs,
+ new_nblocks * sizeof(void *));
+ encrypt_buffer_nblocks = new_nblocks;
+ }
+
+ /* Update pointers array */
+ for (BlockNumber i = 0; i < encrypt_buffer_nblocks; i++)
+ encrypt_buffer_ptrs[i] = encrypt_buffer + (Size) i * BLCKSZ;
+}
+
+
+/*
+ * Derive IV from page location and header
+ *
+ * IV structure (16 bytes) - simple, deterministic layout:
+ *
+ * AES-CTR mode only requires IV uniqueness, not randomness.
+ * The combination of LSN + RelFileNumber + BlockNumber guarantees uniqueness:
+ * - LSN: Globally unique across entire WAL stream
+ * - RelFileNumber: Unique within database
+ * - BlockNumber: Unique within relation
+ *
+ * Even when a single WAL record modifies multiple pages (e.g., B-tree split),
+ * the BlockNumber distinguishes each page.
+ *
+ * Layout (high entropy bytes first, low entropy bytes last for CTR counter space):
+ * [0-3] LSN low 32 bits - changes frequently (high entropy)
+ * [4-5] LSN bits 32-47 - mid entropy
+ * [6-8] BlockNumber low 24 bits
+ * [9-11] RelFileNumber low 24 bits
+ * [12] BlockNumber high 8 bits - usually 0 for small tables
+ * [13] RelFileNumber high 8 bits - usually 0
+ * [14-15] LSN bits 48-63 - usually 0, counter space for CTR
+ *
+ * CTR counter space analysis:
+ * - Page size: 8KB, encrypted area: 8168 bytes (excluding 24-byte header)
+ * - AES block size: 16 bytes
+ * - AES blocks per page: ceil(8168/16) = 511, so the counter advances by 510 (0x1FE)
+ * - Counter affects only IV[14-15] while LSN bits 48-63 are 0 (0x1FE < 0x10000)
+ * - Bytes 12-15 provide 2^32 counter space, far exceeding the 511 blocks needed
+ * - Collision requires same IV[0-11], which means same LSN+BlockNum+RelNum
+ *
+ * Note: spcOid, dbOid not used - RelFileNumber is sufficient for uniqueness.
+ *
+ * Known limitation: Operations that copy/move files while changing
+ * RelFileNumber without going through storage hooks cause decryption failure.
+ */
+static void
+derive_iv(unsigned char *iv, RelFileLocator *rlocator,
+ BlockNumber blocknum, XLogRecPtr lsn)
+{
+
+ /*
+ * Layout: High entropy first, low entropy (usually 0) last.
+ * [LSN low 4B][LSN mid 2B][BlockNum low 3B][RelNum low 3B]
+ * [BlockNum high 1B][RelNum high 1B][LSN high 2B]
+ */
+
+ /* LSN low 32 bits - bytes 0-3 (high entropy, changes frequently) */
+ iv[0] = (uint8) ((lsn >> 0) & 0xFF);
+ iv[1] = (uint8) ((lsn >> 8) & 0xFF);
+ iv[2] = (uint8) ((lsn >> 16) & 0xFF);
+ iv[3] = (uint8) ((lsn >> 24) & 0xFF);
+
+ /* LSN bits 32-47 - bytes 4-5 (mid entropy) */
+ iv[4] = (uint8) ((lsn >> 32) & 0xFF);
+ iv[5] = (uint8) ((lsn >> 40) & 0xFF);
+
+ /* BlockNumber low 24 bits - bytes 6-8 */
+ iv[6] = (uint8) ((blocknum >> 0) & 0xFF);
+ iv[7] = (uint8) ((blocknum >> 8) & 0xFF);
+ iv[8] = (uint8) ((blocknum >> 16) & 0xFF);
+
+ /* RelFileNumber low 24 bits - bytes 9-11 */
+ iv[9] = (uint8) ((rlocator->relNumber >> 0) & 0xFF);
+ iv[10] = (uint8) ((rlocator->relNumber >> 8) & 0xFF);
+ iv[11] = (uint8) ((rlocator->relNumber >> 16) & 0xFF);
+
+ /* BlockNumber high 8 bits - byte 12 (usually 0 for small tables) */
+ iv[12] = (uint8) ((blocknum >> 24) & 0xFF);
+
+ /* RelFileNumber high 8 bits - byte 13 (usually 0) */
+ iv[13] = (uint8) ((rlocator->relNumber >> 24) & 0xFF);
+
+ /* LSN bits 48-63 - bytes 14-15 (usually 0, counter space for CTR) */
+ iv[14] = (uint8) ((lsn >> 48) & 0xFF);
+ iv[15] = (uint8) ((lsn >> 56) & 0xFF);
+}
+
+/*
+ * Encrypt or decrypt data using AES-256-CTR
+ *
+ * AES-CTR is symmetric: encrypt and decrypt use the same operation.
+ */
+static void
+transform_data(const unsigned char *in, unsigned char *out, int len,
+ const unsigned char *iv)
+{
+ int outlen,
+ tmplen;
+
+ if (len <= 0)
+ return;
+
+ /*
+ * cipher_ctx is pre-allocated and initialized with cipher/key in _PG_init().
+ * Here we only set IV (cipher=NULL, key=NULL), which avoids internal
+ * memory allocation. This is critical for WAL encryption which runs
+ * inside critical sections. We use PANIC for all errors.
+ */
+ if (cipher_ctx == NULL)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: cipher context not initialized")));
+
+ if (EVP_EncryptInit_ex(cipher_ctx, NULL, NULL, NULL, iv) != 1)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: EVP_EncryptInit_ex failed: %s",
+ ERR_error_string(ERR_get_error(), NULL))));
+
+ if (EVP_EncryptUpdate(cipher_ctx, out, &outlen, in, len) != 1)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: EVP_EncryptUpdate failed: %s",
+ ERR_error_string(ERR_get_error(), NULL))));
+
+ if (EVP_EncryptFinal_ex(cipher_ctx, out + outlen, &tmplen) != 1)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: EVP_EncryptFinal_ex failed: %s",
+ ERR_error_string(ERR_get_error(), NULL))));
+}
+
+/*
+ * Check if we should encrypt/decrypt this relation
+ *
+ * For this test implementation, we encrypt only user-created relations.
+ * A production implementation would check encryption policies.
+ */
+static bool
+should_transform(RelFileLocator *rlocator, ForkNumber forknum)
+{
+ /* Skip if cipher not initialized (key not configured) */
+ if (cipher_ctx == NULL)
+ return false;
+
+ /* Skip system catalog tablespace (pg_global) */
+ if (rlocator->spcOid == GLOBALTABLESPACE_OID)
+ return false;
+
+ /*
+ * Skip system catalogs (OID < FirstNormalObjectId). This ensures we don't
+ * try to encrypt/decrypt pre-existing system catalog pages that were
+ * created without encryption.
+ */
+ if (rlocator->relNumber < FirstNormalObjectId)
+ return false;
+
+ (void) forknum; /* all forks are encrypted for user tables */
+
+ return true;
+}
+
+
+/* ----------
+ * Hook functions - Page I/O
+ * ----------
+ */
+
+/*
+ * Post-read hook: decrypt blocks after reading from disk
+ */
+static void
+test_tde_mdread_post(RelFileLocator *rlocator, ForkNumber forknum,
+ BlockNumber blocknum, void **buffers,
+ BlockNumber nblocks)
+{
+ BlockNumber i;
+ unsigned char iv[16];
+
+ /* Chain to previous hook if any */
+ if (prev_mdread_post_hook)
+ prev_mdread_post_hook(rlocator, forknum, blocknum, buffers, nblocks);
+
+ for (i = 0; i < nblocks; i++)
+ {
+ PageHeader phdr = (PageHeader) buffers[i];
+ uint16 checksum;
+ uint8 transform_id;
+
+ /* Skip empty/new pages */
+ if (PageIsNew((Page) buffers[i]))
+ continue;
+
+ /* Skip if page doesn't look valid */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ continue;
+
+ /* Check transform ID - skip if page is not encrypted by us */
+ transform_id = PageGetTransformId((Page) buffers[i]);
+ if (transform_id == PD_TRANSFORM_NONE)
+ continue; /* Page is not encrypted */
+
+ if (transform_id != TEST_TDE_TRANSFORM_ID)
+ {
+ elog(DEBUG1, "test_tde: skipping block %u with transform ID %u (not ours)",
+ blocknum + i, transform_id);
+ continue;
+ }
+
+ /* Page is encrypted but cipher not initialized - fatal error */
+ if (cipher_ctx == NULL)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: encrypted page found but encryption key not configured"),
+ errdetail("Block %u of relation %u/%u/%u fork %d has transform ID %u.",
+ blocknum + i, rlocator->spcOid, rlocator->dbOid,
+ rlocator->relNumber, forknum, transform_id)));
+
+ /* Verify checksum on encrypted data before decryption */
+ if (DataChecksumsEnabled())
+ {
+ checksum = pg_checksum_page((char *) buffers[i], blocknum + i);
+ if (checksum != phdr->pd_checksum)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("page verification failed, calculated checksum %u but expected %u",
+ checksum, phdr->pd_checksum)));
+ }
+ }
+
+ /* Derive IV using LSN from page header */
+ derive_iv(iv, rlocator, blocknum + i, PageGetLSN((Page) buffers[i]));
+
+ /* Decrypt data area in place (header stays unchanged) */
+ transform_data((unsigned char *) buffers[i] + SizeOfPageHeaderData,
+ (unsigned char *) buffers[i] + SizeOfPageHeaderData,
+ BLCKSZ - SizeOfPageHeaderData, iv);
+
+ /* Clear transform ID and recalculate checksum for plaintext data */
+ PageSetTransformId((Page) buffers[i], PD_TRANSFORM_NONE);
+ PageSetChecksumInplace((Page) buffers[i], blocknum + i);
+ }
+}
+
+/*
+ * Helper: encrypt a single page into the encrypt_buffer at given offset.
+ * Returns pointer to encrypted page, or original buffer if page was skipped.
+ */
+static const void *
+encrypt_page(RelFileLocator *rlocator, BlockNumber blocknum,
+ const void *buffer, Size buffer_offset)
+{
+ unsigned char iv[16];
+ PageHeader phdr = (PageHeader) buffer;
+ char *dest = encrypt_buffer + buffer_offset;
+
+ /* Skip empty/new pages */
+ if (PageIsNew((Page) buffer))
+ return buffer;
+
+ /* Skip if page doesn't look valid */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ return buffer;
+
+ /* Derive IV using LSN from page header */
+ derive_iv(iv, rlocator, blocknum, PageGetLSN((Page) buffer));
+
+ /* Copy header, encrypt data area */
+ memcpy(dest, buffer, SizeOfPageHeaderData);
+ transform_data((unsigned char *) buffer + SizeOfPageHeaderData,
+ (unsigned char *) dest + SizeOfPageHeaderData,
+ BLCKSZ - SizeOfPageHeaderData, iv);
+
+ /* Set transform ID to mark page as encrypted */
+ PageSetTransformId((Page) dest, TEST_TDE_TRANSFORM_ID);
+
+ /* Recalculate checksum for encrypted data */
+ PageSetChecksumInplace((Page) dest, blocknum);
+
+ return dest;
+}
+
+/*
+ * Pre-write hook: encrypt blocks before writing to disk
+ */
+static const void **
+test_tde_mdwrite_pre(RelFileLocator *rlocator, ForkNumber forknum,
+ BlockNumber blocknum, const void **buffers,
+ BlockNumber nblocks)
+{
+ BlockNumber i;
+
+ /* Chain to previous hook if any */
+ if (prev_mdwrite_pre_hook)
+ buffers = prev_mdwrite_pre_hook(rlocator, forknum, blocknum, buffers, nblocks);
+
+ if (!should_transform(rlocator, forknum))
+ return buffers;
+
+ /* Ensure buffer is large enough */
+ ensure_encrypt_buffer(nblocks);
+
+ for (i = 0; i < nblocks; i++)
+ encrypt_buffer_ptrs[i] = encrypt_page(rlocator, blocknum + i,
+ buffers[i], (Size) i * BLCKSZ);
+
+ return encrypt_buffer_ptrs;
+}
+
+/*
+ * Pre-extend hook: encrypt block before extending relation
+ */
+static const void *
+test_tde_mdextend_pre(RelFileLocator *rlocator, ForkNumber forknum,
+ BlockNumber blocknum, const void *buffer)
+{
+ /* Chain to previous hook if any */
+ if (prev_mdextend_pre_hook)
+ buffer = prev_mdextend_pre_hook(rlocator, forknum, blocknum, buffer);
+
+ if (!should_transform(rlocator, forknum))
+ return buffer;
+
+ /* Ensure buffer is large enough for at least 1 block */
+ ensure_encrypt_buffer(1);
+
+ return encrypt_page(rlocator, blocknum, buffer, 0);
+}
+
+
+/* ----------
+ * Hook functions - WAL I/O
+ * ----------
+ */
+
+/*
+ * Ensure WAL encryption buffer is large enough.
+ * Uses test_tde_cxt which allows allocation in critical sections.
+ */
+static void
+ensure_wal_encrypt_buffer(Size needed)
+{
+ if (wal_encrypt_buffer_size >= needed)
+ return;
+
+ if (wal_encrypt_buffer == NULL)
+ wal_encrypt_buffer = MemoryContextAlloc(test_tde_cxt, needed);
+ else
+ wal_encrypt_buffer = repalloc(wal_encrypt_buffer, needed);
+ wal_encrypt_buffer_size = needed;
+}
+
+/*
+ * WAL insert pre-hook: encrypt WAL record data
+ *
+ * Strategy:
+ * 1. Copy XLogRecord header and payload
+ * 2. Save plaintext CRC from header (xl_crc contains payload CRC at this point)
+ * 3. Build IV: [plaintext CRC (4B)] [random (12B)]
+ * 4. Insert transformation header (block ID 251 + payload_length) and IV
+ * 5. Encrypt original payload with the IV
+ * 6. Update xl_tot_len and recalculate CRC for encrypted payload
+ *
+ * Resulting record structure:
+ * [XLogRecord header]
+ * [block_id=251 (1B)]
+ * [payload_length (4B)]
+ * [IV 16B]
+ * [encrypted payload]
+ *
+ * The block ID 251 marks this record as encrypted. After decryption,
+ * the marker, length, and IV are removed, restoring the original structure.
+ * If decryption is not performed, the unknown block ID causes parse failure.
+ */
+static struct XLogRecData *
+test_tde_xlog_insert_pre(struct XLogRecData *rdata)
+{
+ XLogRecData *node;
+ XLogRecord *rechdr;
+ char *bufptr;
+ char *new_payload_start;
+ uint32 orig_total_len;
+ uint32 orig_payload_len;
+ uint32 new_total_len;
+ uint32 transform_payload_len;
+ unsigned char iv[WAL_ENCRYPT_IV_SIZE];
+ pg_crc32c plaintext_crc;
+
+ /* Chain to previous hook if any */
+ if (prev_xlog_insert_pre_hook)
+ rdata = prev_xlog_insert_pre_hook(rdata);
+
+ /* Skip if cipher not initialized (key not configured) */
+ if (cipher_ctx == NULL)
+ return rdata;
+
+ /* First node must contain XLogRecord header */
+ if (rdata == NULL || rdata->data == NULL || rdata->len < SizeOfXLogRecord)
+ return rdata;
+
+ rechdr = (XLogRecord *) rdata->data;
+ orig_total_len = rechdr->xl_tot_len;
+ orig_payload_len = orig_total_len - SizeOfXLogRecord;
+
+ /* Sanity check */
+ if (orig_total_len < SizeOfXLogRecord)
+ return rdata;
+
+ /*
+ * Skip records with no payload (e.g., XLOG_SWITCH). These are header-only
+ * records where adding encryption overhead would break size assertions.
+ */
+ if (orig_payload_len == 0)
+ return rdata;
+
+ new_total_len = orig_total_len + WAL_ENCRYPT_OVERHEAD;
+
+ /*
+ * Save plaintext CRC before we modify anything.
+ * At this point, xl_crc contains the CRC of the payload only
+ * (header CRC is added later by XLogInsertRecord).
+ */
+ plaintext_crc = rechdr->xl_crc;
+
+ /*
+ * Ensure buffer is large enough. test_tde_cxt allows allocation in
+ * critical sections, so this is safe even during WAL insertion.
+ * OOM here will cause PANIC, which is acceptable for critical sections.
+ */
+ ensure_wal_encrypt_buffer(new_total_len);
+
+ /*
+ * Build IV: [plaintext CRC (4B)] [random (12B)]
+ * Store CRC directly in IV[0..3] (little-endian).
+ */
+ iv[0] = ((uint32) plaintext_crc >> 0) & 0xFF;
+ iv[1] = ((uint32) plaintext_crc >> 8) & 0xFF;
+ iv[2] = ((uint32) plaintext_crc >> 16) & 0xFF;
+ iv[3] = ((uint32) plaintext_crc >> 24) & 0xFF;
+
+ /* Generate random bytes for IV[4..15] (12 bytes) for uniqueness */
+ if (!pg_strong_random(iv + WAL_CRC_SIZE, WAL_IV_RANDOM_SIZE))
+ {
+ ereport(WARNING,
+ (errmsg("test_tde: failed to generate random IV for WAL")));
+ return rdata;
+ }
+
+ /*
+ * Build encrypted record in buffer:
+ * [header][block_id][payload_length][IV][encrypted_payload]
+ */
+ bufptr = wal_encrypt_buffer;
+
+ /* 1. Copy header from first rdata node */
+ memcpy(bufptr, rdata->data, SizeOfXLogRecord);
+ bufptr += SizeOfXLogRecord;
+
+ /* 2. Insert transformation header (block ID 251 + payload_length) */
+ new_payload_start = bufptr;
+ *bufptr = (char) XLR_BLOCK_ID_TRANSFORMED;
+ bufptr += sizeof(uint8);
+
+ /* Calculate payload_length: IV + encrypted payload */
+ transform_payload_len = WAL_ENCRYPT_IV_SIZE + orig_payload_len;
+
+ /* Store payload_length (4 bytes, unaligned, little-endian) */
+ bufptr[0] = (char) ((transform_payload_len >> 0) & 0xFF);
+ bufptr[1] = (char) ((transform_payload_len >> 8) & 0xFF);
+ bufptr[2] = (char) ((transform_payload_len >> 16) & 0xFF);
+ bufptr[3] = (char) ((transform_payload_len >> 24) & 0xFF);
+ bufptr += sizeof(uint32);
+
+ /* 3. Insert IV (CRC in first 4 bytes, random in remaining 12) */
+ memcpy(bufptr, iv, WAL_ENCRYPT_IV_SIZE);
+ bufptr += WAL_ENCRYPT_IV_SIZE;
+
+ /* 4. Copy payload to buffer, then encrypt in-place */
+ if (orig_payload_len > 0)
+ {
+ Size first_node_payload;
+ char *encrypt_start = bufptr;
+
+ /* First node: skip header, copy remaining payload */
+ first_node_payload = rdata->len - SizeOfXLogRecord;
+ if (first_node_payload > 0)
+ {
+ memcpy(bufptr, (char *) rdata->data + SizeOfXLogRecord, first_node_payload);
+ bufptr += first_node_payload;
+ }
+
+ /* Remaining nodes: copy all data */
+ for (node = rdata->next; node != NULL; node = node->next)
+ {
+ if (node->len > 0 && node->data != NULL)
+ {
+ memcpy(bufptr, node->data, node->len);
+ bufptr += node->len;
+ }
+ }
+
+ /* Encrypt payload in-place */
+ transform_data((unsigned char *) encrypt_start,
+ (unsigned char *) encrypt_start,
+ orig_payload_len, iv);
+ }
+
+ /* Update header with new total length */
+ rechdr = (XLogRecord *) wal_encrypt_buffer;
+ rechdr->xl_tot_len = new_total_len;
+
+ /*
+ * Recalculate CRC for the new payload (marker + length + IV + encrypted data).
+ * The header CRC will be added by XLogInsertRecord later.
+ */
+ {
+ pg_crc32c crc;
+
+ INIT_CRC32C(crc);
+ COMP_CRC32C(crc, new_payload_start, new_total_len - SizeOfXLogRecord);
+ rechdr->xl_crc = crc;
+ }
+
+ /* Return single XLogRecData pointing to our encrypted buffer */
+ wal_rdata_head.next = NULL;
+ wal_rdata_head.data = wal_encrypt_buffer;
+ wal_rdata_head.len = new_total_len;
+
+ return &wal_rdata_head;
+}
+
+/*
+ * WAL decode pre-hook: decrypt WAL record data
+ *
+ * This reverses the encryption done in xlog_insert_pre_hook.
+ * Checks for block ID 251 marker to identify encrypted records.
+ *
+ * Input: [header] [block_id=251 (1B)] [payload_length (4B)] [IV 16B] [encrypted payload]
+ * Output: [header] [original payload] (shorter by 21 bytes)
+ *
+ * Recovery process:
+ * 1. Check for encryption marker (block ID 251)
+ * 2. Read payload_length from transform header
+ * 3. Extract IV for decryption
+ * 4. Decrypt payload using IV
+ * 5. Extract plaintext payload CRC from IV[0..3]
+ * 6. Restore original record structure
+ *
+ * If the marker is not found, record is not encrypted (pass through).
+ * If inplace_allowed, decrypts in place. Otherwise, copies to static buffer.
+ */
+static XLogRecord *
+test_tde_xlog_decode_pre(XLogReaderState *state, XLogRecord *record,
+ XLogRecPtr lsn, bool inplace_allowed)
+{
+ uint32 total_len;
+ uint32 transform_payload_len;
+ uint32 encrypted_payload_len;
+ unsigned char iv[WAL_ENCRYPT_IV_SIZE];
+ char *payload_start;
+ char *len_ptr;
+ XLogRecord *work_record;
+
+ /* Chain to previous hook if any */
+ if (prev_xlog_decode_pre_hook)
+ record = prev_xlog_decode_pre_hook(state, record, lsn, inplace_allowed);
+
+ if (record == NULL)
+ return record;
+
+ total_len = record->xl_tot_len;
+
+ /* Must have at least header + transform header + IV */
+ if (total_len < SizeOfXLogRecord + WAL_ENCRYPT_OVERHEAD)
+ return record;
+
+ /* Check for transformation marker (block ID 251) */
+ payload_start = (char *) record + SizeOfXLogRecord;
+ if ((unsigned char) *payload_start != XLR_BLOCK_ID_TRANSFORMED)
+ return record; /* Not transformed, pass through */
+
+ /* WAL is encrypted but cipher not initialized - fatal error */
+ if (cipher_ctx == NULL)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: encrypted WAL record found but encryption key not configured"),
+ errdetail("WAL record at LSN %X/%X has transformation marker.",
+ LSN_FORMAT_ARGS(lsn))));
+
+ /*
+ * If inplace modification allowed, work directly on record. Otherwise,
+ * copy to static buffer (record fits in single page).
+ */
+ if (inplace_allowed)
+ {
+ work_record = record;
+ }
+ else
+ {
+ /* Record within single page, must fit in XLOG_BLCKSZ */
+ if (total_len > XLOG_BLCKSZ)
+ {
+ ereport(WARNING,
+ (errmsg("test_tde: WAL record too large for decryption buffer")));
+ return record;
+ }
+ memcpy(wal_decrypt_buffer, record, total_len);
+ work_record = (XLogRecord *) wal_decrypt_buffer;
+ }
+
+ /* Recalculate payload_start for work_record */
+ payload_start = (char *) work_record + SizeOfXLogRecord;
+
+ /* Read payload_length from transform header (4 bytes, unaligned, little-endian) */
+ len_ptr = payload_start + sizeof(uint8);
+ transform_payload_len = ((uint32) (unsigned char) len_ptr[0] << 0) |
+ ((uint32) (unsigned char) len_ptr[1] << 8) |
+ ((uint32) (unsigned char) len_ptr[2] << 16) |
+ ((uint32) (unsigned char) len_ptr[3] << 24);
+
+ /* Validate payload_length */
+ if (transform_payload_len < WAL_ENCRYPT_IV_SIZE ||
+ transform_payload_len > total_len - SizeOfXLogRecord - SizeOfXLogRecordDataHeaderLong)
+ {
+ ereport(WARNING,
+ (errmsg("test_tde: invalid transform payload length %u at LSN %X/%X",
+ transform_payload_len, LSN_FORMAT_ARGS(lsn))));
+ return record;
+ }
+
+ /* Extract IV (after transform header) */
+ memcpy(iv, payload_start + SizeOfXLogRecordDataHeaderLong, WAL_ENCRYPT_IV_SIZE);
+
+ /* Encrypted payload length = transform_payload_len - IV */
+ encrypted_payload_len = transform_payload_len - WAL_ENCRYPT_IV_SIZE;
+
+ /*
+ * Shift the ciphertext down over the transform header and IV, then decrypt
+ * in place. OpenSSL allows in == out but not partially overlapping
+ * buffers, and source (payload_start + 21) overlaps dest (payload_start).
+ */
+ if (encrypted_payload_len > 0)
+ {
+ memmove(payload_start, payload_start + WAL_ENCRYPT_OVERHEAD, encrypted_payload_len);
+ transform_data((unsigned char *) payload_start, (unsigned char *) payload_start,
+ encrypted_payload_len, iv);
+ }
+
+ /* Update header with original length (transform header and IV removed) */
+ work_record->xl_tot_len = SizeOfXLogRecord + encrypted_payload_len;
+
+ /*
+ * Recover plaintext payload CRC from IV[0..3] (little-endian).
+ */
+ {
+ pg_crc32c recovered_payload_crc;
+ pg_crc32c full_crc;
+
+ /* Extract CRC directly from IV[0..3] */
+ recovered_payload_crc = (pg_crc32c) (((uint32) iv[0] << 0) |
+ ((uint32) iv[1] << 8) |
+ ((uint32) iv[2] << 16) |
+ ((uint32) iv[3] << 24));
+
+ /*
+ * For ValidXLogRecord(), we need CRC of: payload + header (up to xl_crc)
+ * The recovered CRC is payload-only, so add header portion.
+ */
+ full_crc = recovered_payload_crc;
+ COMP_CRC32C(full_crc, (char *) work_record, offsetof(XLogRecord, xl_crc));
+ FIN_CRC32C(full_crc);
+ work_record->xl_crc = full_crc;
+ }
+
+ return work_record;
+}
+
+
+/* ----------
+ * GUC callbacks
+ * ----------
+ */
+
+/*
+ * GUC check hook for key
+ */
+static bool
+check_test_tde_key(char **newval, void **extra, GucSource source)
+{
+ if (*newval == NULL || strlen(*newval) == 0)
+ return true;
+
+ if (strlen(*newval) != 64)
+ {
+ GUC_check_errdetail("Key must be exactly 64 hex characters (256 bits).");
+ return false;
+ }
+
+ /* Validate hex characters */
+ for (int i = 0; i < 64; i++)
+ {
+ char c = (*newval)[i];
+
+ if (!((c >= '0' && c <= '9') ||
+ (c >= 'a' && c <= 'f') ||
+ (c >= 'A' && c <= 'F')))
+ {
+ GUC_check_errdetail("Key must contain only hex characters (0-9, a-f, A-F).");
+ return false;
+ }
+ }
+
+ return true;
+}
+
+/* ----------
+ * Module entry points
+ * ----------
+ */
+
+/*
+ * Module initialization
+ */
+void
+_PG_init(void)
+{
+ unsigned char key[32];
+
+ /*
+ * Create memory context for encryption buffers and allow allocation
+ * in critical sections. This is necessary because WAL encryption runs
+ * inside critical sections, and OOM there will cause PANIC anyway.
+ */
+ test_tde_cxt = AllocSetContextCreate(TopMemoryContext,
+ "test_tde",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(test_tde_cxt, true);
+
+ /*
+ * Define GUC for encryption key.
+ *
+ * PGC_POSTMASTER: Key can only be set at server start to prevent
+ * accidental runtime changes.
+ *
+ * WARNING: Once data is encrypted with a key, that same key MUST be used
+ * for the lifetime of the data. Changing the key (even across restarts)
+ * will cause decryption failures and data corruption. This reference
+ * implementation does not support key rotation.
+ */
+ DefineCustomStringVariable("test_tde.key",
+ "Encryption key in hex format (64 characters = 256 bits).",
+ "WARNING: Key must never change once data is encrypted!",
+ &test_tde_key_hex,
+ "",
+ PGC_POSTMASTER,
+ GUC_SUPERUSER_ONLY,
+ check_test_tde_key,
+ NULL,
+ NULL);
+
+ MarkGUCPrefixReserved("test_tde");
+
+ /*
+ * Parse key and initialize cipher context if key is configured.
+ * cipher_ctx remains NULL if no key is set, disabling encryption.
+ */
+ if (test_tde_key_hex != NULL && strlen(test_tde_key_hex) == 64)
+ {
+ if (!parse_hex_key(test_tde_key_hex, key, 32))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("test_tde: failed to parse encryption key")));
+
+ cipher_ctx = EVP_CIPHER_CTX_new();
+ if (!cipher_ctx)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("test_tde: failed to create cipher context")));
+
+ if (EVP_EncryptInit_ex(cipher_ctx, EVP_aes_256_ctr(), NULL, key, NULL) != 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: failed to initialize cipher context")));
+
+ /* Clear key from stack */
+ explicit_bzero(key, sizeof(key));
+ }
+
+ /* Install hooks (save previous values for chaining) */
+ prev_mdread_post_hook = mdread_post_hook;
+ mdread_post_hook = test_tde_mdread_post;
+
+ prev_mdwrite_pre_hook = mdwrite_pre_hook;
+ mdwrite_pre_hook = test_tde_mdwrite_pre;
+
+ prev_mdextend_pre_hook = mdextend_pre_hook;
+ mdextend_pre_hook = test_tde_mdextend_pre;
+
+ prev_xlog_insert_pre_hook = xlog_insert_pre_hook;
+ xlog_insert_pre_hook = test_tde_xlog_insert_pre;
+
+ prev_xlog_decode_pre_hook = xlog_decode_pre_hook;
+ xlog_decode_pre_hook = test_tde_xlog_decode_pre;
+
+ ereport(LOG,
+ (errmsg("test_tde: initialized (WARNING: for testing only!)")));
+}
+
+/*
+ * Module finalization
+ */
+void
+_PG_fini(void)
+{
+ /* Restore previous hooks */
+ xlog_decode_pre_hook = prev_xlog_decode_pre_hook;
+ xlog_insert_pre_hook = prev_xlog_insert_pre_hook;
+ mdextend_pre_hook = prev_mdextend_pre_hook;
+ mdwrite_pre_hook = prev_mdwrite_pre_hook;
+ mdread_post_hook = prev_mdread_post_hook;
+
+ /* Free OpenSSL cipher context (also clears key material) */
+ if (cipher_ctx != NULL)
+ {
+ EVP_CIPHER_CTX_free(cipher_ctx);
+ cipher_ctx = NULL;
+ }
+
+ /*
+ * Delete memory context - this frees all buffers allocated from it
+ * (encrypt_buffer, encrypt_buffer_ptrs, wal_encrypt_buffer).
+ */
+ if (test_tde_cxt != NULL)
+ {
+ MemoryContextDelete(test_tde_cxt);
+ test_tde_cxt = NULL;
+ }
+
+ /* Reset buffer pointers */
+ encrypt_buffer = NULL;
+ encrypt_buffer_ptrs = NULL;
+ encrypt_buffer_nblocks = 0;
+ wal_encrypt_buffer = NULL;
+ wal_encrypt_buffer_size = 0;
+}
diff --git a/contrib/test_tde/test_tde.conf b/contrib/test_tde/test_tde.conf
new file mode 100644
index 00000000000..0b00366474c
--- /dev/null
+++ b/contrib/test_tde/test_tde.conf
@@ -0,0 +1,2 @@
+shared_preload_libraries = 'test_tde'
+test_tde.key = '0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef'
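The encrypted WAL record layout described in the comments of test_tde_xlog_insert_pre() and test_tde_xlog_decode_pre() can be checked outside the server. The following standalone sketch is illustrative only and is not part of the patch: it encodes and parses just the 21-byte transform header (marker 251, a little-endian 4-byte length covering IV plus ciphertext, and a 16-byte IV whose first four bytes hold the plaintext payload CRC). There is no XLogRecord header and no encryption, and the SKETCH_* constants and helper functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values mirror the test_tde comments; the names are local to this sketch. */
#define SKETCH_BLOCK_ID_TRANSFORMED 251
#define SKETCH_IV_SIZE    16
#define SKETCH_LEN_OFFSET 1                                   /* after 1-byte marker */
#define SKETCH_IV_OFFSET  5                                   /* marker + 4-byte length */
#define SKETCH_OVERHEAD   (SKETCH_IV_OFFSET + SKETCH_IV_SIZE) /* 21 bytes */

/*
 * Write [marker][length LE32][IV] in front of an already-encrypted payload
 * of payload_len bytes. Returns the number of header bytes written.
 */
static size_t
put_transform_header(uint8_t *dst, const uint8_t iv[SKETCH_IV_SIZE],
                     uint32_t payload_len)
{
    uint32_t len;

    assert(payload_len <= UINT32_MAX - SKETCH_IV_SIZE);   /* no length overflow */
    len = SKETCH_IV_SIZE + payload_len;                   /* covers IV + payload */

    dst[0] = SKETCH_BLOCK_ID_TRANSFORMED;
    for (int i = 0; i < 4; i++)                           /* little-endian, unaligned */
        dst[SKETCH_LEN_OFFSET + i] = (uint8_t) (len >> (8 * i));
    memcpy(dst + SKETCH_IV_OFFSET, iv, SKETCH_IV_SIZE);
    return SKETCH_OVERHEAD;
}

/*
 * Parse the header from the avail bytes that follow the record header.
 * Returns the encrypted payload length, or -1 if the marker is missing or
 * the stored length does not fit in the record.
 */
static long
get_transform_header(const uint8_t *src, size_t avail,
                     uint8_t iv[SKETCH_IV_SIZE])
{
    uint32_t len = 0;

    if (avail < SKETCH_OVERHEAD || src[0] != SKETCH_BLOCK_ID_TRANSFORMED)
        return -1;
    for (int i = 0; i < 4; i++)
        len |= (uint32_t) src[SKETCH_LEN_OFFSET + i] << (8 * i);
    if (len < SKETCH_IV_SIZE || len > avail - SKETCH_IV_OFFSET)
        return -1;
    memcpy(iv, src + SKETCH_IV_OFFSET, SKETCH_IV_SIZE);
    return (long) (len - SKETCH_IV_SIZE);
}

/* Bytes 0..3 of the IV carry the plaintext payload CRC (little-endian). */
static uint32_t
crc_from_iv(const uint8_t iv[SKETCH_IV_SIZE])
{
    return (uint32_t) iv[0] | (uint32_t) iv[1] << 8 |
           (uint32_t) iv[2] << 16 | (uint32_t) iv[3] << 24;
}
```

Reading the length byte by byte avoids any alignment assumption, which matters because the field starts right after the 1-byte marker. The bounds check mirrors the one in test_tde_xlog_decode_pre(), where avail corresponds to xl_tot_len minus SizeOfXLogRecord.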
On 28/12/2025 9:49 AM, Henson Choi wrote:
I wonder if, instead of supporting a lot of extra hooks, it would be better
to provide an extensible SMGR API:
/messages/by-id/CAPP=Hha_wV1MV9yR70QZ5pk5dtNP+bOyBiFxPmrMKqnQeKMAwQ@mail.gmail.com
It seems to be a much more straightforward, convenient, and flexible
mechanism than adding hooks, and it can be used for many purposes
other than transparent encryption.
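For comparison with an smgr-level API, the hooks in this patch follow the usual PostgreSQL pattern visible in test_tde's _PG_init(): a global function pointer that is NULL by default, a call site that tests it before calling, and extensions that save the previous value and chain to it. The standalone sketch below shows only that pattern; write_pre_hook, core_write() and the two toy hooks are hypothetical, and an int stands in for the block buffers of the real mdwrite_pre_hook.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook type; an int stands in for a block buffer. */
typedef int (*write_pre_hook_type) (int block);

/* Core side: stays NULL unless some extension installs a hook. */
static write_pre_hook_type write_pre_hook = NULL;

/* Core write path: with no hook installed, the only cost is this NULL test. */
static int
core_write(int block)
{
    if (write_pre_hook != NULL)
        block = write_pre_hook(block);
    return block;                   /* stands in for what reaches disk */
}

/* Extension 1: toy "compression". */
static write_pre_hook_type prev_compress_hook = NULL;

static int
compress_hook(int block)
{
    if (prev_compress_hook != NULL)
        block = prev_compress_hook(block);  /* chain to the earlier hook first */
    return block / 2;
}

static void
install_compress(void)
{
    assert(write_pre_hook != compress_hook);   /* double install would self-chain */
    prev_compress_hook = write_pre_hook;
    write_pre_hook = compress_hook;
}

/* Extension 2: toy "encryption". */
static write_pre_hook_type prev_encrypt_hook = NULL;

static int
encrypt_hook(int block)
{
    if (prev_encrypt_hook != NULL)
        block = prev_encrypt_hook(block);   /* chain to the earlier hook first */
    return block ^ 0xFF;
}

static void
install_encrypt(void)
{
    assert(write_pre_hook != encrypt_hook);    /* double install would self-chain */
    prev_encrypt_hook = write_pre_hook;
    write_pre_hook = encrypt_hook;
}
```

Each hook calls its predecessor before doing its own work (as test_tde_mdwrite_pre() does), so load order decides application order: loading the compression module first and the encryption module second yields compress-then-encrypt on the write path.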
Hello!
I am glad to see that there are multiple TDE extension proposals being
worked on. For context, I am one of the developers working on the
pg_tde (https://github.com/percona/pg_tde) extension, as well as on the extensible SMGR proposal that
Konstantin already linked.
This patch/proposal contains two distinct parts of
encryption/extensibility, WAL and buffer manager/table data. Based on
earlier discussions, opinions on adding extension points to these
two are quite different, and because of that I'm not sure if bundling
them together is helpful.
It also appears to be missing some extension points that would be
required for a more complete encryption solution, such as encrypting
temporary files or system tables, or handling command-line utilities
like pg_waldump. Do you have ideas or patches in mind for those areas
as well?
I have the same question as Konstantin, why did you choose custom
hooks for the buffer manager instead of the already existing smgr
interface / extensibility patch? While that patch is not part of the
core (but I hope it will be), it is already used by multiple companies
as it supports other use cases, not only encryption. We plan to focus
more on that thread early next year, we would appreciate any
feedback/suggestions that could make it better for others.
I also noticed that you added additional flags to the page header.
Initially we were thinking about something like this, but decided that
the fork files are better for any encryption (or other storage
related) extra data. These few bits try to be generic, but they are also
restrictive because of the limited amount of data (and that data is
specifically per page; if I want something per file or per page range,
I still need a custom solution).
Regarding the WAL encryption part, we took a completely different
approach, similar to how we handle normal table data (page-based). I
will need to think more about this before I can provide meaningful
feedback on that part of the patch. One initial question, however, is
whether you have run detailed benchmarks with different workloads.
That seems to be the trickiest part there, since most of the code runs
in a critical section. (Not the "unused"/"empty hook" path, but the
overhead caused by a real encryption plugin using this hook in
practice)
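To make the size trade-off concrete: the patch email in this thread places a 5-bit Transform ID in bits 3-7 of pd_flags, where 0 means an untransformed page. The sketch below is a hypothetical illustration of that packing on a 16-bit flags word, not the patch's PageGetTransformId()/PageSetTransformId() code (whose definitions are not shown here). It assumes only that bits 0-2 stay reserved for the existing page flags (PD_HAS_FREE_LINES, PD_PAGE_FULL, PD_ALL_VISIBLE in current PostgreSQL).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: bits 0-2 keep the existing page flags, bits 3-7 the ID. */
#define SKETCH_TRANSFORM_SHIFT 3
#define SKETCH_TRANSFORM_BITS  5
#define SKETCH_TRANSFORM_MAX   ((1u << SKETCH_TRANSFORM_BITS) - 1)              /* 31 */
#define SKETCH_TRANSFORM_MASK  (SKETCH_TRANSFORM_MAX << SKETCH_TRANSFORM_SHIFT) /* 0xF8 */
#define SKETCH_TRANSFORM_NONE  0u                                    /* plain page */

/* Extract the 5-bit ID from a flags word. */
static unsigned
get_transform_id(uint16_t pd_flags)
{
    return (pd_flags & SKETCH_TRANSFORM_MASK) >> SKETCH_TRANSFORM_SHIFT;
}

/* Replace the ID bits, leaving every other flag bit unchanged. */
static uint16_t
set_transform_id(uint16_t pd_flags, unsigned id)
{
    assert(id <= SKETCH_TRANSFORM_MAX);     /* only 32 distinct values fit */
    return (uint16_t) ((pd_flags & ~SKETCH_TRANSFORM_MASK) |
                       (id << SKETCH_TRANSFORM_SHIFT));
}
```

With 0 reserved for plain pages, 31 IDs remain for algorithms or key versions, and the value is necessarily per page; anything per file or per page range needs storage elsewhere, as noted above.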
Hi,
Here is v3 of the Storage I/O Transform Hooks patch.
Changes from v2:
- Fix -Wincompatible-pointer-types error in bufmgr.c by casting
&bufdata to (void **) for mdread_post_hook call
v2 changes were:
- Add meson.build test configuration for test_tde extension
--
Best regards,
Sungkyun Park
On Sun, Dec 28, 2025 at 7:44 PM, Henson Choi <assam258@gmail.com> wrote:
Updated patches with meson build support:
v2:
- Added meson.build for test_tde extension
- Added test_tde to contrib/meson.build
Regards,
Henson Choi
On Sun, Dec 28, 2025 at 6:47 PM, Henson Choi <assam258@gmail.com> wrote:
Hello,
Following up on the RFC, I am submitting the initial patch set for the
proposed infrastructure. These patches introduce a minimal hook-based
protocol to allow extensions to handle data transformation, such as TDE,
while keeping the PostgreSQL core independent of specific cryptographic
implementations.
Implementation Details:
Hook Points in Storage I/O Path
The patch introduces five strategic hook points:
- mdread_post_hook: Called after blocks are read from disk. The extension
can reverse-transform data in place.
- mdwrite_pre_hook & mdextend_pre_hook: Called before writing or extending
blocks. These hooks return a pointer to transformed buffers.
- xlog_insert_pre_hook & xlog_decode_pre_hook: Handle transformation for
WAL records during insertion and replay.
Data Integrity and Checksum Protocol
To ensure robust error detection, the hooks follow a specific
verification protocol:
- On Write: The extension transforms the page, sets the Transform ID, then
recalculates the checksum on the transformed data.
- On Read: The extension verifies the on-disk checksum of the transformed
data first. After reverse-transformation, it clears the Transform ID and
recalculates the checksum for the plaintext data.
This ensures corruption is detected regardless of the transformation state.
WAL Safety via XLR_BLOCK_ID_TRANSFORMED (251)
For WAL records, I have introduced a specific block ID (251) to mark
transformed data. If the decryption extension is not loaded, the WAL reader
will encounter this unknown block ID and fail fast, preventing the system
from incorrectly interpreting encrypted data as valid WAL records.
PageHeader Transform ID (5-bit)
I have allocated bits 3-7 of pd_flags in the PageHeader for a Transform
ID. This allows the engine and extensions to identify the transformation
state of a page (e.g., key versioning or algorithm type) without attempting
decryption. It ensures backward compatibility: pages with Transform ID 0
are treated as standard untransformed pages.
Memory and Critical Section Safety
As demonstrated in the contrib/test_tde reference implementation, cipher
contexts are pre-allocated in _PG_init to avoid memory allocation during
critical sections. For WAL transformation,
MemoryContextAllowInCriticalSection() is used to allow buffer reallocation
within critical sections; if OOM occurs during buffer growth, it results in
a controlled PANIC.
Performance Considerations
When hooks are not set (default), the overhead is limited to a single
NULL pointer comparison per I/O operation. This is architecturally
consistent with existing PostgreSQL hooks and is designed to have a
negligible impact on performance.
Attached Patches:
- v20251228-0001-Add-Storage-I-O-Transform-Hooks-for-PostgreSQL.patch: Core
infrastructure.
- v20251228-0002-Add-test_tde-extension-for-TDE-testing.patch: Reference
implementation using AES-256-CTR.
I look forward to your comments and feedback.
Regards,
Henson Choi
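The write and read ordering under "Data Integrity and Checksum Protocol" can be modeled in a few lines. This is a deliberately simplified, hypothetical sketch: the "page" is 64 bytes with a 4-byte header (byte 0 holds the transform ID, bytes 2-3 the checksum), a byte-wise XOR stands in for AES-256-CTR, and a small multiplicative hash stands in for pg_checksum_page(). Only the order of operations follows the email.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 64
#define TOY_HDR_SIZE  4     /* [0] transform ID, [1] unused, [2..3] checksum */
#define TOY_ID_NONE   0     /* plaintext page */
#define TOY_ID_OURS   7     /* arbitrary ID for this toy transform */

/* Toy checksum over every byte except the checksum field itself. */
static uint16_t
toy_checksum(const uint8_t *page)
{
    uint16_t sum = 0;

    for (int i = 0; i < TOY_PAGE_SIZE; i++)
        if (i != 2 && i != 3)
            sum = (uint16_t) (sum * 31 + page[i]);
    return sum;
}

static void
toy_set_checksum(uint8_t *page)
{
    uint16_t c = toy_checksum(page);

    page[2] = (uint8_t) (c & 0xFF);
    page[3] = (uint8_t) (c >> 8);
}

static bool
toy_checksum_ok(const uint8_t *page)
{
    return toy_checksum(page) == (uint16_t) (page[2] | page[3] << 8);
}

/* Symmetric toy transform of the data area; the header stays readable. */
static void
toy_transform(uint8_t *page)
{
    for (int i = TOY_HDR_SIZE; i < TOY_PAGE_SIZE; i++)
        page[i] ^= 0x5A;
}

/* Write: transform, set the transform ID, checksum the transformed bytes. */
static void
toy_write(const uint8_t *plain, uint8_t *out)
{
    assert(plain != out);           /* the cached plaintext page stays untouched */
    memcpy(out, plain, TOY_PAGE_SIZE);
    toy_transform(out);
    out[0] = TOY_ID_OURS;
    toy_set_checksum(out);
}

/* Read: verify on-disk checksum, reverse, clear the ID, re-checksum plaintext. */
static bool
toy_read(uint8_t *page)
{
    if (page[0] != TOY_ID_OURS)
        return toy_checksum_ok(page);   /* not transformed: plain verification */
    if (!toy_checksum_ok(page))
        return false;                   /* caught before any decryption */
    toy_transform(page);
    page[0] = TOY_ID_NONE;
    toy_set_checksum(page);
    return true;
}
```

One difference from contrib/test_tde as posted: test_tde_mdread_post() reports a checksum mismatch on the transformed page as a WARNING and still decrypts it, whereas toy_read() stops and returns false.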
On Sun, Dec 28, 2025 at 4:49 PM, Henson Choi <assam258@gmail.com> wrote:
RFC: PostgreSQL Storage I/O Transformation Hooks Infrastructure for a
Technical Protocol Between RDBMS Core and Data Security Experts
*Author:* Henson Choi assam258@gmail.com
*Date:* 2025-12-28
*PostgreSQL Version:* master (Development)
------------------------------
1. Summary & Motivation
This RFC proposes the introduction of minimal hooks into the PostgreSQL
storage layer and the addition of a *Transformation ID* field to the
PageHeader.
A Diplomatic Protocol Between Expert Groups
The core motivation of this proposal is *“Separation of Concerns and
Mutual Respect.”*
Historically, discussions around Transparent Data Encryption (TDE) have
often felt like putting security experts on trial in a foreign
court—specifically, the “Court of RDBMS.” It is time to treat them not as
defendants to be judged by database-specific rules, but as an *equal
neighboring community* with their own specialized sovereignty.
*The issue has never been a failure of technology, but rather a
misplacement of the focal point.* While previous discussions were mired
in the technicalities of “how to hardcode encryption into the core,” this
proposal shifts the debate toward an architectural solution: “what
interface the core should provide to external experts.”
- *RDBMS Experts* provide a trusted pipeline responsible for data
I/O paths and consistency.
- *Security Experts* take responsibility for the specialized domain
of encryption algorithms and key management.
This hook system functions as a *Technical Protocol*: a high-level
agreement that allows these two expert groups to exchange data securely
without encroaching on each other’s territory.
------------------------------
2. Design Principles
1. *Delegation of Authority:* The core remains independent of
specific encryption standards, providing a “free territory” where security
experts can respond to an ever-changing security landscape.
2. *Diplomatic Convention:* The Transformation ID acts as a
communication protocol between the engine and the extension. The engine
uses this ID to identify the state of the data and hands over control to
the appropriate expert (the extension).
3. *Minimal Interference:* Overhead is kept near zero when hooks are
not in use, ensuring the native performance of the PostgreSQL engine.
------------------------------
3. Proposal Specifications
3.1 The Interface (Hook Points)
We allow intervention by security experts through five contact points
along the I/O path:
- *Read/Write Hooks:* mdread_post, mdwrite_pre, mdextend_pre
(Transformation of the data area)
- *WAL Hooks:* xlog_insert_pre, xlog_decode_pre (Transformation of
transaction logs)
3.2 The Protocol Identifier (PageHeader Transformation ID)
We allocate 5 bits of pd_flags to define the “Security State” of a
page. This serves as a *Status Message* sent by the security expert to
the engine, utilized for key versioning and as a migration marker.
------------------------------
4. Reference Implementation: contrib/test_tde
A Standard Code of Conduct for Security Experts
This reference implementation exists not as a commercial product, but to
define the *Standards of the Diplomatic Protocol* that
encryption/decryption experts must follow when entering the PostgreSQL
domain.
1. *Deterministic IV Derivation:* Demonstrates how to achieve
cryptographic safety by trusting unique values provided by the engine
(e.g., LSN).
2. *Critical Section Safety:* Defines memory management regulations
that security logic must follow within “Critical Sections” to maintain
system stability.
3. *Hook Chaining:* Demonstrates a cooperative structure that allows
peaceful coexistence with other expert tools (e.g., compression, auditing).
------------------------------
5. Scope
- *In-Scope:* Backend hook infrastructure, Transformation ID field,
and reference code demonstrating diplomatic protocol compliance.
- *Out-of-Scope:* Specific Key Management Systems (KMS), selection
of specific cryptographic algorithms, and integration with external tools.
This proposal represents a strategic diplomatic choice: rather than the
PostgreSQL core assuming all security responsibilities, it grants security
experts a *sovereign territory through extensions* where they can
perform at their best.
Attachments:
v20251228-v3-0001-Add-Storage-I-O-Transform-Hooks-for-PostgreSQL.patch
From 82ce5cc05f1ce0311a2eedd559f1db7a7703f126 Mon Sep 17 00:00:00 2001
From: Henson Choi <assam258@gmail.com>
Date: Tue, 2 Dec 2025 21:50:12 +0900
Subject: [PATCH v3 1/2] Add Storage I/O Transform Hooks for PostgreSQL
This patch introduces a set of hook points that allow extensions to
intercept and transform data during storage I/O operations. The hooks
are designed to support transparent data encryption (TDE) and similar
use cases that require data transformation at the storage layer.
The following hooks are added:
- mdread_post_hook / mdwrite_pre_hook / mdextend_pre_hook in md.c for
low-level storage manager I/O transformation; mdread_post_hook is also
invoked from buffer_readv_complete_one() in bufmgr.c so that
asynchronously read blocks are reverse-transformed before checksum
verification
- xlog_insert_pre_hook in xloginsert.c for WAL record transformation
after assembly and before insertion
- xlog_decode_pre_hook in xlogreader.c for WAL record transformation
before validation and decoding (e.g., during replay)
Each hook is optional and defaults to NULL, so the only cost when no
extension is loaded is a NULL-pointer check at each call site.
Author: Henson Choi <assam258@gmail.com>
---
src/backend/access/transam/xloginsert.c | 10 ++++
src/backend/access/transam/xlogreader.c | 21 ++++++++
src/backend/storage/buffer/bufmgr.c | 9 ++++
src/backend/storage/smgr/md.c | 20 ++++++++
src/include/access/xloginsert.h | 20 ++++++++
src/include/access/xlogreader.h | 20 ++++++++
src/include/access/xlogrecord.h | 5 ++
src/include/storage/bufpage.h | 25 +++++++++-
src/include/storage/md.h | 65 +++++++++++++++++++++++++
9 files changed, 194 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index a56d5a55282..f518ef3f16f 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -136,6 +136,12 @@ static bool begininsert_called = false;
/* Memory context to hold the registered buffer and data references. */
static MemoryContext xloginsert_cxt;
+/*
+ * Hook variable for WAL insert transformation (e.g., encryption).
+ * Extensions can set this hook to transform assembled WAL data before insertion.
+ */
+xlog_insert_pre_hook_type xlog_insert_pre_hook = NULL;
+
static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr RedoRecPtr, bool doPageWrites,
XLogRecPtr *fpw_lsn, int *num_fpi,
@@ -526,6 +532,10 @@ XLogInsert(RmgrId rmid, uint8 info)
&fpw_lsn, &num_fpi, &fpi_bytes,
&topxid_included);
+ /* Pre-insert hook for transformation (e.g., encryption) */
+ if (xlog_insert_pre_hook)
+ rdt = xlog_insert_pre_hook(rdt);
+
EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags, num_fpi,
fpi_bytes, topxid_included);
} while (!XLogRecPtrIsValid(EndPos));
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5e5001b2101..169f2b06fc5 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -40,6 +40,13 @@
#include "common/logging.h"
#endif
+/*
+ * Hook variable for WAL record transformation (e.g., decryption).
+ * Extensions can set this hook to transform raw WAL data before decoding.
+ * Frontend tools can also set this hook at startup.
+ */
+xlog_decode_pre_hook_type xlog_decode_pre_hook = NULL;
+
static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
pg_attribute_printf(2, 3);
static void allocate_recordbuf(XLogReaderState *state, uint32 reclength);
@@ -843,6 +850,11 @@ restart:
Assert(gotheader);
record = (XLogRecord *) state->readRecordBuf;
+
+ /* Pre-validation hook for transformation (e.g., decryption) */
+ if (xlog_decode_pre_hook)
+ record = xlog_decode_pre_hook(state, record, RecPtr, true);
+
if (!ValidXLogRecord(state, record, RecPtr))
goto err;
@@ -862,6 +874,15 @@ restart:
goto err;
/* Record does not cross a page boundary */
+
+ /*
+ * Pre-validation hook for transformation (e.g., decryption).
+ * inplace_allowed is false because record points to readBuf, which
+ * may be copied back to WAL files (e.g., FinishWalRecovery).
+ */
+ if (xlog_decode_pre_hook)
+ record = xlog_decode_pre_hook(state, record, RecPtr, false);
+
if (!ValidXLogRecord(state, record, RecPtr))
goto err;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index eb55102b0d7..ea0b62e98f2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -57,6 +57,7 @@
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/md.h"
#include "storage/proc.h"
#include "storage/read_stream.h"
#include "storage/smgr.h"
@@ -7401,6 +7402,14 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8 buf_off, Buffer buffer,
VALGRIND_MAKE_MEM_DEFINED(bufdata, BLCKSZ);
#endif
+ /* Decrypt block before checksum verification */
+ if (mdread_post_hook)
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&tag);
+
+ mdread_post_hook(&rlocator, tag.forkNum, tag.blockNum, (void **) &bufdata, 1);
+ }
+
if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags,
failed_checksum))
{
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 71bcdeb6601..5416128d2cc 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -96,6 +96,14 @@ typedef struct _MdfdVec
static MemoryContext MdCxt; /* context for all MdfdVec objects */
+/*
+ * Hook variables for I/O transformation (e.g., encryption/decryption).
+ * Extensions can set these hooks to transform data during storage I/O.
+ */
+mdread_post_hook_type mdread_post_hook = NULL;
+mdwrite_pre_hook_type mdwrite_pre_hook = NULL;
+mdextend_pre_hook_type mdextend_pre_hook = NULL;
+
/* Populate a file tag describing an md.c segment file. */
#define INIT_MD_FILETAG(a,xx_rlocator,xx_forknum,xx_segno) \
@@ -513,6 +521,10 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
relpath(reln->smgr_rlocator, forknum).str,
InvalidBlockNumber)));
+ /* Pre-extend hook for transformation (e.g., encryption) */
+ if (mdextend_pre_hook)
+ buffer = mdextend_pre_hook(&reln->smgr_rlocator.locator, forknum, blocknum, buffer);
+
v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -972,6 +984,10 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
iovcnt = compute_remaining_iovec(iov, iov, iovcnt, nbytes);
}
+ /* Post-read hook for transformation (e.g., decryption) */
+ if (mdread_post_hook)
+ mdread_post_hook(&reln->smgr_rlocator.locator, forknum, blocknum, buffers, nblocks_this_segment);
+
nblocks -= nblocks_this_segment;
buffers += nblocks_this_segment;
blocknum += nblocks_this_segment;
@@ -1064,6 +1080,10 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert((uint64) blocknum + (uint64) nblocks <= (uint64) mdnblocks(reln, forknum));
#endif
+ /* Pre-write hook for transformation (e.g., encryption) */
+ if (mdwrite_pre_hook)
+ buffers = mdwrite_pre_hook(&reln->smgr_rlocator.locator, forknum, blocknum, buffers, nblocks);
+
while (nblocks > 0)
{
struct iovec iov[PG_IOV_MAX];
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index d6a71415d4f..cc54459ad33 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -19,6 +19,26 @@
#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+/* Forward declaration for XLogRecData */
+struct XLogRecData;
+
+/*
+ * Hook function type for WAL insert transformation (e.g., encryption).
+ * Called after XLogRecordAssemble() but before XLogInsertRecord().
+ * Extension can transform the assembled WAL record data for encryption.
+ * Returns the (possibly modified) XLogRecData chain to be inserted.
+ *
+ * The first node's data points to XLogRecord header, which contains
+ * xl_rmid and xl_info if needed by the hook.
+ *
+ * On failure, the hook should either PANIC or return the original rdata
+ * as fallback.
+ */
+typedef struct XLogRecData *(*xlog_insert_pre_hook_type) (struct XLogRecData *rdata);
+
+/* Hook variable for WAL insert transformation */
+extern PGDLLIMPORT xlog_insert_pre_hook_type xlog_insert_pre_hook;
+
/*
* The minimum size of the WAL construction working area. If you need to
* register more than XLR_NORMAL_MAX_BLOCK_ID block references or have more
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index dfabbbd57d4..898d52a1013 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -400,6 +400,26 @@ extern bool DecodeXLogRecord(XLogReaderState *state,
XLogRecPtr lsn,
char **errormsg);
+/*
+ * Hook function type for WAL record transformation (e.g., decryption).
+ * Called before ValidXLogRecord() and DecodeXLogRecord().
+ * Extension can decrypt or transform the raw record data.
+ * Returns the (possibly modified) XLogRecord to be validated and decoded.
+ *
+ * If inplace_allowed is true, the hook may modify the record in place.
+ * If false, the hook must allocate a new buffer and return it.
+ *
+ * On failure, the hook should either PANIC or return the original record
+ * as fallback.
+ */
+typedef XLogRecord *(*xlog_decode_pre_hook_type) (XLogReaderState *state,
+ XLogRecord *record,
+ XLogRecPtr lsn,
+ bool inplace_allowed);
+
+/* Hook variable for WAL record transformation */
+extern PGDLLIMPORT xlog_decode_pre_hook_type xlog_decode_pre_hook;
+
/*
* Macros that provide access to parts of the record most recently returned by
* XLogReadRecord() or XLogNextRecord().
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index a06833ce0a3..9cfb2aff5ae 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -244,5 +244,10 @@ typedef struct XLogRecordDataHeaderLong
#define XLR_BLOCK_ID_DATA_LONG 254
#define XLR_BLOCK_ID_ORIGIN 253
#define XLR_BLOCK_ID_TOPLEVEL_XID 252
+/*
+ * I/O transform hook marker. Uses same header format as XLogRecordDataHeaderLong
+ * (1 byte id + 4 bytes length). Use SizeOfXLogRecordDataHeaderLong for size.
+ */
+#define XLR_BLOCK_ID_TRANSFORMED 251
#endif /* XLOGRECORD_H */
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index abc2cf2a020..f18f77d3d22 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -189,7 +189,17 @@ typedef PageHeaderData *PageHeader;
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+/*
+ * Transform ID field (5 bits: values 0-31) for I/O transform extensions.
+ * Value 0 means the page is not transformed (backward compatible).
+ * Values 1-31 are available for extensions to define their own meanings
+ * (e.g., encryption key versions, algorithm identifiers, migration markers).
+ */
+#define PD_TRANSFORM_ID_MASK 0x00F8 /* bits 3-7 */
+#define PD_TRANSFORM_ID_SHIFT 3
+#define PD_TRANSFORM_NONE 0 /* not transformed (core reserved) */
+
+#define PD_VALID_FLAG_BITS 0x00FF /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -441,6 +451,19 @@ PageClearAllVisible(Page page)
((PageHeader) page)->pd_flags &= ~PD_ALL_VISIBLE;
}
+static inline uint8
+PageGetTransformId(const PageData *page)
+{
+ return (((const PageHeaderData *) page)->pd_flags & PD_TRANSFORM_ID_MASK) >> PD_TRANSFORM_ID_SHIFT;
+}
+static inline void
+PageSetTransformId(Page page, uint8 id)
+{
+ ((PageHeader) page)->pd_flags =
+ (((PageHeader) page)->pd_flags & ~PD_TRANSFORM_ID_MASK) |
+ ((id << PD_TRANSFORM_ID_SHIFT) & PD_TRANSFORM_ID_MASK);
+}
+
/*
* These two require "access/transam.h", so left as macros.
*/
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index b563c27abf0..0a766a2b61f 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -22,6 +22,71 @@
extern PGDLLIMPORT const PgAioHandleCallbacks aio_md_readv_cb;
+/*
+ * Hook function types for I/O transformation (e.g., encryption/decryption).
+ * These hooks allow extensions to transform data during storage I/O operations.
+ */
+
+/*
+ * Called after blocks are read from disk, before PostgreSQL's checksum verification.
+ * Extension can reverse-transform (e.g., decrypt) the data in place.
+ *
+ * For synchronous reads, called from mdreadv() after read completes.
+ * For AIO reads, called from buffer_readv_complete_one() before PageIsVerified().
+ *
+ * Note: The hook is responsible for verifying on-disk checksum before reverse
+ * transformation and recalculating checksum after transformation. This ensures
+ * data integrity is verified at both stages and PostgreSQL's checksum verification
+ * passes.
+ *
+ * On failure, the hook should raise an ERROR (or PANIC for critical errors).
+ */
+typedef void (*mdread_post_hook_type) (RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers,
+ BlockNumber nblocks);
+
+/*
+ * Called before mdwritev() writes blocks to disk.
+ * Extension can transform (e.g., encrypt) data.
+ * Returns pointer to transformed buffers array (hook manages the memory,
+ * typically using static local storage).
+ *
+ * Note: The hook should recalculate checksum on transformed data after
+ * transformation. This on-disk checksum will be verified on read before
+ * reverse transformation, ensuring disk-level data integrity.
+ *
+ * On failure, the hook should raise an ERROR (or PANIC for critical errors),
+ * or return the original buffers with a WARNING as fallback.
+ */
+typedef const void **(*mdwrite_pre_hook_type) (RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers,
+ BlockNumber nblocks);
+
+/*
+ * Called before mdextend() extends a relation with new blocks.
+ * Returns pointer to transformed buffer (hook manages the memory,
+ * typically using static local storage).
+ *
+ * Note: Same as write hook - the hook should recalculate checksum on
+ * transformed data after transformation.
+ *
+ * On failure, the hook should raise an ERROR (or PANIC for critical errors),
+ * or return the original buffer with a WARNING as fallback.
+ */
+typedef const void *(*mdextend_pre_hook_type) (RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ const void *buffer);
+
+/* Hook variables for I/O transformation */
+extern PGDLLIMPORT mdread_post_hook_type mdread_post_hook;
+extern PGDLLIMPORT mdwrite_pre_hook_type mdwrite_pre_hook;
+extern PGDLLIMPORT mdextend_pre_hook_type mdextend_pre_hook;
+
/* md storage manager functionality */
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
--
2.50.1 (Apple Git-155)
v20251228-v3-0002-Add-test_tde-extension-for-TDE-testing.patch
From 68179d7770a4bd8abed5aabb261ef1e03f838500 Mon Sep 17 00:00:00 2001
From: Henson Choi <assam258@gmail.com>
Date: Tue, 2 Dec 2025 21:51:13 +0900
Subject: [PATCH v3 2/2] Add test_tde extension for TDE testing
This extension provides a reference implementation for validating the
Storage I/O Transform Hooks introduced in the previous commit. It uses
AES-256-CTR encryption with IV derived from page metadata (LSN, block
number, relation file number) to ensure uniqueness.
The extension registers hooks for:
- Buffer page read/write transformation (mdread/mdwrite/mdextend)
- WAL record insert and replay transformation
Key features:
- Encryption key configured via test_tde.key GUC (256-bit hex)
- System catalogs and pg_global tablespace excluded from encryption
- Pre-allocated cipher context to avoid allocation in critical sections
- WAL records marked with block ID 251 for encrypted record detection
This is intended for development and testing purposes only, not for
production use. The implementation lacks key rotation, proper key
management, and security auditing.
Author: Henson Choi <assam258@gmail.com>
---
contrib/Makefile | 4 +-
contrib/meson.build | 1 +
contrib/test_tde/.gitignore | 3 +
contrib/test_tde/Makefile | 27 +
contrib/test_tde/expected/basic.out | 177 +++++
contrib/test_tde/meson.build | 37 +
contrib/test_tde/sql/basic.sql | 146 ++++
contrib/test_tde/test_tde.c | 1131 +++++++++++++++++++++++++++
contrib/test_tde/test_tde.conf | 2 +
9 files changed, 1526 insertions(+), 2 deletions(-)
create mode 100644 contrib/test_tde/.gitignore
create mode 100644 contrib/test_tde/Makefile
create mode 100644 contrib/test_tde/expected/basic.out
create mode 100644 contrib/test_tde/meson.build
create mode 100644 contrib/test_tde/sql/basic.sql
create mode 100644 contrib/test_tde/test_tde.c
create mode 100644 contrib/test_tde/test_tde.conf
diff --git a/contrib/Makefile b/contrib/Makefile
index 2f0a88d3f77..151eb823850 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -54,9 +54,9 @@ SUBDIRS = \
vacuumlo
ifeq ($(with_ssl),openssl)
-SUBDIRS += pgcrypto sslinfo
+SUBDIRS += pgcrypto sslinfo test_tde
else
-ALWAYS_SUBDIRS += pgcrypto sslinfo
+ALWAYS_SUBDIRS += pgcrypto sslinfo test_tde
endif
ifneq ($(with_uuid),no)
diff --git a/contrib/meson.build b/contrib/meson.build
index ed30ee7d639..a592b947702 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -65,6 +65,7 @@ subdir('sslinfo')
subdir('tablefunc')
subdir('tcn')
subdir('test_decoding')
+subdir('test_tde')
subdir('tsm_system_rows')
subdir('tsm_system_time')
subdir('unaccent')
diff --git a/contrib/test_tde/.gitignore b/contrib/test_tde/.gitignore
new file mode 100644
index 00000000000..2ea3752951a
--- /dev/null
+++ b/contrib/test_tde/.gitignore
@@ -0,0 +1,3 @@
+log
+results
+tmp_check
diff --git a/contrib/test_tde/Makefile b/contrib/test_tde/Makefile
new file mode 100644
index 00000000000..b2455d3831e
--- /dev/null
+++ b/contrib/test_tde/Makefile
@@ -0,0 +1,27 @@
+# contrib/test_tde/Makefile
+
+MODULE_big = test_tde
+OBJS = \
+ $(WIN32RES) \
+ test_tde.o
+
+PGFILEDESC = "test_tde - reference implementation for I/O transform hooks"
+
+REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_tde/test_tde.conf
+REGRESS = basic
+# Disabled because these tests require "shared_preload_libraries=test_tde"
+NO_INSTALLCHECK = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_tde
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# OpenSSL is required for encryption
+SHLIB_LINK += $(filter -lcrypto, $(LIBS))
diff --git a/contrib/test_tde/expected/basic.out b/contrib/test_tde/expected/basic.out
new file mode 100644
index 00000000000..9932cf43614
--- /dev/null
+++ b/contrib/test_tde/expected/basic.out
@@ -0,0 +1,177 @@
+-- Basic test for test_tde extension
+-- Verify that encryption/decryption works correctly
+-- Show current settings
+SHOW test_tde.key;
+ test_tde.key
+------------------------------------------------------------------
+ 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
+(1 row)
+
+-- Create a test table
+CREATE TABLE test_encrypt (
+ id serial PRIMARY KEY,
+ secret_data text,
+ secret_number integer
+);
+-- Insert some data
+INSERT INTO test_encrypt (secret_data, secret_number) VALUES
+ ('This is secret data', 12345),
+ ('Another secret message', 67890),
+ ('PostgreSQL TDE test', 11111);
+-- Force a checkpoint to ensure data is written to disk
+CHECKPOINT;
+-- Read data back - should be decrypted correctly
+SELECT * FROM test_encrypt ORDER BY id;
+ id | secret_data | secret_number
+----+------------------------+---------------
+ 1 | This is secret data | 12345
+ 2 | Another secret message | 67890
+ 3 | PostgreSQL TDE test | 11111
+(3 rows)
+
+-- Update some data
+UPDATE test_encrypt SET secret_data = 'Updated secret' WHERE id = 1;
+-- Verify update worked
+SELECT * FROM test_encrypt WHERE id = 1;
+ id | secret_data | secret_number
+----+----------------+---------------
+ 1 | Updated secret | 12345
+(1 row)
+
+-- Test with larger data
+INSERT INTO test_encrypt (secret_data, secret_number)
+SELECT
+ repeat('Large data block ', 100),
+ generate_series
+FROM generate_series(1, 10);
+-- Count rows
+SELECT COUNT(*) FROM test_encrypt;
+ count
+-------
+ 13
+(1 row)
+
+-- Test with NULL values
+INSERT INTO test_encrypt (secret_data, secret_number) VALUES (NULL, NULL);
+SELECT * FROM test_encrypt WHERE secret_data IS NULL;
+ id | secret_data | secret_number
+----+-------------+---------------
+ 14 | |
+(1 row)
+
+-- Test index creation (index pages should also be encrypted)
+CREATE INDEX ON test_encrypt (secret_number);
+-- Use the index
+SELECT secret_data FROM test_encrypt WHERE secret_number = 12345;
+ secret_data
+----------------
+ Updated secret
+(1 row)
+
+-- Clean up
+DROP TABLE test_encrypt;
+-- =============================================================================
+-- DDL Tests: Operations that change RelFileNumber
+-- These operations create new files and write records through storage hooks,
+-- so encryption/decryption works correctly.
+-- =============================================================================
+-- -----------------------------------------------------------------------------
+-- Test 1: TRUNCATE (creates new file, writes through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_truncate (id int, data text);
+INSERT INTO test_truncate VALUES (1, 'before truncate');
+SELECT * FROM test_truncate;
+ id | data
+----+-----------------
+ 1 | before truncate
+(1 row)
+
+TRUNCATE test_truncate;
+-- Insert new data after truncate - works fine (new file, new encryption through hooks)
+INSERT INTO test_truncate VALUES (2, 'after truncate');
+SELECT * FROM test_truncate;
+ id | data
+----+----------------
+ 2 | after truncate
+(1 row)
+
+DROP TABLE test_truncate;
+-- -----------------------------------------------------------------------------
+-- Test 2: CLUSTER (rewrites table through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_cluster (id int PRIMARY KEY, data text);
+INSERT INTO test_cluster SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+CHECKPOINT;
+CLUSTER test_cluster USING test_cluster_pkey;
+-- Works fine - data rewritten through storage hooks
+SELECT COUNT(*) FROM test_cluster;
+ count
+-------
+ 100
+(1 row)
+
+SELECT * FROM test_cluster WHERE id = 50;
+ id | data
+----+---------
+ 50 | data 50
+(1 row)
+
+DROP TABLE test_cluster;
+-- -----------------------------------------------------------------------------
+-- Test 3: VACUUM FULL (rewrites table through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_vacuum_full (id int, data text);
+INSERT INTO test_vacuum_full SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+DELETE FROM test_vacuum_full WHERE id > 50;
+CHECKPOINT;
+VACUUM FULL test_vacuum_full;
+-- Works fine - data rewritten through storage hooks
+SELECT COUNT(*) FROM test_vacuum_full;
+ count
+-------
+ 50
+(1 row)
+
+DROP TABLE test_vacuum_full;
+-- -----------------------------------------------------------------------------
+-- Test 4: REINDEX (rebuilds index through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_reindex (id int PRIMARY KEY, data text);
+INSERT INTO test_reindex SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+CHECKPOINT;
+REINDEX INDEX test_reindex_pkey;
+-- Works fine - index rebuilt through storage hooks
+SET enable_seqscan = off;
+SELECT * FROM test_reindex WHERE id = 50;
+ id | data
+----+---------
+ 50 | data 50
+(1 row)
+
+RESET enable_seqscan;
+DROP TABLE test_reindex;
+-- =============================================================================
+-- Additional DDL Tests: Operations that change RelFileNumber or copy files
+-- These also go through storage hooks, so encryption/decryption works correctly.
+-- =============================================================================
+-- -----------------------------------------------------------------------------
+-- Test 5: ALTER TABLE SET TABLESPACE
+-- RelFileNumber changes, but data is copied through storage hooks
+-- -----------------------------------------------------------------------------
+\! mkdir -p /tmp/test_tde_tablespace
+CREATE TABLESPACE test_tde_tblspc LOCATION '/tmp/test_tde_tablespace';
+CREATE TABLE test_set_tablespace (id int, data text);
+INSERT INTO test_set_tablespace SELECT g, 'data ' || g FROM generate_series(1, 50) g;
+CHECKPOINT;
+-- Move to different tablespace - data copied through storage hooks
+ALTER TABLE test_set_tablespace SET TABLESPACE test_tde_tblspc;
+-- Works fine - data was re-encrypted with new RelFileNumber
+SELECT COUNT(*) FROM test_set_tablespace;
+ count
+-------
+ 50
+(1 row)
+
+DROP TABLE test_set_tablespace;
+DROP TABLESPACE test_tde_tblspc;
+\! rm -rf /tmp/test_tde_tablespace
diff --git a/contrib/test_tde/meson.build b/contrib/test_tde/meson.build
new file mode 100644
index 00000000000..329e1a4b8e2
--- /dev/null
+++ b/contrib/test_tde/meson.build
@@ -0,0 +1,37 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+if not ssl.found()
+ subdir_done()
+endif
+
+test_tde_sources = files(
+ 'test_tde.c',
+)
+
+if host_system == 'windows'
+ test_tde_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tde',
+ '--FILEDESC', 'test_tde - reference implementation for I/O transform hooks',])
+endif
+
+test_tde = shared_module('test_tde',
+ test_tde_sources,
+ kwargs: contrib_mod_args + {
+ 'dependencies': [ssl, contrib_mod_args['dependencies']]
+ },
+)
+contrib_targets += test_tde
+
+tests += {
+ 'name': 'test_tde',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'basic',
+ ],
+ 'regress_args': ['--temp-config', files('test_tde.conf')],
+ # Disabled because these tests require "shared_preload_libraries=test_tde"
+ 'runningcheck': false,
+ },
+}
diff --git a/contrib/test_tde/sql/basic.sql b/contrib/test_tde/sql/basic.sql
new file mode 100644
index 00000000000..9b2651afee8
--- /dev/null
+++ b/contrib/test_tde/sql/basic.sql
@@ -0,0 +1,146 @@
+-- Basic test for test_tde extension
+-- Verify that encryption/decryption works correctly
+
+-- Show current settings
+SHOW test_tde.key;
+
+-- Create a test table
+CREATE TABLE test_encrypt (
+ id serial PRIMARY KEY,
+ secret_data text,
+ secret_number integer
+);
+
+-- Insert some data
+INSERT INTO test_encrypt (secret_data, secret_number) VALUES
+ ('This is secret data', 12345),
+ ('Another secret message', 67890),
+ ('PostgreSQL TDE test', 11111);
+
+-- Force a checkpoint to ensure data is written to disk
+CHECKPOINT;
+
+-- Read data back - should be decrypted correctly
+SELECT * FROM test_encrypt ORDER BY id;
+
+-- Update some data
+UPDATE test_encrypt SET secret_data = 'Updated secret' WHERE id = 1;
+
+-- Verify update worked
+SELECT * FROM test_encrypt WHERE id = 1;
+
+-- Test with larger data
+INSERT INTO test_encrypt (secret_data, secret_number)
+SELECT
+ repeat('Large data block ', 100),
+ generate_series
+FROM generate_series(1, 10);
+
+-- Count rows
+SELECT COUNT(*) FROM test_encrypt;
+
+-- Test with NULL values
+INSERT INTO test_encrypt (secret_data, secret_number) VALUES (NULL, NULL);
+SELECT * FROM test_encrypt WHERE secret_data IS NULL;
+
+-- Test index creation (index pages should also be encrypted)
+CREATE INDEX ON test_encrypt (secret_number);
+
+-- Use the index
+SELECT secret_data FROM test_encrypt WHERE secret_number = 12345;
+
+-- Clean up
+DROP TABLE test_encrypt;
+
+-- =============================================================================
+-- DDL Tests: Operations that change RelFileNumber
+-- These operations create new files and write records through storage hooks,
+-- so encryption/decryption works correctly.
+-- =============================================================================
+
+-- -----------------------------------------------------------------------------
+-- Test 1: TRUNCATE (creates new file, writes through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_truncate (id int, data text);
+INSERT INTO test_truncate VALUES (1, 'before truncate');
+SELECT * FROM test_truncate;
+
+TRUNCATE test_truncate;
+
+-- Insert new data after truncate - works fine (new file, new encryption through hooks)
+INSERT INTO test_truncate VALUES (2, 'after truncate');
+SELECT * FROM test_truncate;
+
+DROP TABLE test_truncate;
+
+-- -----------------------------------------------------------------------------
+-- Test 2: CLUSTER (rewrites table through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_cluster (id int PRIMARY KEY, data text);
+INSERT INTO test_cluster SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+CHECKPOINT;
+
+CLUSTER test_cluster USING test_cluster_pkey;
+
+-- Works fine - data rewritten through storage hooks
+SELECT COUNT(*) FROM test_cluster;
+SELECT * FROM test_cluster WHERE id = 50;
+
+DROP TABLE test_cluster;
+
+-- -----------------------------------------------------------------------------
+-- Test 3: VACUUM FULL (rewrites table through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_vacuum_full (id int, data text);
+INSERT INTO test_vacuum_full SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+DELETE FROM test_vacuum_full WHERE id > 50;
+CHECKPOINT;
+
+VACUUM FULL test_vacuum_full;
+
+-- Works fine - data rewritten through storage hooks
+SELECT COUNT(*) FROM test_vacuum_full;
+
+DROP TABLE test_vacuum_full;
+
+-- -----------------------------------------------------------------------------
+-- Test 4: REINDEX (rebuilds index through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_reindex (id int PRIMARY KEY, data text);
+INSERT INTO test_reindex SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+CHECKPOINT;
+
+REINDEX INDEX test_reindex_pkey;
+
+-- Works fine - index rebuilt through storage hooks
+SET enable_seqscan = off;
+SELECT * FROM test_reindex WHERE id = 50;
+RESET enable_seqscan;
+
+DROP TABLE test_reindex;
+
+-- =============================================================================
+-- Additional DDL Tests: Operations that change RelFileNumber or copy files
+-- These also go through storage hooks, so encryption/decryption works correctly.
+-- =============================================================================
+
+-- -----------------------------------------------------------------------------
+-- Test 5: ALTER TABLE SET TABLESPACE
+-- RelFileNumber changes, but data is copied through storage hooks
+-- -----------------------------------------------------------------------------
+\! mkdir -p /tmp/test_tde_tablespace
+CREATE TABLESPACE test_tde_tblspc LOCATION '/tmp/test_tde_tablespace';
+
+CREATE TABLE test_set_tablespace (id int, data text);
+INSERT INTO test_set_tablespace SELECT g, 'data ' || g FROM generate_series(1, 50) g;
+CHECKPOINT;
+
+-- Move to different tablespace - data copied through storage hooks
+ALTER TABLE test_set_tablespace SET TABLESPACE test_tde_tblspc;
+
+-- Works fine - data was re-encrypted with new RelFileNumber
+SELECT COUNT(*) FROM test_set_tablespace;
+
+DROP TABLE test_set_tablespace;
+DROP TABLESPACE test_tde_tblspc;
+\! rm -rf /tmp/test_tde_tablespace
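The DDL tests above depend on the page IV including the RelFileNumber. Whenever an operation places a table's data under a new RelFileNumber, the pages must be rewritten through the storage hooks so they are re-encrypted with the new IV. The following is a minimal standalone sketch of how the IV reacts to such a change. It reproduces only the byte layout of derive_iv() from test_tde.c and involves no OpenSSL. The function and helper names are hypothetical, and the RelFileNumber is passed as a plain uint32_t rather than through a RelFileLocator.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Standalone re-implementation of the IV byte layout used by derive_iv()
 * in test_tde.c.  Simplification (assumption): the RelFileNumber is passed
 * as a plain uint32_t instead of being read from a RelFileLocator.
 */
static void
derive_iv_sketch(uint8_t iv[16], uint32_t rel_number, uint32_t blocknum,
                 uint64_t lsn)
{
    iv[0] = (uint8_t) (lsn >> 0);           /* LSN bits 0-31 */
    iv[1] = (uint8_t) (lsn >> 8);
    iv[2] = (uint8_t) (lsn >> 16);
    iv[3] = (uint8_t) (lsn >> 24);
    iv[4] = (uint8_t) (lsn >> 32);          /* LSN bits 32-47 */
    iv[5] = (uint8_t) (lsn >> 40);
    iv[6] = (uint8_t) (blocknum >> 0);      /* BlockNumber bits 0-23 */
    iv[7] = (uint8_t) (blocknum >> 8);
    iv[8] = (uint8_t) (blocknum >> 16);
    iv[9] = (uint8_t) (rel_number >> 0);    /* RelFileNumber bits 0-23 */
    iv[10] = (uint8_t) (rel_number >> 8);
    iv[11] = (uint8_t) (rel_number >> 16);
    iv[12] = (uint8_t) (blocknum >> 24);    /* BlockNumber bits 24-31 */
    iv[13] = (uint8_t) (rel_number >> 24);  /* RelFileNumber bits 24-31 */
    iv[14] = (uint8_t) (lsn >> 48);         /* LSN bits 48-63 */
    iv[15] = (uint8_t) (lsn >> 56);
}

/* Hypothetical helper: return byte 'idx' of the derived IV. */
static uint8_t
iv_byte(uint32_t rel_number, uint32_t blocknum, uint64_t lsn, int idx)
{
    uint8_t     iv[16];

    assert(idx >= 0 && idx < 16);
    derive_iv_sketch(iv, rel_number, blocknum, lsn);
    return iv[idx];
}

/*
 * Hypothetical helper: count the IV bytes that differ when the same block,
 * at the same LSN, is stored under two different RelFileNumbers.
 */
static int
iv_bytes_changed_by_relnumber(uint32_t old_rel, uint32_t new_rel,
                              uint32_t blocknum, uint64_t lsn)
{
    uint8_t     a[16];
    uint8_t     b[16];
    int         changed = 0;

    derive_iv_sketch(a, old_rel, blocknum, lsn);
    derive_iv_sketch(b, new_rel, blocknum, lsn);
    for (int i = 0; i < 16; i++)
        changed += (a[i] != b[i]);
    return changed;
}
```

Take block 5 at LSN 1/A3B2C8. Moving it from RelFileNumber 16384 to 16390 changes only IV byte 9 (0x00 to 0x06). The CTR counter only advances the last IV bytes, so the two copies get non-overlapping keystreams. A raw file copy that bypassed the hooks would therefore decrypt to garbage, which is the known limitation noted in derive_iv().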
diff --git a/contrib/test_tde/test_tde.c b/contrib/test_tde/test_tde.c
new file mode 100644
index 00000000000..f70359f1c26
--- /dev/null
+++ b/contrib/test_tde/test_tde.c
@@ -0,0 +1,1131 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_tde.c
+ * Reference implementation for Storage I/O Transform Hooks
+ *
+ * WARNING: This is for TESTING ONLY. Do not use in production.
+ * - Key stored in plaintext GUC
+ * - No key rotation
+ * - Minimal error handling
+ * - Not audited for security
+ *
+ * For production TDE, use a dedicated extension project.
+ *
+ * This extension demonstrates how to use the storage I/O transform hooks
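+ * for transparent data encryption. It uses AES-256-CTR for both data pages
+ * and WAL records: page IVs are derived from the page LSN and block location,
+ * while each WAL record gets an IV made of its plaintext payload CRC plus 12
+ * random bytes.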
+ *
+ * Author: Henson Choi <assam258@gmail.com>
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/test_tde/test_tde.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <openssl/err.h>
+#include <openssl/evp.h>
+#include <string.h>
+
+#include "access/transam.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogreader.h"
+#include "access/xlogrecord.h"
+#include "catalog/pg_tablespace_d.h"
+#include "fmgr.h"
+#include "port/pg_crc32c.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
+#include "storage/md.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC_EXT(
+ .name = "test_tde",
+ .version = PG_VERSION
+);
+
+/* ----------
+ * GUC variables
+ * ----------
+ */
+static char *test_tde_key_hex = NULL; /* 64 hex chars = 256 bits */
+
+/* ----------
+ * Module state
+ * ----------
+ */
+
+/*
+ * Memory context for encryption buffers.
+ * Allows allocation in critical sections (for WAL encryption).
+ */
+static MemoryContext test_tde_cxt = NULL;
+
+/*
+ * Transform ID for this extension.
+ * Value 1 means page is encrypted with test_tde.
+ * Value 0 means page is not transformed (plaintext).
+ */
+#define TEST_TDE_TRANSFORM_ID 1
+
+/*
+ * Dynamic buffers for encrypted pages.
+ * Grows as needed, freed in _PG_fini.
+ */
+static char *encrypt_buffer = NULL;
+static const void **encrypt_buffer_ptrs = NULL;
+static BlockNumber encrypt_buffer_nblocks = 0;
+
+/*
+ * WAL encryption buffer - allocated from test_tde_cxt which allows
+ * allocation in critical sections via MemoryContextAllowInCriticalSection().
+ */
+static char *wal_encrypt_buffer = NULL;
+static Size wal_encrypt_buffer_size = 0;
+
+/*
+ * WAL decryption buffer - static, only needed for records within a single page.
+ * When inplace_allowed=false, record doesn't cross page boundary, so max size
+ * is XLOG_BLCKSZ.
+ */
+static char wal_decrypt_buffer[XLOG_BLCKSZ];
+
+/*
+ * Pre-allocated OpenSSL cipher context.
+ * Created in _PG_init() and reused for all encrypt/decrypt operations.
+ * This avoids memory allocation in critical sections.
+ */
+static EVP_CIPHER_CTX *cipher_ctx = NULL;
+
+/*
+ * Transformed WAL record structure (using XLR_BLOCK_ID_TRANSFORMED from xlogrecord.h):
+ * [XLogRecord header]
+ * [block_id=251 (1B)]
+ * [payload_length (4B)]
+ * [IV (16B)]
+ * [encrypted payload]
+ *
+ * The block ID 251 marks this record as transformed. After decryption,
+ * the marker, length, and IV are removed, restoring the original structure.
+ * If decryption is not performed, the unknown block ID causes parse failure.
+ *
+ * Note: The 21-byte overhead may temporarily cause xl_tot_len to exceed
+ * XLogRecordMaxSize after encryption. This is safe because:
+ * - XLogRecordMaxSize is only checked in XLogRecordAssemble() before our hook
+ * - XLogInsertRecord() does not re-validate the size
+ * - The decode hook removes the overhead before WAL parsing, restoring the
+ * original size which was already validated
+ */
+#define WAL_ENCRYPT_IV_SIZE 16
+#define WAL_ENCRYPT_OVERHEAD (SizeOfXLogRecordDataHeaderLong + WAL_ENCRYPT_IV_SIZE)
+#define WAL_CRC_SIZE sizeof(pg_crc32c) /* 4 bytes */
+#define WAL_IV_RANDOM_SIZE (WAL_ENCRYPT_IV_SIZE - WAL_CRC_SIZE) /* 12 bytes */
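+
+/*
+ * Worked example of the sizes above (record sizes chosen for illustration):
+ * SizeOfXLogRecordDataHeaderLong is 1 + 4 = 5 bytes (block ID + uint32
+ * length), so WAL_ENCRYPT_OVERHEAD is 5 + 16 = 21 bytes. A record with
+ * xl_tot_len = 124 (24-byte XLogRecord header + 100-byte payload) is written
+ * with xl_tot_len = 145 and a stored payload_length of 16 + 100 = 116; the
+ * decode hook turns it back into a 124-byte record.
+ */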
+
+/* Static XLogRecData for returning encrypted WAL */
+static XLogRecData wal_rdata_head;
+
+/* Previous hook values (for chaining) */
+static mdread_post_hook_type prev_mdread_post_hook = NULL;
+static mdwrite_pre_hook_type prev_mdwrite_pre_hook = NULL;
+static mdextend_pre_hook_type prev_mdextend_pre_hook = NULL;
+static xlog_insert_pre_hook_type prev_xlog_insert_pre_hook = NULL;
+static xlog_decode_pre_hook_type prev_xlog_decode_pre_hook = NULL;
+
+/* ----------
+ * Function declarations
+ * ----------
+ */
+
+/* Module entry points */
+void _PG_init(void);
+void _PG_fini(void);
+
+/* GUC callbacks */
+static bool check_test_tde_key(char **newval, void **extra, GucSource source);
+
+/* Hook functions */
+static void test_tde_mdread_post(RelFileLocator *rlocator, ForkNumber forknum,
+ BlockNumber blocknum, void **buffers,
+ BlockNumber nblocks);
+static const void **test_tde_mdwrite_pre(RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers,
+ BlockNumber nblocks);
+static const void *test_tde_mdextend_pre(RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ const void *buffer);
+static struct XLogRecData *test_tde_xlog_insert_pre(struct XLogRecData *rdata);
+static XLogRecord *test_tde_xlog_decode_pre(XLogReaderState *state,
+ XLogRecord *record,
+ XLogRecPtr lsn,
+ bool inplace_allowed);
+
+/* Internal helper functions */
+static void ensure_encrypt_buffer(BlockNumber nblocks);
+static bool parse_hex_key(const char *hex, unsigned char *out, int outlen);
+static void derive_iv(unsigned char *iv, RelFileLocator *rlocator,
+ BlockNumber blocknum, XLogRecPtr lsn);
+static void transform_data(const unsigned char *in, unsigned char *out,
+ int len, const unsigned char *iv);
+static bool should_transform(RelFileLocator *rlocator, ForkNumber forknum);
+
+
+/* ----------
+ * Internal helper functions
+ * ----------
+ */
+
+/*
+ * Parse hex string to bytes
+ */
+static bool
+parse_hex_key(const char *hex, unsigned char *out, int outlen)
+{
+ int i;
+ int hexlen;
+
+ if (hex == NULL)
+ return false;
+
+ hexlen = strlen(hex);
+ if (hexlen != outlen * 2)
+ return false;
+
+ for (i = 0; i < outlen; i++)
+ {
+ int hi,
+ lo;
+ char c;
+
+ c = hex[i * 2];
+ if (c >= '0' && c <= '9')
+ hi = c - '0';
+ else if (c >= 'a' && c <= 'f')
+ hi = c - 'a' + 10;
+ else if (c >= 'A' && c <= 'F')
+ hi = c - 'A' + 10;
+ else
+ return false;
+
+ c = hex[i * 2 + 1];
+ if (c >= '0' && c <= '9')
+ lo = c - '0';
+ else if (c >= 'a' && c <= 'f')
+ lo = c - 'a' + 10;
+ else if (c >= 'A' && c <= 'F')
+ lo = c - 'A' + 10;
+ else
+ return false;
+
+ out[i] = (hi << 4) | lo;
+ }
+
+ return true;
+}
+
+/*
+ * Ensure encrypt buffer can hold 'nblocks' pages.
+ * Grows by 2x when needed. Uses test_tde_cxt for persistence.
+ */
+static void
+ensure_encrypt_buffer(BlockNumber nblocks)
+{
+ if (encrypt_buffer_nblocks >= nblocks)
+ return;
+
+ if (encrypt_buffer == NULL)
+ {
+ BlockNumber initial = Max(8, nblocks);
+ Size size = (Size) initial * BLCKSZ;
+
+ encrypt_buffer = MemoryContextAllocAligned(test_tde_cxt, size,
+ PG_IO_ALIGN_SIZE, 0);
+ encrypt_buffer_ptrs = MemoryContextAlloc(test_tde_cxt,
+ initial * sizeof(void *));
+ encrypt_buffer_nblocks = initial;
+ }
+ else
+ {
+ BlockNumber new_nblocks = encrypt_buffer_nblocks;
+ Size new_size;
+
+ while (new_nblocks < nblocks)
+ new_nblocks *= 2;
+
+ new_size = (Size) new_nblocks * BLCKSZ;
+
+ /* repalloc doesn't preserve alignment, so allocate new and copy */
+ {
+ char *new_buffer = MemoryContextAllocAligned(test_tde_cxt,
+ new_size,
+ PG_IO_ALIGN_SIZE, 0);
+
+ memcpy(new_buffer, encrypt_buffer,
+ (Size) encrypt_buffer_nblocks * BLCKSZ);
+ pfree(encrypt_buffer);
+ encrypt_buffer = new_buffer;
+ }
+
+ encrypt_buffer_ptrs = repalloc(encrypt_buffer_ptrs,
+ new_nblocks * sizeof(void *));
+ encrypt_buffer_nblocks = new_nblocks;
+ }
+
+ /* Update pointers array */
+ for (BlockNumber i = 0; i < encrypt_buffer_nblocks; i++)
+ encrypt_buffer_ptrs[i] = encrypt_buffer + (Size) i * BLCKSZ;
+}
+
+
+/*
+ * Derive IV from page location and header
+ *
+ * IV structure (16 bytes) - simple, deterministic layout:
+ *
+ * AES-CTR mode only requires IV uniqueness, not randomness.
+ * The combination of LSN + RelFileNumber + BlockNumber guarantees uniqueness:
+ * - LSN: Globally unique across entire WAL stream
+ * - RelFileNumber: Unique within database
+ * - BlockNumber: Unique within relation
+ *
+ * Even when a single WAL record modifies multiple pages (e.g., B-tree split),
+ * the BlockNumber distinguishes each page.
+ *
+ * Layout (high entropy bytes first, low entropy bytes last for CTR counter space):
+ * [0-3] LSN low 32 bits - changes frequently (high entropy)
+ * [4-5] LSN bits 32-47 - mid entropy
+ * [6-8] BlockNumber low 24 bits
+ * [9-11] RelFileNumber low 24 bits
+ * [12] BlockNumber high 8 bits - usually 0 for small tables
+ * [13] RelFileNumber high 8 bits - usually 0
+ * [14-15] LSN bits 48-63 - usually 0, counter space for CTR
+ *
+ * CTR counter space analysis:
+ * - Page size: 8KB, encrypted area: 8168 bytes (excluding 24-byte header)
+ * - AES block size: 16 bytes
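+ * - AES blocks per page: ceil(8168/16) = 511, so the counter advances at
+ *   most 510 (0x1FE) times within a page
+ * - Counter affects only IV[14-15] (0x1FE < 0x10000), provided those bytes
+ *   start near 0; they hold LSN bits 48-63, which are usually 0
+ * - Bytes 12-15 provide 2^32 counter space, far exceeding the 511 blocks needed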
+ * - Collision requires same IV[0-11], which means same LSN+BlockNum+RelNum
+ *
+ * Note: spcOid, dbOid not used - RelFileNumber is sufficient for uniqueness.
+ *
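+ * Known limitations:
+ * - Operations that copy/move files while changing RelFileNumber without
+ *   going through storage hooks cause decryption failure.
+ * - Hint-bit updates (MarkBufferDirtyHint) can modify a page without
+ *   assigning it a new LSN, so the same page may be written more than once
+ *   with the same IV, which reuses CTR keystream.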
+ */
+static void
+derive_iv(unsigned char *iv, RelFileLocator *rlocator,
+ BlockNumber blocknum, XLogRecPtr lsn)
+{
+
+ /*
+ * Layout: High entropy first, low entropy (usually 0) last.
+ * [LSN low 4B][LSN mid 2B][BlockNum low 3B][RelNum low 3B]
+ * [BlockNum high 1B][RelNum high 1B][LSN high 2B]
+ */
+
+ /* LSN low 32 bits - bytes 0-3 (high entropy, changes frequently) */
+ iv[0] = (uint8) ((lsn >> 0) & 0xFF);
+ iv[1] = (uint8) ((lsn >> 8) & 0xFF);
+ iv[2] = (uint8) ((lsn >> 16) & 0xFF);
+ iv[3] = (uint8) ((lsn >> 24) & 0xFF);
+
+ /* LSN bits 32-47 - bytes 4-5 (mid entropy) */
+ iv[4] = (uint8) ((lsn >> 32) & 0xFF);
+ iv[5] = (uint8) ((lsn >> 40) & 0xFF);
+
+ /* BlockNumber low 24 bits - bytes 6-8 */
+ iv[6] = (uint8) ((blocknum >> 0) & 0xFF);
+ iv[7] = (uint8) ((blocknum >> 8) & 0xFF);
+ iv[8] = (uint8) ((blocknum >> 16) & 0xFF);
+
+ /* RelFileNumber low 24 bits - bytes 9-11 */
+ iv[9] = (uint8) ((rlocator->relNumber >> 0) & 0xFF);
+ iv[10] = (uint8) ((rlocator->relNumber >> 8) & 0xFF);
+ iv[11] = (uint8) ((rlocator->relNumber >> 16) & 0xFF);
+
+ /* BlockNumber high 8 bits - byte 12 (usually 0 for small tables) */
+ iv[12] = (uint8) ((blocknum >> 24) & 0xFF);
+
+ /* RelFileNumber high 8 bits - byte 13 (usually 0) */
+ iv[13] = (uint8) ((rlocator->relNumber >> 24) & 0xFF);
+
+ /* LSN bits 48-63 - bytes 14-15 (usually 0, counter space for CTR) */
+ iv[14] = (uint8) ((lsn >> 48) & 0xFF);
+ iv[15] = (uint8) ((lsn >> 56) & 0xFF);
+}
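+
+/*
+ * Worked example (inputs chosen for illustration): for LSN 1/A3B2C8
+ * (0x0000000100A3B2C8), block 5 and RelFileNumber 16384 (0x4000),
+ * derive_iv() produces
+ *
+ *   C8 B2 A3 00 | 01 00 | 05 00 00 | 00 40 00 | 00 | 00 | 00 00
+ *
+ * i.e. LSN low 32 bits, LSN bits 32-47, block low 24 bits, RelFileNumber
+ * low 24 bits, block high byte, RelFileNumber high byte, LSN bits 48-63.
+ */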
+
+/*
+ * Encrypt or decrypt data using AES-256-CTR
+ *
+ * AES-CTR is symmetric: encrypt and decrypt use the same operation.
+ */
+static void
+transform_data(const unsigned char *in, unsigned char *out, int len,
+ const unsigned char *iv)
+{
+ int outlen,
+ tmplen;
+
+ if (len <= 0)
+ return;
+
+ /*
+ * cipher_ctx is pre-allocated and initialized with cipher/key in _PG_init().
+ * Here we only set IV (cipher=NULL, key=NULL), which avoids internal
+ * memory allocation. This is critical for WAL encryption which runs
+ * inside critical sections. We use PANIC for all errors.
+ */
+ if (cipher_ctx == NULL)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: cipher context not initialized")));
+
+ if (EVP_EncryptInit_ex(cipher_ctx, NULL, NULL, NULL, iv) != 1)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: EVP_EncryptInit_ex failed: %s",
+ ERR_error_string(ERR_get_error(), NULL))));
+
+ if (EVP_EncryptUpdate(cipher_ctx, out, &outlen, in, len) != 1)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: EVP_EncryptUpdate failed: %s",
+ ERR_error_string(ERR_get_error(), NULL))));
+
+ if (EVP_EncryptFinal_ex(cipher_ctx, out + outlen, &tmplen) != 1)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: EVP_EncryptFinal_ex failed: %s",
+ ERR_error_string(ERR_get_error(), NULL))));
+}
+
+/*
+ * Check if we should encrypt/decrypt this relation
+ *
+ * For this test implementation, we encrypt only user-created relations.
+ * A production implementation would check encryption policies.
+ */
+static bool
+should_transform(RelFileLocator *rlocator, ForkNumber forknum)
+{
+ /* Skip if cipher not initialized (key not configured) */
+ if (cipher_ctx == NULL)
+ return false;
+
+ /* Skip system catalog tablespace (pg_global) */
+ if (rlocator->spcOid == GLOBALTABLESPACE_OID)
+ return false;
+
+ /*
+ * Skip relations whose RelFileNumber is below FirstNormalObjectId. These
+ * files were created by initdb (mostly system catalogs) without
+ * encryption, so we must not try to encrypt/decrypt their pages.
+ */
+ if (rlocator->relNumber < FirstNormalObjectId)
+ return false;
+
+ (void) forknum; /* all forks are encrypted for user tables */
+
+ return true;
+}
+
+
+/* ----------
+ * Hook functions - Page I/O
+ * ----------
+ */
+
+/*
+ * Post-read hook: decrypt blocks after reading from disk
+ */
+static void
+test_tde_mdread_post(RelFileLocator *rlocator, ForkNumber forknum,
+ BlockNumber blocknum, void **buffers,
+ BlockNumber nblocks)
+{
+ BlockNumber i;
+ unsigned char iv[16];
+
+ /* Chain to previous hook if any */
+ if (prev_mdread_post_hook)
+ prev_mdread_post_hook(rlocator, forknum, blocknum, buffers, nblocks);
+
+ for (i = 0; i < nblocks; i++)
+ {
+ PageHeader phdr = (PageHeader) buffers[i];
+ uint16 checksum;
+ uint8 transform_id;
+
+ /* Skip empty/new pages */
+ if (PageIsNew((Page) buffers[i]))
+ continue;
+
+ /* Skip if page doesn't look valid */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ continue;
+
+ /* Check transform ID - skip if page is not encrypted by us */
+ transform_id = PageGetTransformId((Page) buffers[i]);
+ if (transform_id == PD_TRANSFORM_NONE)
+ continue; /* Page is not encrypted */
+
+ if (transform_id != TEST_TDE_TRANSFORM_ID)
+ {
+ elog(DEBUG1, "test_tde: skipping block %u with transform ID %u (not ours)",
+ blocknum + i, transform_id);
+ continue;
+ }
+
+ /* Page is encrypted but cipher not initialized - fatal error */
+ if (cipher_ctx == NULL)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: encrypted page found but encryption key not configured"),
+ errdetail("Block %u of relation %u/%u/%u fork %d has transform ID %u.",
+ blocknum + i, rlocator->spcOid, rlocator->dbOid,
+ rlocator->relNumber, forknum, transform_id)));
+
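+ /*
+ * Verify checksum on encrypted data before decryption. A mismatch is only
+ * reported as a WARNING: the checksum is recomputed for the plaintext
+ * below, so the buffer manager's later verification will not catch it.
+ */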
+ if (DataChecksumsEnabled())
+ {
+ checksum = pg_checksum_page((char *) buffers[i], blocknum + i);
+ if (checksum != phdr->pd_checksum)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("page verification failed, calculated checksum %u but expected %u",
+ checksum, phdr->pd_checksum)));
+ }
+ }
+
+ /* Derive IV using LSN from page header */
+ derive_iv(iv, rlocator, blocknum + i, PageGetLSN((Page) buffers[i]));
+
+ /* Decrypt data area in place (header stays unchanged) */
+ transform_data((unsigned char *) buffers[i] + SizeOfPageHeaderData,
+ (unsigned char *) buffers[i] + SizeOfPageHeaderData,
+ BLCKSZ - SizeOfPageHeaderData, iv);
+
+ /* Clear transform ID and recalculate checksum for plaintext data */
+ PageSetTransformId((Page) buffers[i], PD_TRANSFORM_NONE);
+ PageSetChecksumInplace((Page) buffers[i], blocknum + i);
+ }
+}
+
+/*
+ * Helper: encrypt a single page into the encrypt_buffer at given offset.
+ * Returns pointer to encrypted page, or original buffer if page was skipped.
+ */
+static const void *
+encrypt_page(RelFileLocator *rlocator, BlockNumber blocknum,
+ const void *buffer, Size buffer_offset)
+{
+ unsigned char iv[16];
+ PageHeader phdr = (PageHeader) buffer;
+ char *dest = encrypt_buffer + buffer_offset;
+
+ /* Skip empty/new pages */
+ if (PageIsNew((Page) buffer))
+ return buffer;
+
+ /* Skip if page doesn't look valid */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ return buffer;
+
+ /* Derive IV using LSN from page header */
+ derive_iv(iv, rlocator, blocknum, PageGetLSN((Page) buffer));
+
+ /* Copy header, encrypt data area */
+ memcpy(dest, buffer, SizeOfPageHeaderData);
+ transform_data((unsigned char *) buffer + SizeOfPageHeaderData,
+ (unsigned char *) dest + SizeOfPageHeaderData,
+ BLCKSZ - SizeOfPageHeaderData, iv);
+
+ /* Set transform ID to mark page as encrypted */
+ PageSetTransformId((Page) dest, TEST_TDE_TRANSFORM_ID);
+
+ /* Recalculate checksum for encrypted data */
+ PageSetChecksumInplace((Page) dest, blocknum);
+
+ return dest;
+}
+
+/*
+ * Pre-write hook: encrypt blocks before writing to disk
+ */
+static const void **
+test_tde_mdwrite_pre(RelFileLocator *rlocator, ForkNumber forknum,
+ BlockNumber blocknum, const void **buffers,
+ BlockNumber nblocks)
+{
+ BlockNumber i;
+
+ /* Chain to previous hook if any */
+ if (prev_mdwrite_pre_hook)
+ buffers = prev_mdwrite_pre_hook(rlocator, forknum, blocknum, buffers, nblocks);
+
+ if (!should_transform(rlocator, forknum))
+ return buffers;
+
+ /* Ensure buffer is large enough */
+ ensure_encrypt_buffer(nblocks);
+
+ for (i = 0; i < nblocks; i++)
+ encrypt_buffer_ptrs[i] = encrypt_page(rlocator, blocknum + i,
+ buffers[i], (Size) i * BLCKSZ);
+
+ return encrypt_buffer_ptrs;
+}
+
+/*
+ * Pre-extend hook: encrypt block before extending relation
+ */
+static const void *
+test_tde_mdextend_pre(RelFileLocator *rlocator, ForkNumber forknum,
+ BlockNumber blocknum, const void *buffer)
+{
+ /* Chain to previous hook if any */
+ if (prev_mdextend_pre_hook)
+ buffer = prev_mdextend_pre_hook(rlocator, forknum, blocknum, buffer);
+
+ if (!should_transform(rlocator, forknum))
+ return buffer;
+
+ /* Ensure buffer is large enough for at least 1 block */
+ ensure_encrypt_buffer(1);
+
+ return encrypt_page(rlocator, blocknum, buffer, 0);
+}
+
+
+/* ----------
+ * Hook functions - WAL I/O
+ * ----------
+ */
+
+/*
+ * Ensure WAL encryption buffer is large enough.
+ * Uses test_tde_cxt which allows allocation in critical sections.
+ */
+static void
+ensure_wal_encrypt_buffer(Size needed)
+{
+ if (wal_encrypt_buffer_size >= needed)
+ return;
+
+ if (wal_encrypt_buffer == NULL)
+ wal_encrypt_buffer = MemoryContextAlloc(test_tde_cxt, needed);
+ else
+ wal_encrypt_buffer = repalloc(wal_encrypt_buffer, needed);
+ wal_encrypt_buffer_size = needed;
+}
+
+/*
+ * WAL insert pre-hook: encrypt WAL record data
+ *
+ * Strategy:
+ * 1. Copy XLogRecord header and payload
+ * 2. Save plaintext CRC from header (xl_crc contains payload CRC at this point)
+ * 3. Build IV: [plaintext CRC (4B)] [random (12B)]
+ * 4. Insert transformation header (block ID 251 + payload_length) and IV
+ * 5. Encrypt original payload with the IV
+ * 6. Update xl_tot_len and recalculate CRC for encrypted payload
+ *
+ * Resulting record structure:
+ * [XLogRecord header]
+ * [block_id=251 (1B)]
+ * [payload_length (4B)]
+ * [IV 16B]
+ * [encrypted payload]
+ *
+ * The block ID 251 marks this record as encrypted. After decryption,
+ * the marker, length, and IV are removed, restoring the original structure.
+ * If decryption is not performed, the unknown block ID causes parse failure.
+ */
+static struct XLogRecData *
+test_tde_xlog_insert_pre(struct XLogRecData *rdata)
+{
+ XLogRecData *node;
+ XLogRecord *rechdr;
+ char *bufptr;
+ char *new_payload_start;
+ uint32 orig_total_len;
+ uint32 orig_payload_len;
+ uint32 new_total_len;
+ uint32 transform_payload_len;
+ unsigned char iv[WAL_ENCRYPT_IV_SIZE];
+ pg_crc32c plaintext_crc;
+
+ /* Chain to previous hook if any */
+ if (prev_xlog_insert_pre_hook)
+ rdata = prev_xlog_insert_pre_hook(rdata);
+
+ /* Skip if cipher not initialized (key not configured) */
+ if (cipher_ctx == NULL)
+ return rdata;
+
+ /* First node must contain XLogRecord header */
+ if (rdata == NULL || rdata->data == NULL || rdata->len < SizeOfXLogRecord)
+ return rdata;
+
+ rechdr = (XLogRecord *) rdata->data;
+ orig_total_len = rechdr->xl_tot_len;
+ orig_payload_len = orig_total_len - SizeOfXLogRecord;
+
+ /* Sanity check */
+ if (orig_total_len < SizeOfXLogRecord)
+ return rdata;
+
+ /*
+ * Skip records with no payload (e.g., XLOG_SWITCH). These are header-only
+ * records where adding encryption overhead would break size assertions.
+ */
+ if (orig_payload_len == 0)
+ return rdata;
+
+ new_total_len = orig_total_len + WAL_ENCRYPT_OVERHEAD;
+
+ /*
+ * Save plaintext CRC before we modify anything.
+ * At this point, xl_crc contains the CRC of the payload only
+ * (header CRC is added later by XLogInsertRecord).
+ */
+ plaintext_crc = rechdr->xl_crc;
+
+ /*
+ * Ensure buffer is large enough. test_tde_cxt allows allocation in
+ * critical sections, so this is safe even during WAL insertion.
+ * OOM here will cause PANIC, which is acceptable for critical sections.
+ */
+ ensure_wal_encrypt_buffer(new_total_len);
+
+ /*
+ * Build IV: [plaintext CRC (4B)] [random (12B)]
+ * Store CRC directly in IV[0..3] (little-endian).
+ */
+ iv[0] = ((uint32) plaintext_crc >> 0) & 0xFF;
+ iv[1] = ((uint32) plaintext_crc >> 8) & 0xFF;
+ iv[2] = ((uint32) plaintext_crc >> 16) & 0xFF;
+ iv[3] = ((uint32) plaintext_crc >> 24) & 0xFF;
+
+ /* Generate random bytes for IV[4..15] (12 bytes) for uniqueness */
+ if (!pg_strong_random(iv + WAL_CRC_SIZE, WAL_IV_RANDOM_SIZE))
+ {
+ ereport(WARNING,
+ (errmsg("test_tde: failed to generate random IV for WAL")));
+ return rdata;
+ }
+
+ /*
+ * Build encrypted record in buffer:
+ * [header][block_id][payload_length][IV][encrypted_payload]
+ */
+ bufptr = wal_encrypt_buffer;
+
+ /* 1. Copy header from first rdata node */
+ memcpy(bufptr, rdata->data, SizeOfXLogRecord);
+ bufptr += SizeOfXLogRecord;
+
+ /* 2. Insert transformation header (block ID 251 + payload_length) */
+ new_payload_start = bufptr;
+ *bufptr = (char) XLR_BLOCK_ID_TRANSFORMED;
+ bufptr += sizeof(uint8);
+
+ /* Calculate payload_length: IV + encrypted payload */
+ transform_payload_len = WAL_ENCRYPT_IV_SIZE + orig_payload_len;
+
+ /* Store payload_length (4 bytes, unaligned, little-endian) */
+ bufptr[0] = (char) ((transform_payload_len >> 0) & 0xFF);
+ bufptr[1] = (char) ((transform_payload_len >> 8) & 0xFF);
+ bufptr[2] = (char) ((transform_payload_len >> 16) & 0xFF);
+ bufptr[3] = (char) ((transform_payload_len >> 24) & 0xFF);
+ bufptr += sizeof(uint32);
+
+ /* 3. Insert IV (CRC in first 4 bytes, random in remaining 12) */
+ memcpy(bufptr, iv, WAL_ENCRYPT_IV_SIZE);
+ bufptr += WAL_ENCRYPT_IV_SIZE;
+
+ /* 4. Copy payload to buffer, then encrypt in-place */
+ if (orig_payload_len > 0)
+ {
+ Size first_node_payload;
+ char *encrypt_start = bufptr;
+
+ /* First node: skip header, copy remaining payload */
+ first_node_payload = rdata->len - SizeOfXLogRecord;
+ if (first_node_payload > 0)
+ {
+ memcpy(bufptr, (char *) rdata->data + SizeOfXLogRecord, first_node_payload);
+ bufptr += first_node_payload;
+ }
+
+ /* Remaining nodes: copy all data */
+ for (node = rdata->next; node != NULL; node = node->next)
+ {
+ if (node->len > 0 && node->data != NULL)
+ {
+ memcpy(bufptr, node->data, node->len);
+ bufptr += node->len;
+ }
+ }
+
+ /* Encrypt payload in-place */
+ transform_data((unsigned char *) encrypt_start,
+ (unsigned char *) encrypt_start,
+ orig_payload_len, iv);
+ }
+
+ /* Update header with new total length */
+ rechdr = (XLogRecord *) wal_encrypt_buffer;
+ rechdr->xl_tot_len = new_total_len;
+
+ /*
+ * Recalculate CRC for the new payload (marker + length + IV + encrypted data).
+ * The header CRC will be added by XLogInsertRecord later.
+ */
+ {
+ pg_crc32c crc;
+
+ INIT_CRC32C(crc);
+ COMP_CRC32C(crc, new_payload_start, new_total_len - SizeOfXLogRecord);
+ rechdr->xl_crc = crc;
+ }
+
+ /* Return single XLogRecData pointing to our encrypted buffer */
+ wal_rdata_head.next = NULL;
+ wal_rdata_head.data = wal_encrypt_buffer;
+ wal_rdata_head.len = new_total_len;
+
+ return &wal_rdata_head;
+}
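+
+/*
+ * Summary of how xl_crc evolves for an encrypted record (derived from the
+ * comments in the insert hook above and the decode hook below):
+ * 1. Before the insert hook: CRC of the plaintext payload, not finalized.
+ * 2. After the insert hook: that value is kept in IV[0..3], and xl_crc is
+ *    the CRC of marker + length + IV + ciphertext, not finalized.
+ * 3. XLogInsertRecord() adds the header bytes up to xl_crc and finalizes.
+ * 4. The decode hook restarts from IV[0..3], adds the restored header and
+ *    finalizes, so ValidXLogRecord() checks the decrypted payload against
+ *    the original plaintext CRC.
+ */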
+
+/*
+ * WAL decode pre-hook: decrypt WAL record data
+ *
+ * This reverses the encryption done in xlog_insert_pre_hook.
+ * Checks for block ID 251 marker to identify encrypted records.
+ *
+ * Input: [header] [block_id=251 (1B)] [payload_length (4B)] [IV 16B] [encrypted payload]
+ * Output: [header] [original payload] (shorter by 21 bytes)
+ *
+ * Recovery process:
+ * 1. Check for encryption marker (block ID 251)
+ * 2. Read payload_length from transform header
+ * 3. Extract IV for decryption
+ * 4. Decrypt payload using IV
+ * 5. Extract plaintext payload CRC from IV[0..3]
+ * 6. Restore original record structure
+ *
+ * If the marker is not found, record is not encrypted (pass through).
+ * If inplace_allowed, decrypts in place. Otherwise, copies to static buffer.
+ */
+static XLogRecord *
+test_tde_xlog_decode_pre(XLogReaderState *state, XLogRecord *record,
+ XLogRecPtr lsn, bool inplace_allowed)
+{
+ uint32 total_len;
+ uint32 transform_payload_len;
+ uint32 encrypted_payload_len;
+ unsigned char iv[WAL_ENCRYPT_IV_SIZE];
+ char *payload_start;
+ char *len_ptr;
+ XLogRecord *work_record;
+
+ /* Chain to previous hook if any */
+ if (prev_xlog_decode_pre_hook)
+ record = prev_xlog_decode_pre_hook(state, record, lsn, inplace_allowed);
+
+ if (record == NULL)
+ return record;
+
+ total_len = record->xl_tot_len;
+
+ /* Must have at least header + transform header + IV */
+ if (total_len < SizeOfXLogRecord + WAL_ENCRYPT_OVERHEAD)
+ return record;
+
+ /* Check for transformation marker (block ID 251) */
+ payload_start = (char *) record + SizeOfXLogRecord;
+ if ((unsigned char) *payload_start != XLR_BLOCK_ID_TRANSFORMED)
+ return record; /* Not transformed, pass through */
+
+ /* WAL is encrypted but cipher not initialized - fatal error */
+ if (cipher_ctx == NULL)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: encrypted WAL record found but encryption key not configured"),
+ errdetail("WAL record at LSN %X/%X has transformation marker.",
+ LSN_FORMAT_ARGS(lsn))));
+
+ /*
+ * If inplace modification allowed, work directly on record. Otherwise,
+ * copy to static buffer (record fits in single page).
+ */
+ if (inplace_allowed)
+ {
+ work_record = record;
+ }
+ else
+ {
+ /* Record within single page, must fit in XLOG_BLCKSZ */
+ if (total_len > XLOG_BLCKSZ)
+ {
+ ereport(WARNING,
+ (errmsg("test_tde: WAL record too large for decryption buffer")));
+ return record;
+ }
+ memcpy(wal_decrypt_buffer, record, total_len);
+ work_record = (XLogRecord *) wal_decrypt_buffer;
+ }
+
+ /* Recalculate payload_start for work_record */
+ payload_start = (char *) work_record + SizeOfXLogRecord;
+
+ /* Read payload_length from transform header (4 bytes, unaligned, little-endian) */
+ len_ptr = payload_start + sizeof(uint8);
+ transform_payload_len = ((uint32) (unsigned char) len_ptr[0] << 0) |
+ ((uint32) (unsigned char) len_ptr[1] << 8) |
+ ((uint32) (unsigned char) len_ptr[2] << 16) |
+ ((uint32) (unsigned char) len_ptr[3] << 24);
+
+ /* Validate payload_length */
+ if (transform_payload_len < WAL_ENCRYPT_IV_SIZE ||
+ transform_payload_len > total_len - SizeOfXLogRecord - SizeOfXLogRecordDataHeaderLong)
+ {
+ ereport(WARNING,
+ (errmsg("test_tde: invalid transform payload length %u at LSN %X/%X",
+ transform_payload_len, LSN_FORMAT_ARGS(lsn))));
+ return record;
+ }
+
+ /* Extract IV (after transform header) */
+ memcpy(iv, payload_start + SizeOfXLogRecordDataHeaderLong, WAL_ENCRYPT_IV_SIZE);
+
+ /* Encrypted payload length = transform_payload_len - IV */
+ encrypted_payload_len = transform_payload_len - WAL_ENCRYPT_IV_SIZE;
+
+ /*
+ * Decrypt the payload and move it down over the transform header and IV.
+ * OpenSSL requires EVP input and output buffers to be either identical or
+ * disjoint, and here they would partially overlap (dest = source - 21), so
+ * decrypt in place first and then shift the plaintext with memmove().
+ */
+ if (encrypted_payload_len > 0)
+ {
+ char *enc_start = payload_start + WAL_ENCRYPT_OVERHEAD;
+
+ transform_data((unsigned char *) enc_start,
+ (unsigned char *) enc_start,
+ encrypted_payload_len, iv);
+ memmove(payload_start, enc_start, encrypted_payload_len);
+ }
+
+ /* Update header with original length (transform header and IV removed) */
+ work_record->xl_tot_len = SizeOfXLogRecord + encrypted_payload_len;
+
+ /*
+ * Recover plaintext payload CRC from IV[0..3] (little-endian).
+ */
+ {
+ pg_crc32c recovered_payload_crc;
+ pg_crc32c full_crc;
+
+ /* Extract CRC directly from IV[0..3] */
+ recovered_payload_crc = (pg_crc32c) (((uint32) iv[0] << 0) |
+ ((uint32) iv[1] << 8) |
+ ((uint32) iv[2] << 16) |
+ ((uint32) iv[3] << 24));
+
+ /*
+ * For ValidXLogRecord(), we need CRC of: payload + header (up to xl_crc)
+ * The recovered CRC is payload-only, so add header portion.
+ */
+ full_crc = recovered_payload_crc;
+ COMP_CRC32C(full_crc, (char *) work_record, offsetof(XLogRecord, xl_crc));
+ FIN_CRC32C(full_crc);
+ work_record->xl_crc = full_crc;
+ }
+
+ return work_record;
+}
+
+
+/* ----------
+ * GUC callbacks
+ * ----------
+ */
+
+/*
+ * GUC check hook for key
+ */
+static bool
+check_test_tde_key(char **newval, void **extra, GucSource source)
+{
+ if (*newval == NULL || strlen(*newval) == 0)
+ return true;
+
+ if (strlen(*newval) != 64)
+ {
+ GUC_check_errdetail("Key must be exactly 64 hex characters (256 bits).");
+ return false;
+ }
+
+ /* Validate hex characters */
+ for (int i = 0; i < 64; i++)
+ {
+ char c = (*newval)[i];
+
+ if (!((c >= '0' && c <= '9') ||
+ (c >= 'a' && c <= 'f') ||
+ (c >= 'A' && c <= 'F')))
+ {
+ GUC_check_errdetail("Key must contain only hex characters (0-9, a-f, A-F).");
+ return false;
+ }
+ }
+
+ return true;
+}
+
+/* ----------
+ * Module entry points
+ * ----------
+ */
+
+/*
+ * Module initialization
+ */
+void
+_PG_init(void)
+{
+ unsigned char key[32];
+
+ /*
+ * Create memory context for encryption buffers and allow allocation
+ * in critical sections. This is necessary because WAL encryption runs
+ * inside critical sections, and OOM there will cause PANIC anyway.
+ */
+ test_tde_cxt = AllocSetContextCreate(TopMemoryContext,
+ "test_tde",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(test_tde_cxt, true);
+
+ /*
+ * Define GUC for encryption key.
+ *
+ * PGC_POSTMASTER: Key can only be set at server start to prevent
+ * accidental runtime changes.
+ *
+ * WARNING: Once data is encrypted with a key, that same key MUST be used
+ * for the lifetime of the data. Changing the key (even across restarts)
+ * will cause decryption failures and data corruption. This reference
+ * implementation does not support key rotation.
+ */
+ DefineCustomStringVariable("test_tde.key",
+ "Encryption key in hex format (64 characters = 256 bits).",
+ "WARNING: Key must never change once data is encrypted!",
+ &test_tde_key_hex,
+ "",
+ PGC_POSTMASTER,
+ GUC_SUPERUSER_ONLY,
+ check_test_tde_key,
+ NULL,
+ NULL);
+
+ MarkGUCPrefixReserved("test_tde");
+
+ /*
+ * Parse key and initialize cipher context if key is configured.
+ * cipher_ctx remains NULL if no key is set, disabling encryption.
+ */
+ if (test_tde_key_hex != NULL && strlen(test_tde_key_hex) == 64)
+ {
+ if (!parse_hex_key(test_tde_key_hex, key, 32))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("test_tde: failed to parse encryption key")));
+
+ cipher_ctx = EVP_CIPHER_CTX_new();
+ if (!cipher_ctx)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("test_tde: failed to create cipher context")));
+
+ if (EVP_EncryptInit_ex(cipher_ctx, EVP_aes_256_ctr(), NULL, key, NULL) != 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: failed to initialize cipher context")));
+
+ /* Clear key from stack */
+ explicit_bzero(key, sizeof(key));
+ }
+
+ /* Install hooks (save previous values for chaining) */
+ prev_mdread_post_hook = mdread_post_hook;
+ mdread_post_hook = test_tde_mdread_post;
+
+ prev_mdwrite_pre_hook = mdwrite_pre_hook;
+ mdwrite_pre_hook = test_tde_mdwrite_pre;
+
+ prev_mdextend_pre_hook = mdextend_pre_hook;
+ mdextend_pre_hook = test_tde_mdextend_pre;
+
+ prev_xlog_insert_pre_hook = xlog_insert_pre_hook;
+ xlog_insert_pre_hook = test_tde_xlog_insert_pre;
+
+ prev_xlog_decode_pre_hook = xlog_decode_pre_hook;
+ xlog_decode_pre_hook = test_tde_xlog_decode_pre;
+
+ ereport(LOG,
+ (errmsg("test_tde: initialized (WARNING: for testing only!)")));
+}
+
+/*
+ * Module finalization (PostgreSQL 15 and later never call _PG_fini)
+ */
+void
+_PG_fini(void)
+{
+ /* Restore previous hooks */
+ xlog_decode_pre_hook = prev_xlog_decode_pre_hook;
+ xlog_insert_pre_hook = prev_xlog_insert_pre_hook;
+ mdextend_pre_hook = prev_mdextend_pre_hook;
+ mdwrite_pre_hook = prev_mdwrite_pre_hook;
+ mdread_post_hook = prev_mdread_post_hook;
+
+ /* Free OpenSSL cipher context (also clears key material) */
+ if (cipher_ctx != NULL)
+ {
+ EVP_CIPHER_CTX_free(cipher_ctx);
+ cipher_ctx = NULL;
+ }
+
+ /*
+ * Delete memory context - this frees all buffers allocated from it
+ * (encrypt_buffer, encrypt_buffer_ptrs, wal_encrypt_buffer).
+ */
+ if (test_tde_cxt != NULL)
+ {
+ MemoryContextDelete(test_tde_cxt);
+ test_tde_cxt = NULL;
+ }
+
+ /* Reset buffer pointers */
+ encrypt_buffer = NULL;
+ encrypt_buffer_ptrs = NULL;
+ encrypt_buffer_nblocks = 0;
+ wal_encrypt_buffer = NULL;
+ wal_encrypt_buffer_size = 0;
+}
diff --git a/contrib/test_tde/test_tde.conf b/contrib/test_tde/test_tde.conf
new file mode 100644
index 00000000000..0b00366474c
--- /dev/null
+++ b/contrib/test_tde/test_tde.conf
@@ -0,0 +1,2 @@
+shared_preload_libraries = 'test_tde'
+test_tde.key = '0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef'
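Some notes on the WAL decode hook in the patch above. The encrypted record body starts with a 5-byte header, sized and laid out like XLogRecordDataHeaderLong: a 1-byte marker followed by an unaligned 4-byte little-endian length. The IV comes next, and its first four bytes carry the little-endian CRC of the plaintext payload. The standalone sketch below packs and unpacks such a header and applies the same length bounds as the hook. It assumes a 16-byte IV, which is consistent with the 21-byte offset mentioned in the code comment. The caller supplies the marker value and the remaining 12 IV bytes. All names here are hypothetical, not the patch's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes: 1-byte marker + 4-byte length, then a 16-byte IV. */
#define SKETCH_HDR_SIZE  5
#define SKETCH_IV_SIZE   16
#define SKETCH_OVERHEAD  (SKETCH_HDR_SIZE + SKETCH_IV_SIZE)   /* 21 bytes */

/* Store a uint32 as 4 little-endian bytes (no alignment requirement). */
static void put_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char) (v & 0xFF);
    p[1] = (unsigned char) ((v >> 8) & 0xFF);
    p[2] = (unsigned char) ((v >> 16) & 0xFF);
    p[3] = (unsigned char) ((v >> 24) & 0xFF);
}

/* Read 4 little-endian bytes back into a uint32. */
static uint32_t get_le32(const unsigned char *p)
{
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
           ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/*
 * Write marker, length and IV into buf (at least SKETCH_OVERHEAD bytes).
 * The length field counts the IV plus the encrypted bytes, as in the hook.
 * The plaintext CRC goes into IV[0..3]; the other 12 IV bytes come from
 * the caller.
 */
static void pack_transform_header(unsigned char *buf, uint8_t marker,
                                  uint32_t encrypted_len, uint32_t plain_crc,
                                  const unsigned char iv_tail[SKETCH_IV_SIZE - 4])
{
    buf[0] = marker;
    put_le32(buf + 1, SKETCH_IV_SIZE + encrypted_len);
    put_le32(buf + SKETCH_HDR_SIZE, plain_crc);
    memcpy(buf + SKETCH_HDR_SIZE + 4, iv_tail, SKETCH_IV_SIZE - 4);
}

/*
 * Parse the header. avail is the number of bytes after the XLogRecord
 * header. Returns 0 on success, -1 if the length field is out of range.
 */
static int unpack_transform_header(const unsigned char *buf, size_t avail,
                                   uint32_t *encrypted_len, uint32_t *plain_crc)
{
    uint32_t payload_len;

    if (avail < SKETCH_HDR_SIZE)
        return -1;
    payload_len = get_le32(buf + 1);
    /* Same bounds as the hook: at least an IV, no more than is present. */
    if (payload_len < SKETCH_IV_SIZE || payload_len > avail - SKETCH_HDR_SIZE)
        return -1;
    *encrypted_len = payload_len - SKETCH_IV_SIZE;
    *plain_crc = get_le32(buf + SKETCH_HDR_SIZE);
    return 0;
}
```

The hook reads these fields byte by byte, and the sketch does the same. This avoids any alignment assumption about where the header lands inside the record.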
Subject: Re: RFC: PostgreSQL Storage I/O Transformation Hooks
Hi Konstantin,
I have great respect for the work being done on the extensible SMGR API.
It is a powerful tool for use cases that require replacing the entire
storage layer (like Neon's architecture).
However, I believe we should distinguish between Storage Management
(where/how data is stored) and Data Transformation (what the data looks
like). I see a strong case for both approaches to coexist for the
following practical reasons:
1. Separation of Concerns and Safety
Is it reasonable to ask cryptography experts to clone the entire SMGR
implementation and maintain code they don't fully understand just to
insert encryption logic? If an extension developer clones md.c to add
encryption, they become responsible for the fundamental integrity of
PostgreSQL's file I/O. Any bug in their cloned storage logic could lead
to data loss unrelated to encryption itself.
2. The Maintenance Debt of "Cloning"
When md.c receives critical security patches or bug fixes in the core,
every TDE extension maintainer would need to manually backport those
changes to their specific SMGR implementation. This creates a fragmented
ecosystem where security extensions might actually introduce storage
vulnerabilities by running outdated cloned logic.
3. Minimalist Integration
The hook approach allows crypto experts to focus strictly on transform()
and reverse_transform(). The complex storage orchestration remains with
the PostgreSQL core where it is most rigorously tested. This is a cleaner
separation of responsibilities: the core provides the trusted pipeline,
and the extension provides the specialized transformation.
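The transform()/reverse_transform() split in point 3 can be sketched with the same save-the-previous-hook-and-chain convention that test_tde's _PG_init() uses. The hook type, its signature and both toy transforms below are hypothetical, because the real mdwrite_pre_hook/mdread_post_hook signatures are not shown in this message. Neither transform is encryption. The sketch is about ordering. If each write hook lets older hooks run first and then applies its own transform, each read hook must undo its own transform first and then call the older hook. Chained extensions therefore unwind in mirror order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook signature: transform a block image in place. */
typedef void (*block_hook_type)(unsigned char *buf, size_t len, uint32_t blkno);

/* Hook variables the core would expose; NULL means "no transformation". */
static block_hook_type write_pre_hook = NULL;
static block_hook_type read_post_hook = NULL;

/* Extension A: toy "add 3" transform (placeholder, not a cipher). */
static block_hook_type prev_write_a, prev_read_a;

static void a_write(unsigned char *buf, size_t len, uint32_t blkno)
{
    if (prev_write_a)
        prev_write_a(buf, len, blkno);      /* let older hooks transform first */
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char) (buf[i] + 3);
}

static void a_read(unsigned char *buf, size_t len, uint32_t blkno)
{
    for (size_t i = 0; i < len; i++)        /* undo our own transform first */
        buf[i] = (unsigned char) (buf[i] - 3);
    if (prev_read_a)
        prev_read_a(buf, len, blkno);
}

/* Extension B: toy "xor 0x0F" transform (placeholder, not a cipher). */
static block_hook_type prev_write_b, prev_read_b;

static void b_write(unsigned char *buf, size_t len, uint32_t blkno)
{
    if (prev_write_b)
        prev_write_b(buf, len, blkno);
    for (size_t i = 0; i < len; i++)
        buf[i] ^= 0x0F;
}

static void b_read(unsigned char *buf, size_t len, uint32_t blkno)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= 0x0F;
    if (prev_read_b)
        prev_read_b(buf, len, blkno);
}

/* Install hooks the way test_tde's _PG_init() does: save previous, then replace. */
static void load_a(void)
{
    prev_write_a = write_pre_hook;
    write_pre_hook = a_write;
    prev_read_a = read_post_hook;
    read_post_hook = a_read;
}

static void load_b(void)
{
    prev_write_b = write_pre_hook;
    write_pre_hook = b_write;
    prev_read_b = read_post_hook;
    read_post_hook = b_read;
}
```

With both extensions loaded, a page written through the chain becomes B(A(plain)), and reading it back returns the original bytes. Swapping the order inside either read hook would break the round trip, because the two transforms do not commute.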
Conclusion:
I believe these hooks provide a "low-barrier, high-safety" path for data
transformation that the SMGR API—by its very nature of being a full
replacement—cannot easily provide. Let's provide the SMGR for those who
want to reinvent the storage, and hooks for those who simply want to
secure the data.
Best regards,
Henson Choi
On Sun, Dec 28, 2025 at 9:11 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
I wonder if, instead of supporting a lot of extra hooks, it would be better to
provide an extensible SMGR API: /messages/by-id/CAPP=Hha_wV1MV9yR70QZ5pk5dtNP+bOyBiFxPmrMKqnQeKMAwQ@mail.gmail.com
It seems to be a much more straightforward, convenient and flexible
mechanism than adding hooks, and it can be used for many purposes other
than transparent encryption.
Is it reasonable to ask cryptography experts to clone the entire SMGR
implementation and maintain code they don't fully understand just to
insert encryption logic?
You don't have to clone the md.c logic with the recent smgr extension
patch; it does the same thing your patch does: it lets you hook into
it while still keeping the original md.c implementation. The
difference is that it doesn't add additional hooks to the API, instead
it makes all of the existing smgr/md.c functions hooks.
This also means that it lets different extensions work together in a
more generic way. For example an extension that wants to retrieve data
files from cloud storage when needed (prepending the original md.c
logic), and an encryption extension that wants to decrypt data after
loading it (appending to the original md.c logic) can both work
together while keeping the original logic in place.
Or if it's about mdwritev, in this patch you added a new
mdwrite_pre_hook - but it is executed at a specific point during
mdwrite. In the generic smgr patch, mdwritev itself (or smgr_writev
more specifically) is a hook, you can change it, and then call the
previous implementation (typically mdwritev) when you want it, either
before or after your custom code.
(the latest submitted version of the smgr patch doesn't use typical
postgres-style hooks, but that's one of the things we probably should
change. The intention is the same)
There's no maintenance fee of cloning, because neither extension
cloned the original md.c logic, both extended it.
Subject: Re: RFC: PostgreSQL Storage I/O Transformation Hooks
Hi Zsolt,
Thank you for your detailed questions. I'll address each point:
1. Bundling WAL and Buffer Manager
WAL and heap pages are simply different representations of the same
underlying data. Protecting only one side would be cryptographically
incomplete; an attacker could bypass encryption by reading the
unprotected side. Therefore, they must be treated as a single atomic
unit of protection.
2. Scope: Temporary Files, System Tables, and Frontend Tools
I intentionally kept the scope focused. Past TDE proposals often stalled
because they tried to solve everything at once, becoming too large to
review. I prefer a "divide-and-conquer" approach:
- Temporary files: Out of scope for this initial infrastructure proposal.
- System tables: While they cannot be encrypted during bootstrap (since
extensions aren't loaded), they can be transformed page-by-page during
normal operation.
- Frontend tools (pg_waldump, etc.): I am aware of this and have modified
versions. Currently, there is no standard mechanism for frontend hooks,
making this a broader challenge. For production, extensions could ship
their own modified frontend tools temporarily. Long-term, we may need
initdb-time configuration, fixed for the lifetime of the cluster, to
unify backend and frontend hook behavior.
3. Why Hooks Instead of SMGR
Please see my response to Konstantin in this thread regarding maintenance
debt and the "Separation of Concerns" between storage management and data
transformation.
4. Page Header Flags vs. Fork Files
My primary concern with using fork files for encryption metadata is crash
recovery. If a fork file and the actual data page become inconsistent
(e.g., during a crash), recovery becomes problematic because fork files
are not typically protected by WAL.
Storing the Transform ID in the header flags ensures that the metadata
travels with the page. This is essential for incremental key rotation,
where pages are gradually re-encrypted with newer keys over time. The
oldest key's pages are force-rotated, allowing continuous key rotation
without service interruption. I plan to propose a separate RFC for this
"gradual rotation" mechanism.
5. Benchmarks and Critical Section Overhead
Transformation happens inside the critical section but before acquiring
the WAL lock. On consumer-grade SSDs, the encryption latency is largely
masked by I/O wait times with negligible performance impact. On
high-performance storage (production SSDs, Apple Silicon, etc.), the
reduced I/O wait exposes the encryption overhead, which is visible but
modest. Detailed benchmarks require company approval - I will follow up
later.
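Point 4 relies on a Transformation ID carried in pd_flags, and the RFC allocates 5 bits for it. The sketch below shows one way to read and write such a field without disturbing PostgreSQL's existing PD_HAS_FREE_LINES, PD_PAGE_FULL and PD_ALL_VISIBLE bits. It also shows how the ID could double as a key version for the oldest-key-first rotation described above. Several details are assumptions: the bit position (bits 3 to 7), the reserved value 0 for untransformed pages, and all function names. The patch's actual layout is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Existing PostgreSQL pd_flags bits (bufpage.h). */
#define PD_HAS_FREE_LINES  0x0001
#define PD_PAGE_FULL       0x0002
#define PD_ALL_VISIBLE     0x0004

/* Hypothetical 5-bit Transformation ID field placed at bits 3..7. */
#define PD_XFORM_SHIFT     3
#define PD_XFORM_MASK      (0x1F << PD_XFORM_SHIFT)   /* 0x00F8 */
#define XFORM_ID_NONE      0                          /* untransformed page */

static uint8_t page_get_xform_id(uint16_t pd_flags)
{
    return (uint8_t) ((pd_flags & PD_XFORM_MASK) >> PD_XFORM_SHIFT);
}

/* Replace only the ID bits; all other flags are preserved. id must be 0..31. */
static uint16_t page_set_xform_id(uint16_t pd_flags, uint8_t id)
{
    assert(id <= 0x1F);
    return (uint16_t) ((pd_flags & ~PD_XFORM_MASK) |
                       ((uint16_t) id << PD_XFORM_SHIFT));
}

/*
 * Hypothetical rotation policy: treat the ID as a key version and report
 * whether a page written under an older version should be re-transformed.
 * Assumes versions 1..31 that never wrap around.
 */
static bool page_needs_rotation(uint16_t pd_flags, uint8_t oldest_allowed)
{
    uint8_t id = page_get_xform_id(pd_flags);

    return id != XFORM_ID_NONE && id < oldest_allowed;
}
```

With only 32 values available, a real rotation scheme would eventually have to reuse IDs, which this sketch ignores.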
Best regards,
Henson Choi
On Sun, Dec 28, 2025 at 10:12 PM Zsolt Parragi <zsolt.parragi@percona.com> wrote:
Hello!
I am glad to see that there are multiple TDE extension proposals being
worked on. For context, I am one of the developers working on the
pg_tde[1] extension, as well as on the extensible SMGR proposal that
Konstantin already linked.
This patch/proposal contains two distinct parts of
encryption/extensibility: WAL and buffer manager/table data. Based on
earlier discussions, opinions on adding extension points to these
two are quite different, and because of that I'm not sure bundling
them together is helpful.
It also appears to be missing some extension points that would be
required for a more complete encryption solution, such as encrypting
temporary files or system tables, or handling command-line utilities
like pg_waldump. Do you have ideas or patches in mind for those areas
as well?
I have the same question as Konstantin: why did you choose custom
hooks for the buffer manager instead of the already existing smgr
interface / extensibility patch? While that patch is not part of the
core (but I hope it will be), it is already used by multiple companies
as it supports other use cases, not only encryption. We plan to focus
more on that thread early next year, and we would appreciate any
feedback/suggestions that could make it better for others.
I also noticed that you added additional flags to the page header.
Initially we were thinking about something like this, but decided that
fork files are better for any encryption (or other storage-related)
extra data. These few bits try to be generic, while also being
restrictive because of the limited amount of data. (And that data is
specifically per page; if I want something per file or per page range,
I still need a custom solution.)
Regarding the WAL encryption part, we took a completely different
approach, similar to how we handle normal table data (page-based). I
will need to think more about this before I can provide meaningful
feedback on that part of the patch. One initial question, however, is
whether you have run detailed benchmarks with different workloads.
That seems to be the trickiest part there, since most of the code runs
in a critical section. (Not the "unused"/"empty hook" path, but the
overhead caused by a real encryption plugin using this hook in
practice.)
On 28/12/2025 4:53 PM, Henson Choi wrote:
Subject: Re: RFC: PostgreSQL Storage I/O Transformation Hooks
Hi Konstantin,
I have great respect for the work being done on the extensible SMGR API.
It is a powerful tool for use cases that require replacing the entire
storage layer (like Neon's architecture).
However, I believe we should distinguish between Storage Management
(where/how data is stored) and Data Transformation (what the data looks
like). I see a strong case for both approaches to coexist for the
following practical reasons:
1. Separation of Concerns and Safety
Is it reasonable to ask cryptography experts to clone the entire SMGR
implementation and maintain code they don't fully understand just to
insert encryption logic? If an extension developer clones md.c to add
encryption, they become responsible for the fundamental integrity of
PostgreSQL's file I/O. Any bug in their cloned storage logic could lead
to data loss unrelated to encryption itself.
2. The Maintenance Debt of "Cloning"
When md.c receives critical security patches or bug fixes in the core,
every TDE extension maintainer would need to manually backport those
changes to their specific SMGR implementation. This creates a fragmented
ecosystem where security extensions might actually introduce storage
vulnerabilities by running outdated cloned logic.
3. Minimalist Integration
The hook approach allows crypto experts to focus strictly on transform()
and reverse_transform(). The complex storage orchestration remains with
the PostgreSQL core where it is most rigorously tested. This is a cleaner
separation of responsibilities: the core provides the trusted pipeline,
and the extension provides the specialized transformation.
Conclusion:
I believe these hooks provide a "low-barrier, high-safety" path for data
transformation that the SMGR API—by its very nature of being a full
replacement—cannot easily provide. Let's provide the SMGR for those who
want to reinvent the storage, and hooks for those who simply want to
secure the data.
Best regards,
Henson Choi
I do not think that a custom SMGR API contradicts the idea of Data
Transformation.
Do you know about the decorator pattern?
If you want to implement, e.g., data encryption, you definitely do not
need to write your storage manager from scratch.
Obviously you can (and should) use the standard storage manager (md.c) for
actually performing IO.
But your storage manager can perform some extra action before or after
IO, for example encrypt data before write and decrypt it after read.
So any pre/post/instead hooks can be easily implemented using a custom SMGR.
The opposite, unfortunately, is not possible. You cannot, for example,
implement encryption+compression using hooks.
But you can easily do it using a custom SMGR: this is how the compressed file
system (CFS) was implemented in PgPro.
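The decorator approach described above can be sketched with a function-pointer table. The extension saves the original implementation and installs wrappers around it. On write, the wrapper transforms a private copy and then delegates. On read, it delegates and then transforms the buffer. StorageOps and every function name are hypothetical stand-ins for an extensible SMGR interface. The "disk" is an in-memory array, and the XOR step is a placeholder rather than encryption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ  16               /* tiny block size for the sketch */
#define SKETCH_NBLOCKS 4

/* Hypothetical, simplified storage-manager vtable. */
typedef struct StorageOps
{
    void (*write) (uint32_t blkno, const unsigned char *buf);
    void (*read) (uint32_t blkno, unsigned char *buf);
} StorageOps;

/* Stand-in for md.c: the unmodified base implementation, backed by memory. */
static unsigned char disk[SKETCH_NBLOCKS][SKETCH_BLCKSZ];

static void base_write(uint32_t blkno, const unsigned char *buf)
{
    memcpy(disk[blkno], buf, SKETCH_BLCKSZ);
}

static void base_read(uint32_t blkno, unsigned char *buf)
{
    memcpy(buf, disk[blkno], SKETCH_BLCKSZ);
}

static StorageOps storage = {base_write, base_read};

/* Decorator state: the implementation being wrapped. */
static StorageOps wrapped;

static void toy_transform(unsigned char *buf, uint32_t blkno)
{
    for (int i = 0; i < SKETCH_BLCKSZ; i++)
        buf[i] ^= (unsigned char) (0xA5 ^ blkno);   /* placeholder, not a cipher */
}

static void enc_write(uint32_t blkno, const unsigned char *buf)
{
    unsigned char tmp[SKETCH_BLCKSZ];

    memcpy(tmp, buf, SKETCH_BLCKSZ);    /* never modify the caller's buffer */
    toy_transform(tmp, blkno);
    wrapped.write(blkno, tmp);          /* delegate the actual I/O */
}

static void enc_read(uint32_t blkno, unsigned char *buf)
{
    wrapped.read(blkno, buf);           /* delegate, then post-process */
    toy_transform(buf, blkno);
}

static void install_encryption_decorator(void)
{
    wrapped = storage;
    storage.write = enc_write;
    storage.read = enc_read;
}
```

The write path transforms a private copy because the caller's buffer may still be read or modified elsewhere while the write is in progress. test_tde's separate encrypt_buffer appears to serve the same purpose.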
Hi Konstantin,
I understand the decorator pattern, and yes, it can work for some cases.
But decorators can only intercept at the beginning and end of functions.
Looking at the actual hook locations in md.c:
- mdextend_pre_hook: after error checks, before file open → Decorator
possible
- mdwrite_pre_hook: after assertions, before I/O loop → Decorator possible
- mdread_post_hook: inside the segment loop → Decorator NOT possible
The mdreadv() function, introduced in PostgreSQL 17 as part of the
vectored I/O API, processes multiple blocks in a loop that respects
segment boundaries. The decryption hook must be called inside this loop,
after each segment's FileReadV() completes. A decorator wrapping mdreadv()
from the outside cannot access this internal loop timing.
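Some background on the segment-loop point. md.c stores each relation fork as segment files of RELSEG_SIZE blocks: 131072 blocks, i.e. 1 GB, with the default 8 kB block size. A vectored read that crosses a segment boundary is issued as one read per segment, further capped by the size of the iovec array. The sketch below reproduces only that splitting arithmetic and invokes a callback where a per-chunk post-read hook would run. The function names, the callback type and the max_iov parameter are hypothetical. The real loop also opens segment files and handles short reads and errors.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_RELSEG_SIZE 131072u      /* blocks per 1 GB segment at 8 kB BLCKSZ */

/* Called once per chunk, where a per-segment post-read hook would run. */
typedef void (*chunk_cb) (uint32_t segno, uint32_t first_block,
                          uint32_t nblocks, void *arg);

/*
 * Split [blocknum, blocknum + nblocks) the way a vectored md read does:
 * never cross a segment boundary, never exceed max_iov blocks per call.
 * Returns the number of chunks issued.
 */
static int split_vectored_read(uint32_t blocknum, uint32_t nblocks,
                               uint32_t max_iov, chunk_cb cb, void *arg)
{
    int chunks = 0;

    assert(max_iov > 0);
    while (nblocks > 0)
    {
        uint32_t segno = blocknum / SKETCH_RELSEG_SIZE;
        uint32_t left_in_seg = SKETCH_RELSEG_SIZE - blocknum % SKETCH_RELSEG_SIZE;
        uint32_t n = nblocks < left_in_seg ? nblocks : left_in_seg;

        if (n > max_iov)
            n = max_iov;
        /* ... the real code issues the read for this chunk here ... */
        cb(segno, blocknum, n, arg);    /* per-chunk post-read point */
        blocknum += n;
        nblocks -= n;
        chunks++;
    }
    return chunks;
}

/* Example callback: remember up to 8 chunks for inspection. */
typedef struct ChunkLog
{
    uint32_t segno[8];
    uint32_t first[8];
    uint32_t count[8];
    int      n;
} ChunkLog;

static void record_chunk(uint32_t segno, uint32_t first_block,
                         uint32_t nblocks, void *arg)
{
    ChunkLog *seen = arg;

    if (seen->n < 8)
    {
        seen->segno[seen->n] = segno;
        seen->first[seen->n] = first_block;
        seen->count[seen->n] = nblocks;
        seen->n++;
    }
}
```

For example, a 5-block read starting two blocks before the end of segment 0 is split into a 2-block chunk in segment 0 and a 3-block chunk in segment 1. The thread goes on to debate whether a wrapper around the whole call can do the same work after it returns, or whether the step must run per chunk (for example at AIO completion).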
With the SMGR decorator approach, the extension developer must:
- Track upstream md.c changes
- Replicate the internal loop logic to find the right decryption point
With hooks, the extension developer only needs to:
- Implement encrypt() and decrypt()
Regarding encryption+compression: that's a valid use case for SMGR,
but our primary concern is different. In South Korea, government
regulations require the use of nationally-approved cryptographic
algorithms (such as ARIA, SEED). This means organizations often cannot
adopt foreign TDE solutions, regardless of their technical merit.
We need a simple, stable hook interface that allows local security
experts to integrate these required algorithms - experts who understand
cryptography but not PostgreSQL storage internals.
If both approaches can coexist, why not provide hooks for the simple
case and SMGR for the complex case?
Best regards,
Henson Choi
On Mon, Dec 29, 2025 at 12:27 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
I do not think that a custom SMGR API contradicts the idea of Data
Transformation.
Do you know about the decorator pattern?
If you want to implement, e.g., data encryption, you definitely do not
need to write your storage manager from scratch.
Obviously you can (and should) use the standard storage manager (md.c) for
actually performing IO.
But your storage manager can perform some extra action before or after
IO, for example encrypt data before write and decrypt it after read.
So any pre/post/instead hooks can be easily implemented using a custom SMGR.
The opposite, unfortunately, is not possible. You cannot, for example,
implement encryption+compression using hooks.
But you can easily do it using a custom SMGR: this is how the compressed file
system (CFS) was implemented in PgPro.
On 28/12/2025 5:51 PM, Henson Choi wrote:
Hi Konstantin,
I understand the decorator pattern, and yes, it can work for some cases.
But decorators can only intercept at the beginning and end of functions.
Looking at the actual hook locations in md.c:
- mdextend_pre_hook: after error checks, before file open → Decorator
possible
- mdwrite_pre_hook: after assertions, before I/O loop → Decorator possible
- mdread_post_hook: inside the segment loop → Decorator NOT possible
The mdreadv() function, introduced in PostgreSQL 17 as part of the
vectored I/O API, processes multiple blocks in a loop that respects
segment boundaries. The decryption hook must be called inside this loop,
after each segment's FileReadV() completes. A decorator wrapping mdreadv()
from the outside cannot access this internal loop timing.
With the SMGR decorator approach, the extension developer must:
- Track upstream md.c changes
- Replicate the internal loop logic to find the right decryption point
With hooks, the extension developer only needs to:
- Implement encrypt() and decrypt()
Regarding encryption+compression: that's a valid use case for SMGR,
but our primary concern is different. In South Korea, government
regulations require the use of nationally-approved cryptographic
algorithms (such as ARIA, SEED). This means organizations often cannot
adopt foreign TDE solutions, regardless of their technical merit.
We need a simple, stable hook interface that allows local security
experts to integrate these required algorithms - experts who understand
cryptography but not PostgreSQL storage internals.
If both approaches can coexist, why not provide hooks for the simple
case and SMGR for the complex case?
Best regards,
Henson Choi
Hi Henson,
Thank you for the explanations.
I personally do not like hooks; I consider them a kind of
crutch needed to fix some problems with existing APIs :)
But they are quite popular in Postgres and really make it extensible.
The task of transparent data encryption is really very important for
Postgres (if for some reason it cannot be done at the file system level).
If we need to add more hooks to make it possible to add to Postgres,
then a dozen more hooks may be acceptable...
I have not investigated it precisely; maybe you are right about whether it is
possible to implement transparent encryption using the decorator
approach and a custom SMGR. Frankly speaking, I am quite upset about how AIO was
added to PG18. It introduces a hierarchy orthogonal to SMGR and causes
some tight dependencies between these two modules, which makes extending
either of them problematic, if possible at all (e.g. if I want to add my
storage manager and make AIO use it to access the file system, rather than
calling pread/pwrite directly). I am not sure that AIO could not have been added
through the SMGR hierarchy (certainly by extending this interface), but that
is a separate story having no relation to the topic of this
discussion.
So I can assume that the current coupling of AIO with SMGR makes it
impossible to plug in transparent encryption other than by adding these hooks.
Still, I am not quite sure that the proposed set of hooks is absolutely necessary
and sufficient...
- mdread_post_hook: inside the segment loop → Decorator NOT possible
The mdreadv() function, introduced in PostgreSQL 17 as part of the
vectored I/O API, processes multiple blocks in a loop that respects
segment boundaries. The decryption hook must be called inside this loop,
after each segment's FileReadV() completes. A decorator wrapping mdreadv()
from the outside cannot access this internal loop timing.
It is possible - or rather, we plan to propose a different patch for
that. There are already some discussions about the extensibility of AIO,
which is currently quite minimal, and this is another point for that.
If you look into the AIO sources, it already uses an array of
callbacks, and there's only a small missing piece there - making it
possible for extensions to add entries to that array. With that patch,
it is possible to decorate smgr_startreadv, add your own callback, and
then call the original mdstartreadv function. Since aio callbacks are
executed in the opposite order, this will work out exactly as needed,
as the AIO handler will first call the md completion handler, then
yours.
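The reverse-order argument can be modeled with a small callback stack. The decorator registers its decryption callback and then delegates to md, which registers its own completion handler. Completion runs callbacks last-registered-first, so md's handler runs before decryption, as described above. This is a self-contained model, not PostgreSQL's AIO API. IoHandle, the fixed capacity of four callbacks and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IO_CALLBACKS 4              /* hypothetical fixed capacity */

typedef void (*io_complete_cb) (void *state);

typedef struct IoHandle
{
    io_complete_cb cbs[MAX_IO_CALLBACKS];
    int            ncbs;
} IoHandle;

/* Register a completion callback; returns 0 on success, -1 if full. */
static int io_register_cb(IoHandle *ioh, io_complete_cb cb)
{
    if (ioh->ncbs >= MAX_IO_CALLBACKS)
        return -1;
    ioh->cbs[ioh->ncbs++] = cb;
    return 0;
}

/* On completion, run callbacks in reverse registration order. */
static void io_complete(IoHandle *ioh, void *state)
{
    for (int i = ioh->ncbs - 1; i >= 0; i--)
        ioh->cbs[i](state);
}

/* Trace of which completion step ran, in order. */
static char trace[8];
static int  ntrace;

static void md_complete(void *state)      { (void) state; trace[ntrace++] = 'M'; }
static void decrypt_complete(void *state) { (void) state; trace[ntrace++] = 'D'; }

/* Inner layer: md's start-read registers its own completion handler. */
static void md_startread(IoHandle *ioh)
{
    io_register_cb(ioh, md_complete);
    /* ... submit the read ... */
}

/* Decorator: register decryption first, then delegate to md. */
static void enc_startread(IoHandle *ioh)
{
    io_register_cb(ioh, decrypt_complete);
    md_startread(ioh);
}
```

After completion, the trace holds 'M' followed by 'D'.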
My logic here is similar to the previous argument: this AIO
extensibility for startreadv is also needed for other uses of the smgr
extension, most likely for everyone who uses the current patch. It
shouldn't be specific to encryption.
With the SMGR decorator approach, the extension developer must:
- Track upstream md.c changes
- Replicate the internal loop logic to find the right decryption point
With hooks, the extension developer only needs to:
- Implement encrypt() and decrypt()
We need a simple, stable hook interface that allows local security
experts to integrate these required algorithms - experts who understand
cryptography but not PostgreSQL storage internals.
Extension developers still have to understand the multiprocess nature
of postgres (with AIO you also have to remember that it is possible
for the completion to happen in a different process, possibly in a
worker process), or its unusual memory management patterns, critical
sections, and so on. You most likely also have to deal with shared
memory caches, locks, and so on.
(And as I said above, you don't have to replicate/track md.c, we only
need a good, generic extension point usable for many extensions)
In South Korea, government
regulations require the use of nationally-approved cryptographic
algorithms (such as ARIA, SEED). This means organizations often cannot
adopt foreign TDE solutions, regardless of their technical merit.
Have you considered contributing to existing solutions? Adding support
to multiple algorithms to an existing library is easier than
developing your own from scratch.
WAL and heap pages are simply different representations of the same
underlying data. Protecting only one side would be cryptographically
incomplete; an attacker could bypass encryption by reading the
unprotected side. Therefore, they must be treated as a single atomic
unit of protection.
From a security point of view, I agree. From a practical one, it's a
bit more complicated. As you mentioned South Korean regulations, we
also have regulations in the European Union, and you can conform to
the current regulations by only encrypting your data files (at least
that's what I heard, I'm not a lawyer).
So from a practical point of view, for us, even getting support for
table encryption hooks into the core would be a success.
My primary concern with using fork files for encryption metadata is crash
recovery. If a fork file and the actual data page become inconsistent
(e.g., during a crash), recovery becomes problematic because fork files
are not typically protected by WAL.
Custom WAL records about encryption events (key rotation/change/etc)
should solve this problem?
I plan to propose a separate RFC for this
"gradual rotation" mechanism.
Would this gradual rotation mechanism be useful for anything else
other than encryption extensions? While I also had the same idea, I
don't see how it would be useful for anything else, so I didn't plan
to submit any patches related to this. This is something that can be
easily implemented as a background worker in a TDE extension, and
doesn't really require core support.
On 28/12/2025 5:25 PM, Henson Choi wrote:
Subject: Re: RFC: PostgreSQL Storage I/O Transformation Hooks
Hi Zsolt,
Thank you for your detailed questions. I'll address each point:
1. Bundling WAL and Buffer Manager
WAL and heap pages are simply different representations of the same
underlying data. Protecting only one side would be cryptographically
incomplete; an attacker could bypass encryption by reading the
unprotected side. Therefore, they must be treated as a single atomic
unit of protection.
I am not an expert in cryptography; to be honest, I am a complete novice in
this area. But I have one concern about the proposed WAL encryption
(record-level encryption).
The content of some WAL records can be almost completely predicted (they
contain no user data, just some Postgres internal data which can be easily
reconstructed). I wonder whether this fact can significantly simplify the
task of breaking the cipher?
Maybe it is safer to use page-level encryption for WAL as well?
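On the question of predictable WAL contents: for a block cipher like AES, known plaintext on its own is not expected to reveal the key. In a counter-mode scheme such as the AES-256-CTR used by the reference implementation, the practical hazard is keystream reuse. If two records are ever encrypted under the same key and IV, anyone who can predict one plaintext can recover the other. The sketch below demonstrates that relation. A toy keystream generator stands in for AES-CTR; it is not secure, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy keystream generator standing in for AES-CTR; NOT secure. */
static void
toy_keystream(uint64_t key, uint64_t nonce, uint8_t *out, size_t len)
{
    /* xorshift64 would stay at 0 forever, so force a nonzero seed. */
    uint64_t    state = (key ^ (nonce * 0x9E3779B97F4A7C15ULL)) | 1;

    for (size_t i = 0; i < len; i++)
    {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        out[i] = (uint8_t) state;
    }
}

/* Stream-mode encryption and decryption: out = in XOR keystream(key, nonce). */
static void
toy_ctr_crypt(uint64_t key, uint64_t nonce,
              const uint8_t *in, uint8_t *out, size_t len)
{
    uint8_t     ks[64];

    assert(len <= sizeof(ks));
    toy_keystream(key, nonce, ks, len);
    for (size_t i = 0; i < len; i++)
        out[i] = in[i] ^ ks[i];
}

/*
 * Known-plaintext recovery: c1 XOR p1 yields the keystream, which decrypts
 * c2. This works only if c1 and c2 used the same (key, nonce).
 */
static void
recover_with_known_plaintext(const uint8_t *p1, const uint8_t *c1,
                             const uint8_t *c2, uint8_t *p2_out, size_t len)
{
    for (size_t i = 0; i < len; i++)
        p2_out[i] = c1[i] ^ p1[i] ^ c2[i];
}
```

Under this model, the relation is the same whether individual records or whole pages are encrypted. That suggests IV uniqueness matters more than the choice of granularity.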
On 12/28/25 08:49, Henson Choi wrote:
3. Proposal Specifications
3.1 The Interface (Hook Points)
We allow intervention by security experts through five contact points
along the I/O path:
* *Read/Write Hooks:* |mdread_post|, |mdwrite_pre|, |mdextend_pre|
(Transformation of the data area)
* *WAL Hooks:* |xlog_insert_pre|, |xlog_decode_pre| (Transformation of
transaction logs)
3.2 The Protocol Identifier (PageHeader Transformation ID)
We allocate 5 bits of |pd_flags| to define the “Security State” of a
page. This serves as a *Status Message* sent by the security expert to
the engine, utilized for key versioning and as a migration marker.
Isn't this rather problematic?
This seems to be meant to be extensible, which means there can be
multiple extensions setting the hooks. Which we generally allow, and the
custom is to call the previous hook.
What happens if there are multiple extensions implementing the hook?
Would that be allowed or prohibited in this case? Maybe it doesn't make
sense, but then why wouldn't it be possible?
FWIW I find it very unlikely we'd allow reserving pd_flags bits for an
extension. These bits are meant to be used by core, and there's a very
limited number of such bits.
In general, I'm somewhat skeptical of the claim a collection of hooks is
"low-barrier, high-safety". It seems pretty fragile to me, and I can
envision a lot of maintenance difficulties in the future. Not just for
the extension developers, but for the project too - adding a bunch of
random hooks is not free for us, we'll need to keep it working in future
releases, etc.
Perhaps the current SMGR code is not extensible/flexible enough, but
then we need to improve that. I'd imagine a simple SMGR doing the
encryption, but federating most of the work to a "full" SMGR. But I
haven't thought about that too much.
regards
--
Tomas Vondra
Subject: Re: RFC: PostgreSQL Storage I/O Transformation Hooks
Hi Zsolt,
Thank you for the detailed technical feedback. Let me address each point.
1. AIO Extensibility and SMGR Approach
I think the SMGR extensibility approach is equally valid. In fact, when I
realized in PG18 that buffer page reads are split between md.c (mdreadv)
and bufmgr.c (buffer_readv_complete_one), I felt some discomfort about
where to place the decryption hook. "Does this really belong in both
places?" was my first thought.
The SMGR approach could provide a cleaner, more unified integration point
for data transformation.
The main difference is timing and current availability:
- The hook approach is working today and can be used immediately
- Your SMGR extensibility work provides a more comprehensive long-term
solution
I don't see these as competing proposals. Both approaches are valid and
serve different needs. The hook infrastructure can serve as an interim
solution for organizations that need TDE now, while the community develops
the more comprehensive SMGR extensibility.
In the long term, if SMGR extensibility provides better integration points,
extensions could migrate to that approach.
2. Understanding PostgreSQL Internals
You're absolutely right that extension developers need to understand
multiprocess architecture, memory management, critical sections, and so on.
This is precisely why test_tde exists as a reference implementation. It
documents the "dance steps" with the core - showing where memory must be
pre-allocated, how to handle critical sections safely, when AIO completion
might happen in a different process, and so on.
The goal isn't to hide PostgreSQL's complexity, but to provide a working
example that shows cryptography experts exactly where and how to integrate
their algorithms within PostgreSQL's constraints.
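Below is a toy sketch of the write and read paths these "dance steps" describe. It also follows the checksum ordering from the v1 patch description:

- Scratch memory is allocated once at load time, so the write path never allocates.
- On write, the page is transformed, the Transform ID is set, and the checksum is recomputed over the transformed bytes.
- On read, the stored checksum is verified before the reverse transformation. The ID is then cleared and the checksum recomputed.

DemoPage, demo_write_pre, demo_read_post, the XOR "cipher" and the checksum function are hypothetical stand-ins. They are not the patch's API, not pg_checksum_page(), and not real cryptography.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_DATA_SIZE 64

/* Simplified page; stands in for PageHeaderData plus page contents. */
typedef struct DemoPage
{
    uint16_t    checksum;       /* stands in for pd_checksum */
    uint8_t     transform_id;   /* stands in for the proposed 5-bit field */
    uint8_t     data[DEMO_DATA_SIZE];
} DemoPage;

/* Scratch buffer allocated once at load time, never in a critical section. */
static DemoPage *scratch_page = NULL;

/* Analogous to work done in _PG_init(): allocate everything up front. */
static bool
demo_ext_init(void)
{
    scratch_page = malloc(sizeof(DemoPage));
    return scratch_page != NULL;
}

/* Toy checksum over the ID and data (NOT pg_checksum_page()). */
static uint16_t
demo_checksum(const DemoPage *page)
{
    uint32_t    sum = page->transform_id;

    for (size_t i = 0; i < DEMO_DATA_SIZE; i++)
        sum = sum * 31 + page->data[i];
    return (uint16_t) sum;
}

/* Toy XOR "cipher" keyed by the key version; NOT real encryption. */
static void
demo_toy_cipher(uint8_t *data, uint8_t key_version)
{
    for (size_t i = 0; i < DEMO_DATA_SIZE; i++)
        data[i] ^= (uint8_t) (0xA0 | key_version);
}

/* Write path: transform a copy, set the ID, then checksum the result. */
static const DemoPage *
demo_write_pre(const DemoPage *plain, uint8_t key_version)
{
    assert(scratch_page != NULL);   /* init must have run */
    assert(key_version != 0);       /* ID 0 means "untransformed" */
    memcpy(scratch_page, plain, sizeof(DemoPage));  /* no allocation here */
    demo_toy_cipher(scratch_page->data, key_version);
    scratch_page->transform_id = key_version;
    scratch_page->checksum = demo_checksum(scratch_page);
    return scratch_page;
}

/* Read path: verify the stored checksum first, then reverse the transform. */
static bool
demo_read_post(DemoPage *page)
{
    if (page->checksum != demo_checksum(page))
        return false;           /* corruption caught before decryption */
    if (page->transform_id == 0)
        return true;            /* plain page: nothing to undo */
    demo_toy_cipher(page->data, page->transform_id);
    page->transform_id = 0;
    page->checksum = demo_checksum(page);
    return true;
}
```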
3. Contributing to Existing Solutions vs Korean Regulations
I appreciate the suggestion about contributing to existing solutions. I
personally prefer the OpenSSL Provider approach for algorithm extensibility.
However, the reality is more complex.
Cryptography experts often have their own libraries developed over decades.
While it might look like "just encryption code" to me, I don't have the
authority to force them to adopt specific frameworks.
ARIA and SEED are already implemented in OpenSSL. However, Korean law
requires certified implementations. Specifically, companies must use
nationally-certified builds and provide the hash codes of those specific
library binaries to regulators. You cannot simply use the OpenSSL version,
even if the algorithm is identical.
This is why we need an extension mechanism rather than hardcoding specific
libraries into core. Different jurisdictions have different certification
requirements.
4. WAL vs Data File Encryption
You mentioned that EU regulations might be satisfied by encrypting only
data files. That's a valid practical consideration.
In Korea, regulations require the introduction of approved cryptographic
algorithms, but in practice most systems run AES due to lack of CPU
acceleration for ARIA/SEED. It's largely a legal compliance checkbox.
Regarding what to protect (WAL vs heap vs both), there's flexibility
depending on the organization and jurisdiction. The hook approach allows
extensions to choose - you can implement only the buffer hooks if that
satisfies your requirements, or add WAL hooks if needed.
5. Fork Files vs Page Header for Metadata
You asked whether custom WAL records about encryption events could solve
the crash recovery problem with fork files.
That's a reasonable approach for SMGR-based solutions where you control the
storage layer. However, with the hook approach, we don't have the ability
to inject custom WAL records for encryption events.
Currently, in a replication environment, the reference implementation
requires the same key to be configured in the settings on both primary and
replicas (shared key model). For future KMS integration, I'm considering
mechanisms to propagate keys to replicas through external channels rather
than WAL.
The page header approach was chosen because it keeps the encryption state
self-contained within each page, avoiding the need for separate metadata
synchronization.
6. Gradual Rotation Mechanism
I agree with you - I don't think core support is necessary for gradual
rotation either.
I mentioned it in my earlier email response only as a potential reference
implementation concept to guide encryption developers. It's something that
can and should be implemented in the extension's background worker, not in
core.
Summary
I see the hook approach and SMGR extensibility as equally valid, addressing
different timelines and use cases:
- Hooks: Available now, lighter-weight, sufficient for compliance-driven TDE
- SMGR extensibility: More comprehensive, cleaner architecture, better
long-term solution
Both should coexist. Organizations can use hooks today while SMGR
extensibility matures, then migrate if the SMGR approach better fits their
needs.
I'm very interested in your experience with pg_tde and the SMGR
extensibility work. If there are specific design considerations from that
work that would inform these hooks, I'd appreciate your input.
Best regards,
Henson
On Mon, Dec 29, 2025 at 2:55 AM, Zsolt Parragi <zsolt.parragi@percona.com> wrote:
Subject: Re: RFC: PostgreSQL Storage I/O Transformation Hooks
Hi Tomas,
Thank you for this critical feedback. Your concerns go to the heart of the
proposal's viability, and I appreciate your directness.
1. Multiple Extensions and Hook Chaining
You're right to question this. To be honest, I have significant doubts
about allowing multiple transformation extensions simultaneously.
The Transform ID coordination problem is real: without a registry or
protocol between extensions, they cannot cooperate safely. Hook chaining
for read/write operations might work (extension A encrypts, extension B
compresses), but the Transform ID field creates conflicts.
Perhaps I should be more direct: transformation hook chaining is not
realistically possible with the current design. TDE extensions would need
exclusive use of these hooks. This is a fundamental limitation I should
have stated clearly in the RFC.
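The conflict is visible in the usual PostgreSQL hook-chaining idiom. Each extension saves the previous hook value at load time and calls it from its own hook. In this sketch, where the hook type and extension names are hypothetical, the data transformations themselves would chain. However, both extensions write the same Transform ID slot, so whichever runs last wins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook type; the patch's real signatures may differ. */
typedef void (*page_write_hook_type) (uint8_t *transform_id);

/* Core-owned hook variable, NULL when no extension is loaded. */
static page_write_hook_type page_write_hook = NULL;

/* Extension A (e.g. encryption), loaded first. */
static page_write_hook_type prev_hook_a = NULL;

static void
ext_a_write(uint8_t *transform_id)
{
    assert(transform_id != NULL);
    if (prev_hook_a)
        prev_hook_a(transform_id);
    *transform_id = 5;          /* A records its key version */
}

static void
ext_a_init(void)
{
    prev_hook_a = page_write_hook;
    page_write_hook = ext_a_write;
}

/* Extension B (e.g. compression), loaded after A. */
static page_write_hook_type prev_hook_b = NULL;

static void
ext_b_write(uint8_t *transform_id)
{
    assert(transform_id != NULL);
    if (prev_hook_b)
        prev_hook_b(transform_id);  /* A runs first... */
    *transform_id = 9;          /* ...then B overwrites A's marker */
}

static void
ext_b_init(void)
{
    prev_hook_b = page_write_hook;
    page_write_hook = ext_b_write;
}
```

After both init functions run, a single call through page_write_hook leaves B's value in the slot. On the read side, the page would then be missing the information A needs.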
2. pd_flags Reservation - I Hope You'll Consider This
I understand your concern about reserving pd_flags bits for extensions.
However, I'd like to ask you to consider the reasoning behind this choice.
The 5-bit Transform ID serves a critical purpose: it allows the core to
identify the page's transformation state without attempting decryption.
This is important for:
- Error reporting: "This page is encrypted with transform ID 5, but no
extension is loaded to handle it"
- Migration safety: Distinguishing between untransformed pages (ID=0) and
transformed pages during gradual encryption
- Crash recovery: The core can detect transformation state inconsistencies
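For concreteness, the 5-bit field described in the v1 patch (bits 3-7 of pd_flags) can be read and written with simple masks. The three PD_* values are the existing core flags from bufpage.h. The DEMO_* names and the classification helper are hypothetical and may not match the patch. The helper shows the kind of decision core could make without attempting decryption, such as reporting "transformed page, but no handler loaded". Note that current core treats any pd_flags bit outside PD_VALID_FLAG_BITS (0x0007) as a sign of a corrupt page header. A field like this would therefore also require widening that mask.

```c
#include <assert.h>
#include <stdint.h>

/* Existing core flags (src/include/storage/bufpage.h). */
#define PD_HAS_FREE_LINES   0x0001
#define PD_PAGE_FULL        0x0002
#define PD_ALL_VISIBLE      0x0004

/* Hypothetical names for the proposed 5-bit field in bits 3-7. */
#define DEMO_TRANSFORM_ID_SHIFT 3
#define DEMO_TRANSFORM_ID_MAX   0x1F
#define DEMO_TRANSFORM_ID_MASK  (DEMO_TRANSFORM_ID_MAX << DEMO_TRANSFORM_ID_SHIFT)

_Static_assert((DEMO_TRANSFORM_ID_MASK &
                (PD_HAS_FREE_LINES | PD_PAGE_FULL | PD_ALL_VISIBLE)) == 0,
               "transform ID must not overlap existing core flags");

static inline uint8_t
demo_get_transform_id(uint16_t pd_flags)
{
    return (uint8_t) ((pd_flags & DEMO_TRANSFORM_ID_MASK) >> DEMO_TRANSFORM_ID_SHIFT);
}

/* Replace the field while leaving the core flag bits untouched. */
static inline uint16_t
demo_set_transform_id(uint16_t pd_flags, uint8_t id)
{
    assert(id <= DEMO_TRANSFORM_ID_MAX);
    return (uint16_t) ((pd_flags & ~DEMO_TRANSFORM_ID_MASK) |
                       ((uint16_t) id << DEMO_TRANSFORM_ID_SHIFT));
}

typedef enum DemoPageState
{
    DEMO_PAGE_PLAIN,
    DEMO_PAGE_TRANSFORMED_HANDLED,
    DEMO_PAGE_TRANSFORMED_NO_HANDLER
} DemoPageState;

/* What core could decide from the header alone, before any decryption. */
static DemoPageState
demo_classify_page(uint16_t pd_flags, int handler_loaded)
{
    if (demo_get_transform_id(pd_flags) == 0)
        return DEMO_PAGE_PLAIN;
    return handler_loaded ? DEMO_PAGE_TRANSFORMED_HANDLED
                          : DEMO_PAGE_TRANSFORMED_NO_HANDLER;
}
```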
That said, I recognize pd_flags is precious and limited. Let me propose an
alternative approach that might better align with core principles:
Instead of extension-specific Transform IDs, what if we allow extensions to
reserve space at pd_upper (similar to how special space works at
pd_special)?
The core could manage a small flag (2-3 bits) indicating "N bytes at
pd_upper are reserved for transformation metadata". By encoding N as
multiples of 2 or 4 bytes, we maximize the flag's efficiency:
- 2 bits encoding 4-byte multiples: 0-12 bytes (sufficient for most cases)
- 3 bits encoding 4-byte multiples: 0-28 bytes (covers all reasonable needs)
- 3 bits encoding 2-byte multiples: 0-14 bytes (finer granularity)
This approach uses minimal pd_flags bits while providing substantial
metadata space. It would:
- Keep the flag in core control (not extension-specific)
- Allow extensions to store IV, authentication tags, key version, etc. in a
standardized location
- Be self-describing (the flag tells you how much space is reserved)
- Generalize beyond encryption (compression, checksums, etc. could use it)
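The size arithmetic behind these three options can be checked directly. With b flag bits and a unit of u bytes, the largest reservation the flag can describe is (2^b - 1) * u. The encoding helpers below are a hypothetical sketch. They do not decide where in the page the reserved bytes would live.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding of "N bytes reserved" in a few pd_flags bits. */
typedef struct DemoReserveEncoding
{
    unsigned    nbits;          /* flag bits spent on the size */
    unsigned    unit;           /* bytes per encoded step */
} DemoReserveEncoding;

/* Largest describable reservation: (2^nbits - 1) * unit. */
static unsigned
demo_max_reserved(DemoReserveEncoding enc)
{
    return ((1u << enc.nbits) - 1) * enc.unit;
}

/* Encode a reservation size; returns -1 if it is not representable. */
static int
demo_encode_reserved(DemoReserveEncoding enc, unsigned nbytes)
{
    if (nbytes % enc.unit != 0 || nbytes > demo_max_reserved(enc))
        return -1;
    return (int) (nbytes / enc.unit);
}

/* Decode the flag value back into a byte count. */
static unsigned
demo_decode_reserved(DemoReserveEncoding enc, unsigned code)
{
    assert(code < (1u << enc.nbits));
    return code * enc.unit;
}
```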
In our internal implementation, we actually add opaque bytes to PageHeader
for encryption metadata. This pd_upper approach could formalize that
pattern for extensions.
I believe some form of page-level metadata for transformations is
necessary. Would either approach (Transform ID or pd_upper reservation) be
acceptable with the right design, or do you see fundamental issues with
page-level transformation metadata itself?
3. Maintenance Burden and Test Coverage
I deeply appreciate this concern. Having worked across various DBMS
implementations, I've seen solution vendors ship without comprehensive
regression testing - but never a database vendor. DBMS maintenance is
extraordinarily difficult, and storage errors are catastrophic.
This is precisely why test_tde exists as a reference implementation. But
you've identified the real issue: we need much stronger test coverage for
the hooks themselves.
The test cases should:
- Detect when core changes break hook contracts
- Verify hook behavior under all I/O paths (sync, async, error cases)
- Validate critical section safety
- Test interaction with checksums, crash recovery, replication
I agree the current test coverage is insufficient for core inclusion. Would
expanding the test suite to cover these scenarios address your maintenance
concerns, or do you see fundamental fragility beyond what testing can solve?
4. Hooks vs Transform Layer - Pragmatic Timeline
You suggested improving SMGR extensibility rather than adding hooks. I
think you're architecturally right about the long-term direction.
However, I want to be pragmatic about timelines:
The hook and pd_flags approach, despite its limitations, can deliver
working TDE in the shortest time. Organizations facing regulatory deadlines
need something that works now, not in 2-3 years.
That said, your feedback has sparked a better idea: what if we think of
this not as "SMGR extension" or "hooks" but as a pluggable Transform Layer
that SMGR and WAL subsystems delegate to?
Conceptually:
        Application Layer
                |
          Buffer Manager
                |
      +-------------------+
      |  Transform Layer  |  <-- Encryption, etc.
      +-------------------+
                |
            SMGR / WAL
                |
             File I/O
This is architecturally cleaner than scattered hooks, and more focused than
full SMGR extensibility. The Transform Layer would:
- Provide a unified interface for data transformation
- Work across backend, frontend tools, and replication
- Handle metadata management in a standardized way
- Support encryption, compression, or other transformations
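One hypothetical shape for such an interface is a single callback table that both the storage and WAL paths call through, similar in spirit to other PostgreSQL routine tables. All names below are invented for illustration, and the toy XOR provider is not encryption. Allowing only one registered provider is an assumption; it sidesteps the chaining conflict discussed earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum DemoTransformTarget
{
    DEMO_TARGET_DATA_PAGE,
    DEMO_TARGET_WAL
} DemoTransformTarget;

/* Hypothetical callback table a transform provider would register. */
typedef struct DemoTransformRoutine
{
    const char *name;
    /* per-page metadata bytes the provider needs (IV, tag, key version) */
    size_t      (*metadata_size) (void);
    /* forward transform before storage; false on failure */
    bool        (*transform) (DemoTransformTarget target,
                              const uint8_t *in, uint8_t *out, size_t len);
    /* reverse transform after read */
    bool        (*untransform) (DemoTransformTarget target,
                                const uint8_t *in, uint8_t *out, size_t len);
} DemoTransformRoutine;

/* Identity provider: the default when nothing is loaded. */
static size_t
identity_metadata_size(void)
{
    return 0;
}

static bool
identity_copy(DemoTransformTarget target, const uint8_t *in, uint8_t *out, size_t len)
{
    (void) target;
    memmove(out, in, len);
    return true;
}

static const DemoTransformRoutine identity_routine =
{"identity", identity_metadata_size, identity_copy, identity_copy};

/* Toy provider (XOR, NOT real encryption) to show a non-identity transform. */
static size_t
toy_metadata_size(void)
{
    return 16;
}

static bool
toy_xor(DemoTransformTarget target, const uint8_t *in, uint8_t *out, size_t len)
{
    (void) target;
    for (size_t i = 0; i < len; i++)
        out[i] = in[i] ^ 0x5A;
    return true;
}

static const DemoTransformRoutine toy_routine =
{"toy_xor", toy_metadata_size, toy_xor, toy_xor};

static const DemoTransformRoutine *active_transform = &identity_routine;

/* Register once; a second provider is refused rather than chained. */
static bool
demo_register_transform(const DemoTransformRoutine *routine)
{
    assert(routine != NULL && routine->transform && routine->untransform);
    if (active_transform != &identity_routine)
        return false;
    active_transform = routine;
    return true;
}

/* Single call site shared by the storage and WAL paths. */
static bool
demo_write_through_layer(DemoTransformTarget target,
                         const uint8_t *in, uint8_t *out, size_t len)
{
    return active_transform->transform(target, in, out, len);
}
```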
I think this deserves its own discussion thread rather than conflating it
with the current hook proposal. Would you be interested in starting a
separate conversation about designing a Transform Layer interface for
PostgreSQL?
In the meantime, the hook approach could serve organizations with immediate
needs, and extensions could migrate to the Transform Layer once it's
stabilized.
5. Frontend Tool Access
Both SMGR and hook approaches face a shared limitation: frontend tools
(pg_checksums, pg_basebackup, etc.) that read files directly.
I previously suggested allowing initdb to specify a shared library that
both backend and frontend can load for transformation. But as I reconsider
this, it feels like it converges toward the Transform Layer idea: a
well-defined interface that any PostgreSQL component can use.
This might be the real architectural question: not "hooks vs SMGR" but "how
should PostgreSQL provide transformation points that work across backend,
frontend, and replication boundaries?"
Summary
Your feedback has clarified three important points:
1. The current hook design has real limitations (multiple extension
conflicts, pd_flags concerns)
2. Test coverage needs to be much more comprehensive
3. A cleaner abstraction might be needed long-term
I propose a dual approach:
Short-term: Move forward with the hook proposal for organizations with
immediate regulatory needs. I commit to:
- Stating clearly that hook chaining is not supported
- Significantly expanding test coverage
- Treating this as a pragmatic solution with known limitations
Long-term: I'd like to start a separate discussion about a Transform Layer
abstraction - a unified interface that could handle data transformation
across backend, frontend tools, and replication. This would be
architecturally cleaner than scattered hooks, and could eventually
supersede this approach.
Would you be willing to review a Transform Layer proposal in a separate
thread? I think it addresses the architectural concerns you've raised,
while the hook approach serves immediate practical needs.
Best regards,
Henson
On Mon, Dec 29, 2025 at 4:24 AM, Tomas Vondra <tomas@vondra.me> wrote:
Hi hackers,
This is the fourth version of the Storage I/O Transformation Hooks patch
series for implementing Transparent Data Encryption (TDE) in PostgreSQL.
Changes in v4:
This version fixes cross-platform compatibility issues found in CI testing
that caused failures on BSD and Windows:
- Fixed BSD regression test warning about tablespace naming conventions
(renamed to "regress_tde_tblspc")
- Fixed Windows test failures caused by platform-specific shell commands
(mkdir -p)
- Replaced filesystem-based tablespace tests with
allow_in_place_tablespaces approach for cross-platform compatibility
The core hook infrastructure (patch 0001) and reference TDE implementation
(patch 0002) remain unchanged from v3. Patch 0003 contains only the test
compatibility fixes.
Patch series:
0001: Core hook infrastructure for I/O transformation
0002: Reference TDE implementation using AES-256-CTR
0003: Cross-platform test fixes for BSD and Windows
Testing:
The test_tde extension demonstrates:
- Page-level encryption/decryption with AES-256-CTR
- IV derivation using LSN, block number, and relation file number
- Tablespace-level encryption configuration
- WAL encryption support
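This is a sketch of how a 16-byte page IV could be assembled from the three inputs listed above. The layout (LSN, then relation file number, then block number, all big-endian) and every name here are assumptions; the actual test_tde layout is not shown in this message. One caveat applies if such a value is used directly as an AES-CTR initial counter. The counter advances once per 16-byte block, 512 times for an 8 kB page, and that carries into the block-number bytes. Pages with the same LSN and relation but adjacent block numbers would then share keystream. A real design would need to reserve counter bits or derive the IV differently. The scheme also assumes the page LSN changes whenever the page contents change before a write.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t DemoXLogRecPtr;        /* stands in for the page LSN */
typedef uint32_t DemoRelFileNumber;     /* stands in for RelFileNumber */
typedef uint32_t DemoBlockNumber;       /* stands in for BlockNumber */

/* Store the low n bytes of v in big-endian order. */
static void
demo_put_be(uint8_t *dst, uint64_t v, int n)
{
    assert(n > 0 && n <= 8);
    for (int i = n - 1; i >= 0; i--)
    {
        dst[i] = (uint8_t) v;
        v >>= 8;
    }
}

/*
 * Hypothetical layout: LSN (8) | relfilenumber (4) | block number (4).
 * All 16 bytes are used, so no counter space is reserved (see caveat above).
 */
static void
demo_derive_page_iv(uint8_t iv[16], DemoXLogRecPtr lsn,
                    DemoRelFileNumber rel, DemoBlockNumber blkno)
{
    demo_put_be(iv, lsn, 8);
    demo_put_be(iv + 8, rel, 4);
    demo_put_be(iv + 12, blkno, 4);
}
```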
These fixes resolve the BSD and Windows test failures.
Best regards,
On Sun, Dec 28, 2025 at 11:19 PM, Henson Choi <assam258@gmail.com> wrote:
Hi,
Here is v3 of the Storage I/O Transform Hooks patch.
Changes from v2:
- Fix -Wincompatible-pointer-types error in bufmgr.c by casting
&bufdata to (void **) for the mdread_post_hook call
v2 changes were:
- Add meson.build test configuration for test_tde extension
--
Best regards,
Sungkyun Park
On Sun, Dec 28, 2025 at 7:44 PM, Henson Choi <assam258@gmail.com> wrote:
Updated patches with meson build support:
v2:
- Added meson.build for test_tde extension
- Added test_tde to contrib/meson.build
Regards,
Henson Choi
On Sun, Dec 28, 2025 at 6:47 PM, Henson Choi <assam258@gmail.com> wrote:
Hello,
Following up on the RFC, I am submitting the initial patch set for the
proposed infrastructure. These patches introduce a minimal hook-based
protocol to allow extensions to handle data transformation, such as TDE,
while keeping the PostgreSQL core independent of specific cryptographic
implementations.
Implementation Details:
Hook Points in Storage I/O Path
The patch introduces five strategic hook points:
- mdread_post_hook: Called after blocks are read from disk. The extension
can reverse-transform data in place.
- mdwrite_pre_hook & mdextend_pre_hook: Called before writing or extending
blocks. These hooks return a pointer to transformed buffers.
- xlog_insert_pre_hook & xlog_decode_pre_hook: Handle transformation for
WAL records during insertion and replay.
Data Integrity and Checksum Protocol
To ensure robust error detection, the hooks follow a specific
verification protocol:
- On Write: The extension transforms the page, sets the Transform ID, then
recalculates the checksum on the transformed data.
- On Read: The extension verifies the on-disk checksum of the transformed
data first. After reverse-transformation, it clears the Transform ID and
recalculates the checksum for the plaintext data. This ensures corruption
is detected regardless of the transformation state.
WAL Safety via XLR_BLOCK_ID_TRANSFORMED (251)
For WAL records, I have introduced a specific block ID (251) to mark
transformed data. If the decryption extension is not loaded, the WAL reader
will encounter this unknown block ID and fail-fast, preventing the system
from incorrectly interpreting encrypted data as valid WAL records.
PageHeader Transform ID (5-bit)
I have allocated bits 3-7 of pd_flags in the PageHeader for a Transform
ID. This allows the engine and extensions to identify the transformation
state of a page (e.g., key versioning or algorithm type) without attempting
decryption. It ensures backward compatibility: pages with Transform ID 0
are treated as standard untransformed pages.
Memory and Critical Section Safety
As demonstrated in the contrib/test_tde reference implementation, cipher
contexts are pre-allocated in _PG_init to avoid memory allocation during
critical sections. For WAL transformation,
MemoryContextAllowInCriticalSection() is used to allow buffer reallocation
within critical sections; if OOM occurs during buffer growth, it results in
a controlled PANIC.
Performance Considerations
When hooks are not set (default), the overhead is limited to a single
NULL pointer comparison per I/O operation. This is architecturally
consistent with existing PostgreSQL hooks and is designed to have a
negligible impact on performance.
Attached Patches:
- v20251228-0001-Add-Storage-I-O-Transform-Hooks-for-PostgreSQL.patch:
Core infrastructure.
- v20251228-0002-Add-test_tde-extension-for-TDE-testing.patch: Reference
implementation using AES-256-CTR.
I look forward to your comments and feedback.
Regards,
Henson Choi
On Sun, Dec 28, 2025 at 4:49 PM, Henson Choi <assam258@gmail.com> wrote:
Attachments:
v20251229-v4-0001-Add-Storage-I-O-Transform-Hooks-for-PostgreSQL.patch
From 82ce5cc05f1ce0311a2eedd559f1db7a7703f126 Mon Sep 17 00:00:00 2001
From: Henson Choi <assam258@gmail.com>
Date: Tue, 2 Dec 2025 21:50:12 +0900
Subject: [PATCH v4 1/3] Add Storage I/O Transform Hooks for PostgreSQL
This patch introduces a set of hook points that allow extensions to
intercept and transform data during storage I/O operations. The hooks
are designed to support transparent data encryption (TDE) and similar
use cases that require data transformation at the storage layer.
The following hooks are added:
- mdread_post_hook, mdwrite_pre_hook and mdextend_pre_hook in md.c for
block-level transformation during storage reads, writes and relation
extension; mdread_post_hook is also invoked from
buffer_readv_complete_one() in bufmgr.c so that asynchronous reads are
reverse-transformed before checksum verification
- xlog_insert_pre_hook in xloginsert.c for WAL record transformation
after assembly and before insertion
- xlog_decode_pre_hook in xlogreader.c for WAL record transformation
before validation and decoding during reading and replay
Each hook is optional and defaults to NULL, ensuring no overhead when
extensions are not loaded.
Author: Henson Choi <assam258@gmail.com>
---
src/backend/access/transam/xloginsert.c | 10 ++++
src/backend/access/transam/xlogreader.c | 21 ++++++++
src/backend/storage/buffer/bufmgr.c | 9 ++++
src/backend/storage/smgr/md.c | 20 ++++++++
src/include/access/xloginsert.h | 20 ++++++++
src/include/access/xlogreader.h | 20 ++++++++
src/include/access/xlogrecord.h | 5 ++
src/include/storage/bufpage.h | 25 +++++++++-
src/include/storage/md.h | 65 +++++++++++++++++++++++++
9 files changed, 194 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index a56d5a55282..f518ef3f16f 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -136,6 +136,12 @@ static bool begininsert_called = false;
/* Memory context to hold the registered buffer and data references. */
static MemoryContext xloginsert_cxt;
+/*
+ * Hook variable for WAL insert transformation (e.g., encryption).
+ * Extensions can set this hook to transform assembled WAL data before insertion.
+ */
+xlog_insert_pre_hook_type xlog_insert_pre_hook = NULL;
+
static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr RedoRecPtr, bool doPageWrites,
XLogRecPtr *fpw_lsn, int *num_fpi,
@@ -526,6 +532,10 @@ XLogInsert(RmgrId rmid, uint8 info)
&fpw_lsn, &num_fpi, &fpi_bytes,
&topxid_included);
+ /* Pre-insert hook for transformation (e.g., encryption) */
+ if (xlog_insert_pre_hook)
+ rdt = xlog_insert_pre_hook(rdt);
+
EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags, num_fpi,
fpi_bytes, topxid_included);
} while (!XLogRecPtrIsValid(EndPos));
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5e5001b2101..169f2b06fc5 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -40,6 +40,13 @@
#include "common/logging.h"
#endif
+/*
+ * Hook variable for WAL record transformation (e.g., decryption).
+ * Extensions can set this hook to transform raw WAL data before decoding.
+ * Frontend tools can also set this hook at startup.
+ */
+xlog_decode_pre_hook_type xlog_decode_pre_hook = NULL;
+
static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
pg_attribute_printf(2, 3);
static void allocate_recordbuf(XLogReaderState *state, uint32 reclength);
@@ -843,6 +850,11 @@ restart:
Assert(gotheader);
record = (XLogRecord *) state->readRecordBuf;
+
+ /* Pre-validation hook for transformation (e.g., decryption) */
+ if (xlog_decode_pre_hook)
+ record = xlog_decode_pre_hook(state, record, RecPtr, true);
+
if (!ValidXLogRecord(state, record, RecPtr))
goto err;
@@ -862,6 +874,15 @@ restart:
goto err;
/* Record does not cross a page boundary */
+
+ /*
+ * Pre-validation hook for transformation (e.g., decryption).
+ * inplace_allowed is false because record points to readBuf, which
+ * may be copied back to WAL files (e.g., FinishWalRecovery).
+ */
+ if (xlog_decode_pre_hook)
+ record = xlog_decode_pre_hook(state, record, RecPtr, false);
+
if (!ValidXLogRecord(state, record, RecPtr))
goto err;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index eb55102b0d7..ea0b62e98f2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -57,6 +57,7 @@
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/md.h"
#include "storage/proc.h"
#include "storage/read_stream.h"
#include "storage/smgr.h"
@@ -7401,6 +7402,14 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8 buf_off, Buffer buffer,
VALGRIND_MAKE_MEM_DEFINED(bufdata, BLCKSZ);
#endif
+ /* Decrypt block before checksum verification */
+ if (mdread_post_hook)
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&tag);
+
+ mdread_post_hook(&rlocator, tag.forkNum, tag.blockNum, (void **) &bufdata, 1);
+ }
+
if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags,
failed_checksum))
{
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 71bcdeb6601..5416128d2cc 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -96,6 +96,14 @@ typedef struct _MdfdVec
static MemoryContext MdCxt; /* context for all MdfdVec objects */
+/*
+ * Hook variables for I/O transformation (e.g., encryption/decryption).
+ * Extensions can set these hooks to transform data during storage I/O.
+ */
+mdread_post_hook_type mdread_post_hook = NULL;
+mdwrite_pre_hook_type mdwrite_pre_hook = NULL;
+mdextend_pre_hook_type mdextend_pre_hook = NULL;
+
/* Populate a file tag describing an md.c segment file. */
#define INIT_MD_FILETAG(a,xx_rlocator,xx_forknum,xx_segno) \
@@ -513,6 +521,10 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
relpath(reln->smgr_rlocator, forknum).str,
InvalidBlockNumber)));
+ /* Pre-extend hook for transformation (e.g., encryption) */
+ if (mdextend_pre_hook)
+ buffer = mdextend_pre_hook(&reln->smgr_rlocator.locator, forknum, blocknum, buffer);
+
v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -972,6 +984,10 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
iovcnt = compute_remaining_iovec(iov, iov, iovcnt, nbytes);
}
+ /* Post-read hook for transformation (e.g., decryption) */
+ if (mdread_post_hook)
+ mdread_post_hook(&reln->smgr_rlocator.locator, forknum, blocknum, buffers, nblocks_this_segment);
+
nblocks -= nblocks_this_segment;
buffers += nblocks_this_segment;
blocknum += nblocks_this_segment;
@@ -1064,6 +1080,10 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert((uint64) blocknum + (uint64) nblocks <= (uint64) mdnblocks(reln, forknum));
#endif
+ /* Pre-write hook for transformation (e.g., encryption) */
+ if (mdwrite_pre_hook)
+ buffers = mdwrite_pre_hook(&reln->smgr_rlocator.locator, forknum, blocknum, buffers, nblocks);
+
while (nblocks > 0)
{
struct iovec iov[PG_IOV_MAX];
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index d6a71415d4f..cc54459ad33 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -19,6 +19,26 @@
#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+/* Forward declaration for XLogRecData */
+struct XLogRecData;
+
+/*
+ * Hook function type for WAL insert transformation (e.g., encryption).
+ * Called after XLogRecordAssemble() but before XLogInsertRecord().
+ * Extension can transform the assembled WAL record data for encryption.
+ * Returns the (possibly modified) XLogRecData chain to be inserted.
+ *
+ * The first node's data points to XLogRecord header, which contains
+ * xl_rmid and xl_info if needed by the hook.
+ *
+ * On failure, the hook should either PANIC or return the original rdata
+ * as fallback.
+ */
+typedef struct XLogRecData *(*xlog_insert_pre_hook_type) (struct XLogRecData *rdata);
+
+/* Hook variable for WAL insert transformation */
+extern PGDLLIMPORT xlog_insert_pre_hook_type xlog_insert_pre_hook;
+
/*
* The minimum size of the WAL construction working area. If you need to
* register more than XLR_NORMAL_MAX_BLOCK_ID block references or have more
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index dfabbbd57d4..898d52a1013 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -400,6 +400,26 @@ extern bool DecodeXLogRecord(XLogReaderState *state,
XLogRecPtr lsn,
char **errormsg);
+/*
+ * Hook function type for WAL record transformation (e.g., decryption).
+ * Called before ValidXLogRecord() and DecodeXLogRecord().
+ * Extension can decrypt or transform the raw record data.
+ * Returns the (possibly modified) XLogRecord to be validated and decoded.
+ *
+ * If inplace_allowed is true, the hook may modify the record in place.
+ * If false, the hook must allocate a new buffer and return it.
+ *
+ * On failure, the hook should either PANIC or return the original record
+ * as fallback.
+ */
+typedef XLogRecord *(*xlog_decode_pre_hook_type) (XLogReaderState *state,
+ XLogRecord *record,
+ XLogRecPtr lsn,
+ bool inplace_allowed);
+
+/* Hook variable for WAL record transformation */
+extern PGDLLIMPORT xlog_decode_pre_hook_type xlog_decode_pre_hook;
+
/*
* Macros that provide access to parts of the record most recently returned by
* XLogReadRecord() or XLogNextRecord().
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index a06833ce0a3..9cfb2aff5ae 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -244,5 +244,10 @@ typedef struct XLogRecordDataHeaderLong
#define XLR_BLOCK_ID_DATA_LONG 254
#define XLR_BLOCK_ID_ORIGIN 253
#define XLR_BLOCK_ID_TOPLEVEL_XID 252
+/*
+ * I/O transform hook marker. Uses same header format as XLogRecordDataHeaderLong
+ * (1 byte id + 4 bytes length). Use SizeOfXLogRecordDataHeaderLong for size.
+ */
+#define XLR_BLOCK_ID_TRANSFORMED 251
#endif /* XLOGRECORD_H */
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index abc2cf2a020..f18f77d3d22 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -189,7 +189,17 @@ typedef PageHeaderData *PageHeader;
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+/*
+ * Transform ID field (5 bits: values 0-31) for I/O transform extensions.
+ * Value 0 means the page is not transformed (backward compatible).
+ * Values 1-31 are available for extensions to define their own meanings
+ * (e.g., encryption key versions, algorithm identifiers, migration markers).
+ */
+#define PD_TRANSFORM_ID_MASK 0x00F8 /* bits 3-7 */
+#define PD_TRANSFORM_ID_SHIFT 3
+#define PD_TRANSFORM_NONE 0 /* not transformed (core reserved) */
+
+#define PD_VALID_FLAG_BITS 0x00FF /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -441,6 +451,19 @@ PageClearAllVisible(Page page)
((PageHeader) page)->pd_flags &= ~PD_ALL_VISIBLE;
}
+static inline uint8
+PageGetTransformId(const PageData *page)
+{
+ return (((const PageHeaderData *) page)->pd_flags & PD_TRANSFORM_ID_MASK) >> PD_TRANSFORM_ID_SHIFT;
+}
+static inline void
+PageSetTransformId(Page page, uint8 id)
+{
+ ((PageHeader) page)->pd_flags =
+ (((PageHeader) page)->pd_flags & ~PD_TRANSFORM_ID_MASK) |
+ ((id << PD_TRANSFORM_ID_SHIFT) & PD_TRANSFORM_ID_MASK);
+}
+
/*
* These two require "access/transam.h", so left as macros.
*/
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index b563c27abf0..0a766a2b61f 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -22,6 +22,71 @@
extern PGDLLIMPORT const PgAioHandleCallbacks aio_md_readv_cb;
+/*
+ * Hook function types for I/O transformation (e.g., encryption/decryption).
+ * These hooks allow extensions to transform data during storage I/O operations.
+ */
+
+/*
+ * Called after blocks are read from disk, before PostgreSQL's checksum verification.
+ * Extension can reverse-transform (e.g., decrypt) the data in place.
+ *
+ * For synchronous reads, called from mdreadv() after read completes.
+ * For AIO reads, called from buffer_readv_complete_one() before PageIsVerified().
+ *
+ * Note: The hook is responsible for verifying on-disk checksum before reverse
+ * transformation and recalculating checksum after transformation. This ensures
+ * data integrity is verified at both stages and PostgreSQL's checksum verification
+ * passes.
+ *
+ * On failure, the hook should raise an ERROR (or PANIC for critical errors).
+ */
+typedef void (*mdread_post_hook_type) (RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers,
+ BlockNumber nblocks);
+
+/*
+ * Called before mdwritev() writes blocks to disk.
+ * Extension can transform (e.g., encrypt) data.
+ * Returns pointer to transformed buffers array (hook manages the memory,
+ * typically using static local storage).
+ *
+ * Note: The hook should recalculate checksum on transformed data after
+ * transformation. This on-disk checksum will be verified on read before
+ * reverse transformation, ensuring disk-level data integrity.
+ *
+ * On failure, the hook should raise an ERROR (or PANIC for critical errors),
+ * or return the original buffers with a WARNING as fallback.
+ */
+typedef const void **(*mdwrite_pre_hook_type) (RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers,
+ BlockNumber nblocks);
+
+/*
+ * Called before mdextend() extends a relation with new blocks.
+ * Returns pointer to transformed buffer (hook manages the memory,
+ * typically using static local storage).
+ *
+ * Note: Same as write hook - the hook should recalculate checksum on
+ * transformed data after transformation.
+ *
+ * On failure, the hook should raise an ERROR (or PANIC for critical errors),
+ * or return the original buffer with a WARNING as fallback.
+ */
+typedef const void *(*mdextend_pre_hook_type) (RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ const void *buffer);
+
+/* Hook variables for I/O transformation */
+extern PGDLLIMPORT mdread_post_hook_type mdread_post_hook;
+extern PGDLLIMPORT mdwrite_pre_hook_type mdwrite_pre_hook;
+extern PGDLLIMPORT mdextend_pre_hook_type mdextend_pre_hook;
+
/* md storage manager functionality */
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
--
2.50.1 (Apple Git-155)
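The XLR_BLOCK_ID_TRANSFORMED marker that the patch above adds to xlogrecord.h reuses the long data-header layout: a 1-byte ID followed by a 4-byte length, 5 bytes in total. A minimal sketch of writing and recognising such a header follows. The helper names are hypothetical, and the sketch assumes, as core does for XLR_BLOCK_ID_DATA_LONG, that the length is copied unaligned in native byte order. It shows only the byte layout, not where test_tde places the marker inside an encrypted record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XLR_BLOCK_ID_TRANSFORMED 251    /* value from the patch */
/* Same size as SizeOfXLogRecordDataHeaderLong: 1-byte id + 4-byte length. */
#define SKETCH_LONG_HDR_SIZE (sizeof(uint8_t) + sizeof(uint32_t))

/* Hypothetical helper: emit the marker header, return bytes written. */
static size_t
write_transformed_header(uint8_t *dst, uint32_t payload_len)
{
    dst[0] = XLR_BLOCK_ID_TRANSFORMED;
    memcpy(dst + 1, &payload_len, sizeof(uint32_t));   /* unaligned copy */
    return SKETCH_LONG_HDR_SIZE;
}

/* Hypothetical helper: return 1 and the payload length if a marker is present. */
static int
read_transformed_header(const uint8_t *src, size_t avail, uint32_t *payload_len)
{
    if (avail < SKETCH_LONG_HDR_SIZE || src[0] != XLR_BLOCK_ID_TRANSFORMED)
        return 0;
    memcpy(payload_len, src + 1, sizeof(uint32_t));
    return 1;
}
```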
v20251229-v4-0002-Add-test_tde-extension-for-TDE-testing.patch
From 68179d7770a4bd8abed5aabb261ef1e03f838500 Mon Sep 17 00:00:00 2001
From: Henson Choi <assam258@gmail.com>
Date: Tue, 2 Dec 2025 21:51:13 +0900
Subject: [PATCH v4 2/3] Add test_tde extension for TDE testing
This extension provides a reference implementation for validating the
Storage I/O Transform Hooks introduced in the previous commit. It uses
AES-256-CTR encryption with IV derived from page metadata (LSN, block
number, relation file number) to ensure uniqueness.
The extension registers hooks for:
- Buffer page read/write transformation (mdread/mdwrite/mdextend)
- WAL record insert and replay transformation
Key features:
- Encryption key configured via test_tde.key GUC (256-bit hex)
- System catalogs and pg_global tablespace excluded from encryption
- Pre-allocated cipher context to avoid allocation in critical sections
- WAL records marked with block ID 251 for encrypted record detection
This is intended for development and testing purposes only, not for
production use. The implementation lacks key rotation, proper key
management, and security auditing.
Author: Henson Choi <assam258@gmail.com>
---
contrib/Makefile | 4 +-
contrib/meson.build | 1 +
contrib/test_tde/.gitignore | 3 +
contrib/test_tde/Makefile | 27 +
contrib/test_tde/expected/basic.out | 177 +++++
contrib/test_tde/meson.build | 37 +
contrib/test_tde/sql/basic.sql | 146 ++++
contrib/test_tde/test_tde.c | 1131 +++++++++++++++++++++++++++
contrib/test_tde/test_tde.conf | 2 +
9 files changed, 1526 insertions(+), 2 deletions(-)
create mode 100644 contrib/test_tde/.gitignore
create mode 100644 contrib/test_tde/Makefile
create mode 100644 contrib/test_tde/expected/basic.out
create mode 100644 contrib/test_tde/meson.build
create mode 100644 contrib/test_tde/sql/basic.sql
create mode 100644 contrib/test_tde/test_tde.c
create mode 100644 contrib/test_tde/test_tde.conf
diff --git a/contrib/Makefile b/contrib/Makefile
index 2f0a88d3f77..151eb823850 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -54,9 +54,9 @@ SUBDIRS = \
vacuumlo
ifeq ($(with_ssl),openssl)
-SUBDIRS += pgcrypto sslinfo
+SUBDIRS += pgcrypto sslinfo test_tde
else
-ALWAYS_SUBDIRS += pgcrypto sslinfo
+ALWAYS_SUBDIRS += pgcrypto sslinfo test_tde
endif
ifneq ($(with_uuid),no)
diff --git a/contrib/meson.build b/contrib/meson.build
index ed30ee7d639..a592b947702 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -65,6 +65,7 @@ subdir('sslinfo')
subdir('tablefunc')
subdir('tcn')
subdir('test_decoding')
+subdir('test_tde')
subdir('tsm_system_rows')
subdir('tsm_system_time')
subdir('unaccent')
diff --git a/contrib/test_tde/.gitignore b/contrib/test_tde/.gitignore
new file mode 100644
index 00000000000..2ea3752951a
--- /dev/null
+++ b/contrib/test_tde/.gitignore
@@ -0,0 +1,3 @@
+log
+results
+tmp_check
diff --git a/contrib/test_tde/Makefile b/contrib/test_tde/Makefile
new file mode 100644
index 00000000000..b2455d3831e
--- /dev/null
+++ b/contrib/test_tde/Makefile
@@ -0,0 +1,27 @@
+# contrib/test_tde/Makefile
+
+MODULE_big = test_tde
+OBJS = \
+ $(WIN32RES) \
+ test_tde.o
+
+PGFILEDESC = "test_tde - reference implementation for I/O transform hooks"
+
+REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_tde/test_tde.conf
+REGRESS = basic
+# Disabled because these tests require "shared_preload_libraries=test_tde"
+NO_INSTALLCHECK = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_tde
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# OpenSSL is required for encryption
+SHLIB_LINK += $(filter -lcrypto, $(LIBS))
diff --git a/contrib/test_tde/expected/basic.out b/contrib/test_tde/expected/basic.out
new file mode 100644
index 00000000000..9932cf43614
--- /dev/null
+++ b/contrib/test_tde/expected/basic.out
@@ -0,0 +1,177 @@
+-- Basic test for test_tde extension
+-- Verify that encryption/decryption works correctly
+-- Show current settings
+SHOW test_tde.key;
+ test_tde.key
+------------------------------------------------------------------
+ 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
+(1 row)
+
+-- Create a test table
+CREATE TABLE test_encrypt (
+ id serial PRIMARY KEY,
+ secret_data text,
+ secret_number integer
+);
+-- Insert some data
+INSERT INTO test_encrypt (secret_data, secret_number) VALUES
+ ('This is secret data', 12345),
+ ('Another secret message', 67890),
+ ('PostgreSQL TDE test', 11111);
+-- Force a checkpoint to ensure data is written to disk
+CHECKPOINT;
+-- Read data back - should be decrypted correctly
+SELECT * FROM test_encrypt ORDER BY id;
+ id | secret_data | secret_number
+----+------------------------+---------------
+ 1 | This is secret data | 12345
+ 2 | Another secret message | 67890
+ 3 | PostgreSQL TDE test | 11111
+(3 rows)
+
+-- Update some data
+UPDATE test_encrypt SET secret_data = 'Updated secret' WHERE id = 1;
+-- Verify update worked
+SELECT * FROM test_encrypt WHERE id = 1;
+ id | secret_data | secret_number
+----+----------------+---------------
+ 1 | Updated secret | 12345
+(1 row)
+
+-- Test with larger data
+INSERT INTO test_encrypt (secret_data, secret_number)
+SELECT
+ repeat('Large data block ', 100),
+ generate_series
+FROM generate_series(1, 10);
+-- Count rows
+SELECT COUNT(*) FROM test_encrypt;
+ count
+-------
+ 13
+(1 row)
+
+-- Test with NULL values
+INSERT INTO test_encrypt (secret_data, secret_number) VALUES (NULL, NULL);
+SELECT * FROM test_encrypt WHERE secret_data IS NULL;
+ id | secret_data | secret_number
+----+-------------+---------------
+ 14 | |
+(1 row)
+
+-- Test index creation (index pages should also be encrypted)
+CREATE INDEX ON test_encrypt (secret_number);
+-- Use the index
+SELECT secret_data FROM test_encrypt WHERE secret_number = 12345;
+ secret_data
+----------------
+ Updated secret
+(1 row)
+
+-- Clean up
+DROP TABLE test_encrypt;
+-- =============================================================================
+-- DDL Tests: Operations that change RelFileNumber
+-- These operations create new files and write records through storage hooks,
+-- so encryption/decryption works correctly.
+-- =============================================================================
+-- -----------------------------------------------------------------------------
+-- Test 1: TRUNCATE (creates new file, writes through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_truncate (id int, data text);
+INSERT INTO test_truncate VALUES (1, 'before truncate');
+SELECT * FROM test_truncate;
+ id | data
+----+-----------------
+ 1 | before truncate
+(1 row)
+
+TRUNCATE test_truncate;
+-- Insert new data after truncate - works fine (new file, new encryption through hooks)
+INSERT INTO test_truncate VALUES (2, 'after truncate');
+SELECT * FROM test_truncate;
+ id | data
+----+----------------
+ 2 | after truncate
+(1 row)
+
+DROP TABLE test_truncate;
+-- -----------------------------------------------------------------------------
+-- Test 2: CLUSTER (rewrites table through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_cluster (id int PRIMARY KEY, data text);
+INSERT INTO test_cluster SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+CHECKPOINT;
+CLUSTER test_cluster USING test_cluster_pkey;
+-- Works fine - data rewritten through storage hooks
+SELECT COUNT(*) FROM test_cluster;
+ count
+-------
+ 100
+(1 row)
+
+SELECT * FROM test_cluster WHERE id = 50;
+ id | data
+----+---------
+ 50 | data 50
+(1 row)
+
+DROP TABLE test_cluster;
+-- -----------------------------------------------------------------------------
+-- Test 3: VACUUM FULL (rewrites table through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_vacuum_full (id int, data text);
+INSERT INTO test_vacuum_full SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+DELETE FROM test_vacuum_full WHERE id > 50;
+CHECKPOINT;
+VACUUM FULL test_vacuum_full;
+-- Works fine - data rewritten through storage hooks
+SELECT COUNT(*) FROM test_vacuum_full;
+ count
+-------
+ 50
+(1 row)
+
+DROP TABLE test_vacuum_full;
+-- -----------------------------------------------------------------------------
+-- Test 4: REINDEX (rebuilds index through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_reindex (id int PRIMARY KEY, data text);
+INSERT INTO test_reindex SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+CHECKPOINT;
+REINDEX INDEX test_reindex_pkey;
+-- Works fine - index rebuilt through storage hooks
+SET enable_seqscan = off;
+SELECT * FROM test_reindex WHERE id = 50;
+ id | data
+----+---------
+ 50 | data 50
+(1 row)
+
+RESET enable_seqscan;
+DROP TABLE test_reindex;
+-- =============================================================================
+-- Additional DDL Tests: Operations that change RelFileNumber or copy files
+-- These also go through storage hooks, so encryption/decryption works correctly.
+-- =============================================================================
+-- -----------------------------------------------------------------------------
+-- Test 5: ALTER TABLE SET TABLESPACE
+-- RelFileNumber changes, but data is copied through storage hooks
+-- -----------------------------------------------------------------------------
+\! mkdir -p /tmp/test_tde_tablespace
+CREATE TABLESPACE test_tde_tblspc LOCATION '/tmp/test_tde_tablespace';
+CREATE TABLE test_set_tablespace (id int, data text);
+INSERT INTO test_set_tablespace SELECT g, 'data ' || g FROM generate_series(1, 50) g;
+CHECKPOINT;
+-- Move to different tablespace - data copied through storage hooks
+ALTER TABLE test_set_tablespace SET TABLESPACE test_tde_tblspc;
+-- Works fine - data was re-encrypted with new RelFileNumber
+SELECT COUNT(*) FROM test_set_tablespace;
+ count
+-------
+ 50
+(1 row)
+
+DROP TABLE test_set_tablespace;
+DROP TABLESPACE test_tde_tblspc;
+\! rm -rf /tmp/test_tde_tablespace
diff --git a/contrib/test_tde/meson.build b/contrib/test_tde/meson.build
new file mode 100644
index 00000000000..329e1a4b8e2
--- /dev/null
+++ b/contrib/test_tde/meson.build
@@ -0,0 +1,37 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+if not ssl.found()
+ subdir_done()
+endif
+
+test_tde_sources = files(
+ 'test_tde.c',
+)
+
+if host_system == 'windows'
+ test_tde_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tde',
+ '--FILEDESC', 'test_tde - reference implementation for I/O transform hooks',])
+endif
+
+test_tde = shared_module('test_tde',
+ test_tde_sources,
+ kwargs: contrib_mod_args + {
+ 'dependencies': [ssl, contrib_mod_args['dependencies']]
+ },
+)
+contrib_targets += test_tde
+
+tests += {
+ 'name': 'test_tde',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'basic',
+ ],
+ 'regress_args': ['--temp-config', files('test_tde.conf')],
+ # Disabled because these tests require "shared_preload_libraries=test_tde"
+ 'runningcheck': false,
+ },
+}
diff --git a/contrib/test_tde/sql/basic.sql b/contrib/test_tde/sql/basic.sql
new file mode 100644
index 00000000000..9b2651afee8
--- /dev/null
+++ b/contrib/test_tde/sql/basic.sql
@@ -0,0 +1,146 @@
+-- Basic test for test_tde extension
+-- Verify that encryption/decryption works correctly
+
+-- Show current settings
+SHOW test_tde.key;
+
+-- Create a test table
+CREATE TABLE test_encrypt (
+ id serial PRIMARY KEY,
+ secret_data text,
+ secret_number integer
+);
+
+-- Insert some data
+INSERT INTO test_encrypt (secret_data, secret_number) VALUES
+ ('This is secret data', 12345),
+ ('Another secret message', 67890),
+ ('PostgreSQL TDE test', 11111);
+
+-- Force a checkpoint to ensure data is written to disk
+CHECKPOINT;
+
+-- Read data back - should be decrypted correctly
+SELECT * FROM test_encrypt ORDER BY id;
+
+-- Update some data
+UPDATE test_encrypt SET secret_data = 'Updated secret' WHERE id = 1;
+
+-- Verify update worked
+SELECT * FROM test_encrypt WHERE id = 1;
+
+-- Test with larger data
+INSERT INTO test_encrypt (secret_data, secret_number)
+SELECT
+ repeat('Large data block ', 100),
+ generate_series
+FROM generate_series(1, 10);
+
+-- Count rows
+SELECT COUNT(*) FROM test_encrypt;
+
+-- Test with NULL values
+INSERT INTO test_encrypt (secret_data, secret_number) VALUES (NULL, NULL);
+SELECT * FROM test_encrypt WHERE secret_data IS NULL;
+
+-- Test index creation (index pages should also be encrypted)
+CREATE INDEX ON test_encrypt (secret_number);
+
+-- Use the index
+SELECT secret_data FROM test_encrypt WHERE secret_number = 12345;
+
+-- Clean up
+DROP TABLE test_encrypt;
+
+-- =============================================================================
+-- DDL Tests: Operations that change RelFileNumber
+-- These operations create new files and write records through storage hooks,
+-- so encryption/decryption works correctly.
+-- =============================================================================
+
+-- -----------------------------------------------------------------------------
+-- Test 1: TRUNCATE (creates new file, writes through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_truncate (id int, data text);
+INSERT INTO test_truncate VALUES (1, 'before truncate');
+SELECT * FROM test_truncate;
+
+TRUNCATE test_truncate;
+
+-- Insert new data after truncate - works fine (new file, new encryption through hooks)
+INSERT INTO test_truncate VALUES (2, 'after truncate');
+SELECT * FROM test_truncate;
+
+DROP TABLE test_truncate;
+
+-- -----------------------------------------------------------------------------
+-- Test 2: CLUSTER (rewrites table through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_cluster (id int PRIMARY KEY, data text);
+INSERT INTO test_cluster SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+CHECKPOINT;
+
+CLUSTER test_cluster USING test_cluster_pkey;
+
+-- Works fine - data rewritten through storage hooks
+SELECT COUNT(*) FROM test_cluster;
+SELECT * FROM test_cluster WHERE id = 50;
+
+DROP TABLE test_cluster;
+
+-- -----------------------------------------------------------------------------
+-- Test 3: VACUUM FULL (rewrites table through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_vacuum_full (id int, data text);
+INSERT INTO test_vacuum_full SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+DELETE FROM test_vacuum_full WHERE id > 50;
+CHECKPOINT;
+
+VACUUM FULL test_vacuum_full;
+
+-- Works fine - data rewritten through storage hooks
+SELECT COUNT(*) FROM test_vacuum_full;
+
+DROP TABLE test_vacuum_full;
+
+-- -----------------------------------------------------------------------------
+-- Test 4: REINDEX (rebuilds index through hooks)
+-- -----------------------------------------------------------------------------
+CREATE TABLE test_reindex (id int PRIMARY KEY, data text);
+INSERT INTO test_reindex SELECT g, 'data ' || g FROM generate_series(1, 100) g;
+CHECKPOINT;
+
+REINDEX INDEX test_reindex_pkey;
+
+-- Works fine - index rebuilt through storage hooks
+SET enable_seqscan = off;
+SELECT * FROM test_reindex WHERE id = 50;
+RESET enable_seqscan;
+
+DROP TABLE test_reindex;
+
+-- =============================================================================
+-- Additional DDL Tests: Operations that change RelFileNumber or copy files
+-- These also go through storage hooks, so encryption/decryption works correctly.
+-- =============================================================================
+
+-- -----------------------------------------------------------------------------
+-- Test 5: ALTER TABLE SET TABLESPACE
+-- RelFileNumber changes, but data is copied through storage hooks
+-- -----------------------------------------------------------------------------
+\! mkdir -p /tmp/test_tde_tablespace
+CREATE TABLESPACE test_tde_tblspc LOCATION '/tmp/test_tde_tablespace';
+
+CREATE TABLE test_set_tablespace (id int, data text);
+INSERT INTO test_set_tablespace SELECT g, 'data ' || g FROM generate_series(1, 50) g;
+CHECKPOINT;
+
+-- Move to different tablespace - data copied through storage hooks
+ALTER TABLE test_set_tablespace SET TABLESPACE test_tde_tblspc;
+
+-- Works fine - data was re-encrypted with new RelFileNumber
+SELECT COUNT(*) FROM test_set_tablespace;
+
+DROP TABLE test_set_tablespace;
+DROP TABLESPACE test_tde_tblspc;
+\! rm -rf /tmp/test_tde_tablespace
diff --git a/contrib/test_tde/test_tde.c b/contrib/test_tde/test_tde.c
new file mode 100644
index 00000000000..f70359f1c26
--- /dev/null
+++ b/contrib/test_tde/test_tde.c
@@ -0,0 +1,1131 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_tde.c
+ * Reference implementation for Storage I/O Transform Hooks
+ *
+ * WARNING: This is for TESTING ONLY. Do not use in production.
+ * - Key stored in plaintext GUC
+ * - No key rotation
+ * - Minimal error handling
+ * - Not audited for security
+ *
+ * For production TDE, use a dedicated extension project.
+ *
+ * This extension demonstrates how to use the storage I/O transform hooks
+ * for transparent data encryption. It uses AES-256-CTR for encryption
+ * with IV derived from page metadata and block location.
+ *
+ * Author: Henson Choi <assam258@gmail.com>
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/test_tde/test_tde.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <openssl/err.h>
+#include <openssl/evp.h>
+#include <string.h>
+
+#include "access/transam.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogreader.h"
+#include "access/xlogrecord.h"
+#include "catalog/pg_tablespace_d.h"
+#include "fmgr.h"
+#include "port/pg_crc32c.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
+#include "storage/md.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC_EXT(
+ .name = "test_tde",
+ .version = PG_VERSION
+);
+
+/* ----------
+ * GUC variables
+ * ----------
+ */
+static char *test_tde_key_hex = NULL; /* 64 hex chars = 256 bits */
+
+/* ----------
+ * Module state
+ * ----------
+ */
+
+/*
+ * Memory context for encryption buffers.
+ * Allows allocation in critical sections (for WAL encryption).
+ */
+static MemoryContext test_tde_cxt = NULL;
+
+/*
+ * Transform ID for this extension.
+ * Value 1 means page is encrypted with test_tde.
+ * Value 0 means page is not transformed (plaintext).
+ */
+#define TEST_TDE_TRANSFORM_ID 1
+
+/*
+ * Dynamic buffers for encrypted pages.
+ * Grows as needed; kept until backend exit (PostgreSQL 15+ never calls _PG_fini).
+ */
+static char *encrypt_buffer = NULL;
+static const void **encrypt_buffer_ptrs = NULL;
+static BlockNumber encrypt_buffer_nblocks = 0;
+
+/*
+ * WAL encryption buffer - allocated from test_tde_cxt which allows
+ * allocation in critical sections via MemoryContextAllowInCriticalSection().
+ */
+static char *wal_encrypt_buffer = NULL;
+static Size wal_encrypt_buffer_size = 0;
+
+/*
+ * WAL decryption buffer - static, only needed for records within a single page.
+ * When inplace_allowed=false, record doesn't cross page boundary, so max size
+ * is XLOG_BLCKSZ.
+ */
+static char wal_decrypt_buffer[XLOG_BLCKSZ];
+
+/*
+ * Pre-allocated OpenSSL cipher context.
+ * Created in _PG_init() and reused for all encrypt/decrypt operations.
+ * This avoids memory allocation in critical sections.
+ */
+static EVP_CIPHER_CTX *cipher_ctx = NULL;
+
+/*
+ * Transformed WAL record structure (using XLR_BLOCK_ID_TRANSFORMED from xlogrecord.h):
+ * [XLogRecord header]
+ * [block_id=251 (1B)]
+ * [payload_length (4B)]
+ * [IV (16B)]
+ * [encrypted payload]
+ *
+ * The block ID 251 marks this record as transformed. After decryption,
+ * the marker, length, and IV are removed, restoring the original structure.
+ * If decryption is not performed, the unknown block ID causes parse failure.
+ *
+ * Note: The 21-byte overhead may temporarily cause xl_tot_len to exceed
+ * XLogRecordMaxSize after encryption. This is safe because:
+ * - XLogRecordMaxSize is only checked in XLogRecordAssemble() before our hook
+ * - XLogInsertRecord() does not re-validate the size
+ * - The decode hook removes the overhead before WAL parsing, restoring the
+ * original size which was already validated
+ */
+#define WAL_ENCRYPT_IV_SIZE 16
+#define WAL_ENCRYPT_OVERHEAD (SizeOfXLogRecordDataHeaderLong + WAL_ENCRYPT_IV_SIZE)
+#define WAL_CRC_SIZE sizeof(pg_crc32c) /* 4 bytes */
+#define WAL_IV_RANDOM_SIZE (WAL_ENCRYPT_IV_SIZE - WAL_CRC_SIZE) /* 12 bytes */
+
+/* Static XLogRecData for returning encrypted WAL */
+static XLogRecData wal_rdata_head;
+
+/* Previous hook values (for chaining) */
+static mdread_post_hook_type prev_mdread_post_hook = NULL;
+static mdwrite_pre_hook_type prev_mdwrite_pre_hook = NULL;
+static mdextend_pre_hook_type prev_mdextend_pre_hook = NULL;
+static xlog_insert_pre_hook_type prev_xlog_insert_pre_hook = NULL;
+static xlog_decode_pre_hook_type prev_xlog_decode_pre_hook = NULL;
+
+/* ----------
+ * Function declarations
+ * ----------
+ */
+
+/* Module entry points */
+void _PG_init(void);
+void _PG_fini(void);
+
+/* GUC callbacks */
+static bool check_test_tde_key(char **newval, void **extra, GucSource source);
+
+/* Hook functions */
+static void test_tde_mdread_post(RelFileLocator *rlocator, ForkNumber forknum,
+ BlockNumber blocknum, void **buffers,
+ BlockNumber nblocks);
+static const void **test_tde_mdwrite_pre(RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers,
+ BlockNumber nblocks);
+static const void *test_tde_mdextend_pre(RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ const void *buffer);
+static struct XLogRecData *test_tde_xlog_insert_pre(struct XLogRecData *rdata);
+static XLogRecord *test_tde_xlog_decode_pre(XLogReaderState *state,
+ XLogRecord *record,
+ XLogRecPtr lsn,
+ bool inplace_allowed);
+
+/* Internal helper functions */
+static void ensure_encrypt_buffer(BlockNumber nblocks);
+static bool parse_hex_key(const char *hex, unsigned char *out, int outlen);
+static void derive_iv(unsigned char *iv, RelFileLocator *rlocator,
+ BlockNumber blocknum, XLogRecPtr lsn);
+static void transform_data(const unsigned char *in, unsigned char *out,
+ int len, const unsigned char *iv);
+static bool should_transform(RelFileLocator *rlocator, ForkNumber forknum);
+
+
+/* ----------
+ * Internal helper functions
+ * ----------
+ */
+
+/*
+ * Parse hex string to bytes
+ */
+static bool
+parse_hex_key(const char *hex, unsigned char *out, int outlen)
+{
+ int i;
+ int hexlen;
+
+ if (hex == NULL)
+ return false;
+
+ hexlen = strlen(hex);
+ if (hexlen != outlen * 2)
+ return false;
+
+ for (i = 0; i < outlen; i++)
+ {
+ int hi,
+ lo;
+ char c;
+
+ c = hex[i * 2];
+ if (c >= '0' && c <= '9')
+ hi = c - '0';
+ else if (c >= 'a' && c <= 'f')
+ hi = c - 'a' + 10;
+ else if (c >= 'A' && c <= 'F')
+ hi = c - 'A' + 10;
+ else
+ return false;
+
+ c = hex[i * 2 + 1];
+ if (c >= '0' && c <= '9')
+ lo = c - '0';
+ else if (c >= 'a' && c <= 'f')
+ lo = c - 'a' + 10;
+ else if (c >= 'A' && c <= 'F')
+ lo = c - 'A' + 10;
+ else
+ return false;
+
+ out[i] = (hi << 4) | lo;
+ }
+
+ return true;
+}
+
+/*
+ * Ensure encrypt buffer can hold 'nblocks' pages.
+ * Grows by 2x when needed. Uses test_tde_cxt for persistence.
+ */
+static void
+ensure_encrypt_buffer(BlockNumber nblocks)
+{
+ if (encrypt_buffer_nblocks >= nblocks)
+ return;
+
+ if (encrypt_buffer == NULL)
+ {
+ BlockNumber initial = Max(8, nblocks);
+ Size size = (Size) initial * BLCKSZ;
+
+ encrypt_buffer = MemoryContextAllocAligned(test_tde_cxt, size,
+ PG_IO_ALIGN_SIZE, 0);
+ encrypt_buffer_ptrs = MemoryContextAlloc(test_tde_cxt,
+ initial * sizeof(void *));
+ encrypt_buffer_nblocks = initial;
+ }
+ else
+ {
+ BlockNumber new_nblocks = encrypt_buffer_nblocks;
+ Size new_size;
+
+ while (new_nblocks < nblocks)
+ new_nblocks *= 2;
+
+ new_size = (Size) new_nblocks * BLCKSZ;
+
+ /* repalloc doesn't preserve alignment, so allocate new and copy */
+ {
+ char *new_buffer = MemoryContextAllocAligned(test_tde_cxt,
+ new_size,
+ PG_IO_ALIGN_SIZE, 0);
+
+ memcpy(new_buffer, encrypt_buffer,
+ (Size) encrypt_buffer_nblocks * BLCKSZ);
+ pfree(encrypt_buffer);
+ encrypt_buffer = new_buffer;
+ }
+
+ encrypt_buffer_ptrs = repalloc(encrypt_buffer_ptrs,
+ new_nblocks * sizeof(void *));
+ encrypt_buffer_nblocks = new_nblocks;
+ }
+
+ /* Update pointers array */
+ for (BlockNumber i = 0; i < encrypt_buffer_nblocks; i++)
+ encrypt_buffer_ptrs[i] = encrypt_buffer + (Size) i * BLCKSZ;
+}
+
+
+/*
+ * Derive IV from page location and header
+ *
+ * IV structure (16 bytes) - simple, deterministic layout:
+ *
+ * AES-CTR mode only requires IV uniqueness, not randomness.
+ * The combination of LSN + RelFileNumber + BlockNumber is intended to be unique:
+ * - LSN: position of the last WAL record that modified the page
+ * - RelFileNumber: unique within a database (dbOid is not included)
+ * - BlockNumber: unique within a fork (ForkNumber is not included)
+ * Caveat: hint-bit-only updates can rewrite a page without advancing its
+ * LSN, so such writes reuse the previous IV. Pages stamped with one LSN by
+ * a single WAL record (e.g., B-tree split) are separated only by BlockNumber.
+ *
+ * Layout (high entropy bytes first, low entropy bytes last for CTR counter space):
+ * [0-3] LSN low 32 bits - changes frequently (high entropy)
+ * [4-5] LSN bits 32-47 - mid entropy
+ * [6-8] BlockNumber low 24 bits
+ * [9-11] RelFileNumber low 24 bits
+ * [12] BlockNumber high 8 bits - usually 0 for small tables
+ * [13] RelFileNumber high 8 bits - usually 0
+ * [14-15] LSN bits 48-63 - usually 0, counter space for CTR
+ *
+ * CTR counter space analysis:
+ * - Page size: 8KB, encrypted area: 8168 bytes (excluding 24-byte header)
+ * - AES block size: 16 bytes
+ * - AES blocks per page: ceil(8168/16) = 511 (0x1FF), i.e. 510 increments
+ * - The counter is big-endian, so increments land in IV[14-15] and only
+ *   carry into IV[13] if IV[14-15] (LSN bits 48-63) exceeds 0xFFFF - 510
+ * - Collision requires same IV[0-11], which means same LSN+BlockNum+RelNum
+ *
+ * Note: spcOid and dbOid are not used; RelFileNumber is assumed sufficient.
+ *
+ * Known limitation: Operations that copy/move files while changing
+ * RelFileNumber without going through storage hooks cause decryption failure.
+ */
+static void
+derive_iv(unsigned char *iv, RelFileLocator *rlocator,
+ BlockNumber blocknum, XLogRecPtr lsn)
+{
+
+ /*
+ * Layout: High entropy first, low entropy (usually 0) last.
+ * [LSN low 4B][LSN mid 2B][BlockNum low 3B][RelNum low 3B]
+ * [BlockNum high 1B][RelNum high 1B][LSN high 2B]
+ */
+
+ /* LSN low 32 bits - bytes 0-3 (high entropy, changes frequently) */
+ iv[0] = (uint8) ((lsn >> 0) & 0xFF);
+ iv[1] = (uint8) ((lsn >> 8) & 0xFF);
+ iv[2] = (uint8) ((lsn >> 16) & 0xFF);
+ iv[3] = (uint8) ((lsn >> 24) & 0xFF);
+
+ /* LSN bits 32-47 - bytes 4-5 (mid entropy) */
+ iv[4] = (uint8) ((lsn >> 32) & 0xFF);
+ iv[5] = (uint8) ((lsn >> 40) & 0xFF);
+
+ /* BlockNumber low 24 bits - bytes 6-8 */
+ iv[6] = (uint8) ((blocknum >> 0) & 0xFF);
+ iv[7] = (uint8) ((blocknum >> 8) & 0xFF);
+ iv[8] = (uint8) ((blocknum >> 16) & 0xFF);
+
+ /* RelFileNumber low 24 bits - bytes 9-11 */
+ iv[9] = (uint8) ((rlocator->relNumber >> 0) & 0xFF);
+ iv[10] = (uint8) ((rlocator->relNumber >> 8) & 0xFF);
+ iv[11] = (uint8) ((rlocator->relNumber >> 16) & 0xFF);
+
+ /* BlockNumber high 8 bits - byte 12 (usually 0 for small tables) */
+ iv[12] = (uint8) ((blocknum >> 24) & 0xFF);
+
+ /* RelFileNumber high 8 bits - byte 13 (usually 0) */
+ iv[13] = (uint8) ((rlocator->relNumber >> 24) & 0xFF);
+
+ /* LSN bits 48-63 - bytes 14-15 (usually 0, counter space for CTR) */
+ iv[14] = (uint8) ((lsn >> 48) & 0xFF);
+ iv[15] = (uint8) ((lsn >> 56) & 0xFF);
+}
+
+/*
+ * Encrypt or decrypt data using AES-256-CTR
+ *
+ * AES-CTR is symmetric: encrypt and decrypt use the same operation.
+ */
+static void
+transform_data(const unsigned char *in, unsigned char *out, int len,
+ const unsigned char *iv)
+{
+ int outlen,
+ tmplen;
+
+ if (len <= 0)
+ return;
+
+ /*
+ * cipher_ctx is pre-allocated and initialized with cipher/key in _PG_init().
+ * Here we only set IV (cipher=NULL, key=NULL), which avoids internal
+ * memory allocation. This is critical for WAL encryption which runs
+ * inside critical sections. We use PANIC for all errors.
+ */
+ if (cipher_ctx == NULL)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: cipher context not initialized")));
+
+ if (EVP_EncryptInit_ex(cipher_ctx, NULL, NULL, NULL, iv) != 1)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: EVP_EncryptInit_ex failed: %s",
+ ERR_error_string(ERR_get_error(), NULL))));
+
+ if (EVP_EncryptUpdate(cipher_ctx, out, &outlen, in, len) != 1)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: EVP_EncryptUpdate failed: %s",
+ ERR_error_string(ERR_get_error(), NULL))));
+
+ if (EVP_EncryptFinal_ex(cipher_ctx, out + outlen, &tmplen) != 1)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: EVP_EncryptFinal_ex failed: %s",
+ ERR_error_string(ERR_get_error(), NULL))));
+}
+
+/*
+ * Check if we should encrypt/decrypt this relation
+ *
+ * This test implementation encrypts all forks of relations whose
+ * relfilenumber is >= FirstNormalObjectId; production code would check a policy.
+ */
+static bool
+should_transform(RelFileLocator *rlocator, ForkNumber forknum)
+{
+ /* Skip if cipher not initialized (key not configured) */
+ if (cipher_ctx == NULL)
+ return false;
+
+ /* Skip system catalog tablespace (pg_global) */
+ if (rlocator->spcOid == GLOBALTABLESPACE_OID)
+ return false;
+
+ /*
+ * Skip relfilenumbers below FirstNormalObjectId: those files were created
+ * by initdb without encryption. This tests the relfilenumber, not the OID,
+ * so a catalog rewritten later (e.g. by VACUUM FULL) is encrypted as well.
+ */
+ if (rlocator->relNumber < FirstNormalObjectId)
+ return false;
+
+ (void) forknum; /* all forks are encrypted for user tables */
+
+ return true;
+}
+
+
+/* ----------
+ * Hook functions - Page I/O
+ * ----------
+ */
+
+/*
+ * Post-read hook: decrypt blocks after reading from disk
+ */
+static void
+test_tde_mdread_post(RelFileLocator *rlocator, ForkNumber forknum,
+ BlockNumber blocknum, void **buffers,
+ BlockNumber nblocks)
+{
+ BlockNumber i;
+ unsigned char iv[16];
+
+ /* Chain to previous hook if any */
+ if (prev_mdread_post_hook)
+ prev_mdread_post_hook(rlocator, forknum, blocknum, buffers, nblocks);
+
+ for (i = 0; i < nblocks; i++)
+ {
+ PageHeader phdr = (PageHeader) buffers[i];
+ uint16 checksum;
+ uint8 transform_id;
+
+ /* Skip empty/new pages */
+ if (PageIsNew((Page) buffers[i]))
+ continue;
+
+ /* Skip if page doesn't look valid */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ continue;
+
+ /* Check transform ID - skip if page is not encrypted by us */
+ transform_id = PageGetTransformId((Page) buffers[i]);
+ if (transform_id == PD_TRANSFORM_NONE)
+ continue; /* Page is not encrypted */
+
+ if (transform_id != TEST_TDE_TRANSFORM_ID)
+ {
+ elog(DEBUG1, "test_tde: skipping block %u with transform ID %u (not ours)",
+ blocknum + i, transform_id);
+ continue;
+ }
+
+ /* Page is encrypted but cipher not initialized - fatal error */
+ if (cipher_ctx == NULL)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: encrypted page found but encryption key not configured"),
+ errdetail("Block %u of relation %u/%u/%u fork %d has transform ID %u.",
+ blocknum + i, rlocator->spcOid, rlocator->dbOid,
+ rlocator->relNumber, forknum, transform_id)));
+
+ /* Verify checksum on encrypted data before decryption */
+ if (DataChecksumsEnabled())
+ {
+ checksum = pg_checksum_page((char *) buffers[i], blocknum + i);
+ if (checksum != phdr->pd_checksum)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("page verification failed, calculated checksum %u but expected %u",
+ checksum, phdr->pd_checksum)));
+ }
+ }
+
+ /* Derive IV using LSN from page header */
+ derive_iv(iv, rlocator, blocknum + i, PageGetLSN((Page) buffers[i]));
+
+ /* Decrypt the data area in place; the page header is never encrypted */
+ transform_data((unsigned char *) buffers[i] + SizeOfPageHeaderData,
+ (unsigned char *) buffers[i] + SizeOfPageHeaderData,
+ BLCKSZ - SizeOfPageHeaderData, iv);
+
+ /* Clear transform ID and recalculate checksum for plaintext data */
+ PageSetTransformId((Page) buffers[i], PD_TRANSFORM_NONE);
+ PageSetChecksumInplace((Page) buffers[i], blocknum + i);
+ }
+}
+
+/*
+ * Helper: encrypt a single page into the encrypt_buffer at given offset.
+ * Returns pointer to encrypted page, or original buffer if page was skipped.
+ */
+static const void *
+encrypt_page(RelFileLocator *rlocator, BlockNumber blocknum,
+ const void *buffer, Size buffer_offset)
+{
+ unsigned char iv[16];
+ PageHeader phdr = (PageHeader) buffer;
+ char *dest = encrypt_buffer + buffer_offset;
+
+ /* Skip empty/new pages */
+ if (PageIsNew((Page) buffer))
+ return buffer;
+
+ /* Skip if page doesn't look valid */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ return buffer;
+
+ /* Derive IV using LSN from page header */
+ derive_iv(iv, rlocator, blocknum, PageGetLSN((Page) buffer));
+
+ /* Copy header, encrypt data area */
+ memcpy(dest, buffer, SizeOfPageHeaderData);
+ transform_data((unsigned char *) buffer + SizeOfPageHeaderData,
+ (unsigned char *) dest + SizeOfPageHeaderData,
+ BLCKSZ - SizeOfPageHeaderData, iv);
+
+ /* Set transform ID to mark page as encrypted */
+ PageSetTransformId((Page) dest, TEST_TDE_TRANSFORM_ID);
+
+ /* Recalculate checksum for encrypted data */
+ PageSetChecksumInplace((Page) dest, blocknum);
+
+ return dest;
+}
+
+/*
+ * Pre-write hook: encrypt blocks before writing to disk
+ */
+static const void **
+test_tde_mdwrite_pre(RelFileLocator *rlocator, ForkNumber forknum,
+ BlockNumber blocknum, const void **buffers,
+ BlockNumber nblocks)
+{
+ BlockNumber i;
+
+ /* Chain to previous hook if any */
+ if (prev_mdwrite_pre_hook)
+ buffers = prev_mdwrite_pre_hook(rlocator, forknum, blocknum, buffers, nblocks);
+
+ if (!should_transform(rlocator, forknum))
+ return buffers;
+
+ /* Ensure buffer is large enough */
+ ensure_encrypt_buffer(nblocks);
+
+ for (i = 0; i < nblocks; i++)
+ encrypt_buffer_ptrs[i] = encrypt_page(rlocator, blocknum + i,
+ buffers[i], (Size) i * BLCKSZ);
+
+ return encrypt_buffer_ptrs;
+}
+
+/*
+ * Pre-extend hook: encrypt block before extending relation
+ */
+static const void *
+test_tde_mdextend_pre(RelFileLocator *rlocator, ForkNumber forknum,
+ BlockNumber blocknum, const void *buffer)
+{
+ /* Chain to previous hook if any */
+ if (prev_mdextend_pre_hook)
+ buffer = prev_mdextend_pre_hook(rlocator, forknum, blocknum, buffer);
+
+ if (!should_transform(rlocator, forknum))
+ return buffer;
+
+ /* Ensure buffer is large enough for at least 1 block */
+ ensure_encrypt_buffer(1);
+
+ return encrypt_page(rlocator, blocknum, buffer, 0);
+}
+
+
+/* ----------
+ * Hook functions - WAL I/O
+ * ----------
+ */
+
+/*
+ * Ensure WAL encryption buffer is large enough.
+ * Uses test_tde_cxt which allows allocation in critical sections.
+ */
+static void
+ensure_wal_encrypt_buffer(Size needed)
+{
+ if (wal_encrypt_buffer_size >= needed)
+ return;
+
+ if (wal_encrypt_buffer == NULL)
+ wal_encrypt_buffer = MemoryContextAlloc(test_tde_cxt, needed);
+ else
+ wal_encrypt_buffer = repalloc(wal_encrypt_buffer, needed);
+ wal_encrypt_buffer_size = needed;
+}
+
+/*
+ * WAL insert pre-hook: encrypt WAL record data
+ *
+ * Strategy:
+ * 1. Copy XLogRecord header and payload
+ * 2. Save plaintext CRC from header (xl_crc contains payload CRC at this point)
+ * 3. Build IV: [plaintext CRC (4B)] [random (12B)]
+ * 4. Insert transformation header (block ID 251 + payload_length) and IV
+ * 5. Encrypt original payload with the IV
+ * 6. Update xl_tot_len and recalculate CRC for encrypted payload
+ *
+ * Resulting record structure:
+ * [XLogRecord header]
+ * [block_id=251 (1B)]
+ * [payload_length (4B)]
+ * [IV 16B]
+ * [encrypted payload]
+ *
+ * The block ID 251 marks this record as encrypted. After decryption,
+ * the marker, length, and IV are removed, restoring the original structure.
+ * If decryption is not performed, the unknown block ID causes parse failure.
+ */
+static struct XLogRecData *
+test_tde_xlog_insert_pre(struct XLogRecData *rdata)
+{
+ XLogRecData *node;
+ XLogRecord *rechdr;
+ char *bufptr;
+ char *new_payload_start;
+ uint32 orig_total_len;
+ uint32 orig_payload_len;
+ uint32 new_total_len;
+ uint32 transform_payload_len;
+ unsigned char iv[WAL_ENCRYPT_IV_SIZE];
+ pg_crc32c plaintext_crc;
+
+ /* Chain to previous hook if any */
+ if (prev_xlog_insert_pre_hook)
+ rdata = prev_xlog_insert_pre_hook(rdata);
+
+ /* Skip if cipher not initialized (key not configured) */
+ if (cipher_ctx == NULL)
+ return rdata;
+
+ /* First node must contain XLogRecord header */
+ if (rdata == NULL || rdata->data == NULL || rdata->len < SizeOfXLogRecord)
+ return rdata;
+
+ rechdr = (XLogRecord *) rdata->data;
+ orig_total_len = rechdr->xl_tot_len;
+
+ /* Sanity check before computing the unsigned payload length */
+ if (orig_total_len < SizeOfXLogRecord)
+ return rdata;
+ orig_payload_len = orig_total_len - SizeOfXLogRecord;
+
+ /*
+ * Skip records with no payload (e.g., XLOG_SWITCH). These are header-only
+ * records where adding encryption overhead would break size assertions.
+ */
+ if (orig_payload_len == 0)
+ return rdata;
+
+ new_total_len = orig_total_len + WAL_ENCRYPT_OVERHEAD;
+
+ /*
+ * Save plaintext CRC before we modify anything.
+ * At this point, xl_crc contains the CRC of the payload only
+ * (header CRC is added later by XLogInsertRecord).
+ */
+ plaintext_crc = rechdr->xl_crc;
+
+ /*
+ * Ensure buffer is large enough. test_tde_cxt allows allocation in
+ * critical sections, so this is safe even during WAL insertion.
+ * OOM here will cause PANIC, which is acceptable for critical sections.
+ */
+ ensure_wal_encrypt_buffer(new_total_len);
+
+ /*
+ * Build IV: [plaintext CRC (4B)] [random (12B)]
+ * Store CRC directly in IV[0..3] (little-endian).
+ */
+ iv[0] = ((uint32) plaintext_crc >> 0) & 0xFF;
+ iv[1] = ((uint32) plaintext_crc >> 8) & 0xFF;
+ iv[2] = ((uint32) plaintext_crc >> 16) & 0xFF;
+ iv[3] = ((uint32) plaintext_crc >> 24) & 0xFF;
+
+ /* Generate random bytes for IV[4..15] (12 bytes) for uniqueness */
+ if (!pg_strong_random(iv + WAL_CRC_SIZE, WAL_IV_RANDOM_SIZE))
+ {
+ ereport(WARNING,
+ (errmsg("test_tde: failed to generate random IV for WAL")));
+ return rdata;
+ }
+
+ /*
+ * Build encrypted record in buffer:
+ * [header][block_id][payload_length][IV][encrypted_payload]
+ */
+ bufptr = wal_encrypt_buffer;
+
+ /* 1. Copy header from first rdata node */
+ memcpy(bufptr, rdata->data, SizeOfXLogRecord);
+ bufptr += SizeOfXLogRecord;
+
+ /* 2. Insert transformation header (block ID 251 + payload_length) */
+ new_payload_start = bufptr;
+ *bufptr = (char) XLR_BLOCK_ID_TRANSFORMED;
+ bufptr += sizeof(uint8);
+
+ /* Calculate payload_length: IV + encrypted payload */
+ transform_payload_len = WAL_ENCRYPT_IV_SIZE + orig_payload_len;
+
+ /* Store payload_length (4 bytes, unaligned, little-endian) */
+ bufptr[0] = (char) ((transform_payload_len >> 0) & 0xFF);
+ bufptr[1] = (char) ((transform_payload_len >> 8) & 0xFF);
+ bufptr[2] = (char) ((transform_payload_len >> 16) & 0xFF);
+ bufptr[3] = (char) ((transform_payload_len >> 24) & 0xFF);
+ bufptr += sizeof(uint32);
+
+ /* 3. Insert IV (CRC in first 4 bytes, random in remaining 12) */
+ memcpy(bufptr, iv, WAL_ENCRYPT_IV_SIZE);
+ bufptr += WAL_ENCRYPT_IV_SIZE;
+
+ /* 4. Copy payload to buffer, then encrypt in-place */
+ if (orig_payload_len > 0)
+ {
+ Size first_node_payload;
+ char *encrypt_start = bufptr;
+
+ /* First node: skip header, copy remaining payload */
+ first_node_payload = rdata->len - SizeOfXLogRecord;
+ if (first_node_payload > 0)
+ {
+ memcpy(bufptr, (char *) rdata->data + SizeOfXLogRecord, first_node_payload);
+ bufptr += first_node_payload;
+ }
+
+ /* Remaining nodes: copy all data */
+ for (node = rdata->next; node != NULL; node = node->next)
+ {
+ if (node->len > 0 && node->data != NULL)
+ {
+ memcpy(bufptr, node->data, node->len);
+ bufptr += node->len;
+ }
+ }
+
+ /* Encrypt payload in-place */
+ transform_data((unsigned char *) encrypt_start,
+ (unsigned char *) encrypt_start,
+ orig_payload_len, iv);
+ }
+
+ /* Update header with new total length */
+ rechdr = (XLogRecord *) wal_encrypt_buffer;
+ rechdr->xl_tot_len = new_total_len;
+
+ /*
+ * Recalculate CRC for the new payload (marker + length + IV + encrypted data).
+ * The header CRC will be added by XLogInsertRecord later.
+ */
+ {
+ pg_crc32c crc;
+
+ INIT_CRC32C(crc);
+ COMP_CRC32C(crc, new_payload_start, new_total_len - SizeOfXLogRecord);
+ rechdr->xl_crc = crc;
+ }
+
+ /* Return single XLogRecData pointing to our encrypted buffer */
+ wal_rdata_head.next = NULL;
+ wal_rdata_head.data = wal_encrypt_buffer;
+ wal_rdata_head.len = new_total_len;
+
+ return &wal_rdata_head;
+}
+
+/*
+ * WAL decode pre-hook: decrypt WAL record data
+ *
+ * This reverses the encryption done in xlog_insert_pre_hook.
+ * Checks for block ID 251 marker to identify encrypted records.
+ *
+ * Input: [header] [block_id=251 (1B)] [payload_length (4B)] [IV 16B] [encrypted payload]
+ * Output: [header] [original payload] (shorter by 21 bytes)
+ *
+ * Recovery process:
+ * 1. Check for encryption marker (block ID 251)
+ * 2. Read payload_length from transform header
+ * 3. Extract IV for decryption
+ * 4. Decrypt payload using IV
+ * 5. Extract plaintext payload CRC from IV[0..3]
+ * 6. Restore original record structure
+ *
+ * If the marker is not found, record is not encrypted (pass through).
+ * If inplace_allowed, decrypts in place. Otherwise, copies to static buffer.
+ */
+static XLogRecord *
+test_tde_xlog_decode_pre(XLogReaderState *state, XLogRecord *record,
+ XLogRecPtr lsn, bool inplace_allowed)
+{
+ uint32 total_len;
+ uint32 transform_payload_len;
+ uint32 encrypted_payload_len;
+ unsigned char iv[WAL_ENCRYPT_IV_SIZE];
+ char *payload_start;
+ char *len_ptr;
+ XLogRecord *work_record;
+
+ /* Chain to previous hook if any */
+ if (prev_xlog_decode_pre_hook)
+ record = prev_xlog_decode_pre_hook(state, record, lsn, inplace_allowed);
+
+ if (record == NULL)
+ return record;
+
+ total_len = record->xl_tot_len;
+
+ /* Must have at least header + transform header + IV */
+ if (total_len < SizeOfXLogRecord + WAL_ENCRYPT_OVERHEAD)
+ return record;
+
+ /* Check for transformation marker (block ID 251) */
+ payload_start = (char *) record + SizeOfXLogRecord;
+ if ((unsigned char) *payload_start != XLR_BLOCK_ID_TRANSFORMED)
+ return record; /* Not transformed, pass through */
+
+ /* WAL is encrypted but cipher not initialized - fatal error */
+ if (cipher_ctx == NULL)
+ ereport(PANIC,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: encrypted WAL record found but encryption key not configured"),
+ errdetail("WAL record at LSN %X/%08X has transformation marker.",
+ LSN_FORMAT_ARGS(lsn))));
+
+ /*
+ * If inplace modification allowed, work directly on record. Otherwise,
+ * copy to static buffer (record fits in single page).
+ */
+ if (inplace_allowed)
+ {
+ work_record = record;
+ }
+ else
+ {
+ /* Record within single page, must fit in XLOG_BLCKSZ */
+ if (total_len > XLOG_BLCKSZ)
+ {
+ ereport(WARNING,
+ (errmsg("test_tde: WAL record too large for decryption buffer")));
+ return record;
+ }
+ memcpy(wal_decrypt_buffer, record, total_len);
+ work_record = (XLogRecord *) wal_decrypt_buffer;
+ }
+
+ /* Recalculate payload_start for work_record */
+ payload_start = (char *) work_record + SizeOfXLogRecord;
+
+ /* Read payload_length from transform header (4 bytes, unaligned, little-endian) */
+ len_ptr = payload_start + sizeof(uint8);
+ transform_payload_len = ((uint32) (unsigned char) len_ptr[0] << 0) |
+ ((uint32) (unsigned char) len_ptr[1] << 8) |
+ ((uint32) (unsigned char) len_ptr[2] << 16) |
+ ((uint32) (unsigned char) len_ptr[3] << 24);
+
+ /* Validate payload_length */
+ if (transform_payload_len < WAL_ENCRYPT_IV_SIZE ||
+ transform_payload_len > total_len - SizeOfXLogRecord - SizeOfXLogRecordDataHeaderLong)
+ {
+ ereport(WARNING,
+ (errmsg("test_tde: invalid transform payload length %u at LSN %X/%08X",
+ transform_payload_len, LSN_FORMAT_ARGS(lsn))));
+ return record;
+ }
+
+ /* Extract IV (after transform header) */
+ memcpy(iv, payload_start + SizeOfXLogRecordDataHeaderLong, WAL_ENCRYPT_IV_SIZE);
+
+ /* Encrypted payload length = transform_payload_len - IV */
+ encrypted_payload_len = transform_payload_len - WAL_ENCRYPT_IV_SIZE;
+
+ /*
+ * Move the encrypted payload down over the transform header and IV, then
+ * decrypt it in place. OpenSSL allows in == out but otherwise requires
+ * disjoint buffers, so the overlapping copy is done with memmove() first.
+ */
+ if (encrypted_payload_len > 0)
+ {
+ memmove(payload_start, payload_start + WAL_ENCRYPT_OVERHEAD, encrypted_payload_len);
+ transform_data((unsigned char *) payload_start,
+ (unsigned char *) payload_start, encrypted_payload_len, iv);
+ }
+
+ /* Update header with original length (transform header and IV removed) */
+ work_record->xl_tot_len = SizeOfXLogRecord + encrypted_payload_len;
+
+ /*
+ * Recover plaintext payload CRC from IV[0..3] (little-endian).
+ */
+ {
+ pg_crc32c recovered_payload_crc;
+ pg_crc32c full_crc;
+
+ /* Extract CRC directly from IV[0..3] */
+ recovered_payload_crc = (pg_crc32c) (((uint32) iv[0] << 0) |
+ ((uint32) iv[1] << 8) |
+ ((uint32) iv[2] << 16) |
+ ((uint32) iv[3] << 24));
+
+ /*
+ * For ValidXLogRecord(), we need CRC of: payload + header (up to xl_crc)
+ * The recovered CRC is payload-only, so add header portion.
+ */
+ full_crc = recovered_payload_crc;
+ COMP_CRC32C(full_crc, (char *) work_record, offsetof(XLogRecord, xl_crc));
+ FIN_CRC32C(full_crc);
+ work_record->xl_crc = full_crc;
+ }
+
+ return work_record;
+}
+
+
+/* ----------
+ * GUC callbacks
+ * ----------
+ */
+
+/*
+ * GUC check hook for key
+ */
+static bool
+check_test_tde_key(char **newval, void **extra, GucSource source)
+{
+ if (*newval == NULL || strlen(*newval) == 0)
+ return true;
+
+ if (strlen(*newval) != 64)
+ {
+ GUC_check_errdetail("Key must be exactly 64 hex characters (256 bits).");
+ return false;
+ }
+
+ /* Validate hex characters */
+ for (int i = 0; i < 64; i++)
+ {
+ char c = (*newval)[i];
+
+ if (!((c >= '0' && c <= '9') ||
+ (c >= 'a' && c <= 'f') ||
+ (c >= 'A' && c <= 'F')))
+ {
+ GUC_check_errdetail("Key must contain only hex characters (0-9, a-f, A-F).");
+ return false;
+ }
+ }
+
+ return true;
+}
+
+/* ----------
+ * Module entry points
+ * ----------
+ */
+
+/*
+ * Module initialization
+ */
+void
+_PG_init(void)
+{
+ unsigned char key[32];
+
+ /*
+ * Create memory context for encryption buffers and allow allocation
+ * in critical sections. This is necessary because WAL encryption runs
+ * inside critical sections, and OOM there will cause PANIC anyway.
+ */
+ test_tde_cxt = AllocSetContextCreate(TopMemoryContext,
+ "test_tde",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(test_tde_cxt, true);
+
+ /*
+ * Define GUC for encryption key.
+ *
+ * PGC_POSTMASTER: Key can only be set at server start to prevent
+ * accidental runtime changes.
+ *
+ * WARNING: Once data is encrypted with a key, that same key MUST be used
+ * for the lifetime of the data. Changing the key (even across restarts)
+ * will cause decryption failures and data corruption. This reference
+ * implementation does not support key rotation.
+ */
+ DefineCustomStringVariable("test_tde.key",
+ "Encryption key in hex format (64 characters = 256 bits).",
+ "WARNING: Key must never change once data is encrypted!",
+ &test_tde_key_hex,
+ "",
+ PGC_POSTMASTER,
+ GUC_SUPERUSER_ONLY,
+ check_test_tde_key,
+ NULL,
+ NULL);
+
+ MarkGUCPrefixReserved("test_tde");
+
+ /*
+ * Parse key and initialize cipher context if key is configured.
+ * cipher_ctx remains NULL if no key is set, disabling encryption.
+ */
+ if (test_tde_key_hex != NULL && strlen(test_tde_key_hex) == 64)
+ {
+ if (!parse_hex_key(test_tde_key_hex, key, 32))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("test_tde: failed to parse encryption key")));
+
+ cipher_ctx = EVP_CIPHER_CTX_new();
+ if (!cipher_ctx)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("test_tde: failed to create cipher context")));
+
+ if (EVP_EncryptInit_ex(cipher_ctx, EVP_aes_256_ctr(), NULL, key, NULL) != 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("test_tde: failed to initialize cipher context")));
+
+ /* Clear key from stack */
+ explicit_bzero(key, sizeof(key));
+ }
+
+ /* Install hooks (save previous values for chaining) */
+ prev_mdread_post_hook = mdread_post_hook;
+ mdread_post_hook = test_tde_mdread_post;
+
+ prev_mdwrite_pre_hook = mdwrite_pre_hook;
+ mdwrite_pre_hook = test_tde_mdwrite_pre;
+
+ prev_mdextend_pre_hook = mdextend_pre_hook;
+ mdextend_pre_hook = test_tde_mdextend_pre;
+
+ prev_xlog_insert_pre_hook = xlog_insert_pre_hook;
+ xlog_insert_pre_hook = test_tde_xlog_insert_pre;
+
+ prev_xlog_decode_pre_hook = xlog_decode_pre_hook;
+ xlog_decode_pre_hook = test_tde_xlog_decode_pre;
+
+ ereport(LOG,
+ (errmsg("test_tde: initialized (WARNING: for testing only!)")));
+}
+
+/*
+ * Module finalization
+ */
+void
+_PG_fini(void)
+{
+ /* Restore previous hooks */
+ xlog_decode_pre_hook = prev_xlog_decode_pre_hook;
+ xlog_insert_pre_hook = prev_xlog_insert_pre_hook;
+ mdextend_pre_hook = prev_mdextend_pre_hook;
+ mdwrite_pre_hook = prev_mdwrite_pre_hook;
+ mdread_post_hook = prev_mdread_post_hook;
+
+ /* Free OpenSSL cipher context (also clears key material) */
+ if (cipher_ctx != NULL)
+ {
+ EVP_CIPHER_CTX_free(cipher_ctx);
+ cipher_ctx = NULL;
+ }
+
+ /*
+ * Delete memory context - this frees all buffers allocated from it
+ * (encrypt_buffer, encrypt_buffer_ptrs, wal_encrypt_buffer).
+ */
+ if (test_tde_cxt != NULL)
+ {
+ MemoryContextDelete(test_tde_cxt);
+ test_tde_cxt = NULL;
+ }
+
+ /* Reset buffer pointers */
+ encrypt_buffer = NULL;
+ encrypt_buffer_ptrs = NULL;
+ encrypt_buffer_nblocks = 0;
+ wal_encrypt_buffer = NULL;
+ wal_encrypt_buffer_size = 0;
+}
diff --git a/contrib/test_tde/test_tde.conf b/contrib/test_tde/test_tde.conf
new file mode 100644
index 00000000000..0b00366474c
--- /dev/null
+++ b/contrib/test_tde/test_tde.conf
@@ -0,0 +1,2 @@
+shared_preload_libraries = 'test_tde'
+test_tde.key = '0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef'
--
2.50.1 (Apple Git-155)
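The patch calls parse_hex_key() in _PG_init(), but its body is outside this excerpt. As an illustration only, here is a stand-alone sketch of such a helper. It assumes the (hex string, output buffer, byte count) signature implied by the call site. The hex_nibble() helper and the use of uint8_t in place of PostgreSQL's uint8 are assumptions, not the patch's code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: value of one hex digit, or -1 if not a hex digit. */
static int
hex_nibble(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Hypothetical stand-alone parse_hex_key(): decode exactly 2 * len hex
 * digits into "out". Returns false on a wrong length or any non-hex
 * character, leaving error reporting to the caller.
 */
static bool
parse_hex_key(const char *hex, uint8_t *out, size_t len)
{
	if (hex == NULL || strlen(hex) != 2 * len)
		return false;

	for (size_t i = 0; i < len; i++)
	{
		int			hi = hex_nibble(hex[2 * i]);
		int			lo = hex_nibble(hex[2 * i + 1]);

		if (hi < 0 || lo < 0)
			return false;
		out[i] = (uint8_t) ((hi << 4) | lo);
	}
	return true;
}
```

With this shape, parse_hex_key(test_tde_key_hex, key, 32) accepts the 64-digit key from test_tde.conf. Note that _PG_init() only attempts parsing when strlen() is exactly 64. Any other length leaves cipher_ctx NULL and silently disables encryption, unless the check_test_tde_key hook (not shown) rejects the value first.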
v20251229-v4-0003-Fix-test_tde-tablespace-test-for-cross-platform-compatibility.patch
From 800bcca5ac31ada872416282acbb523ca43f464b Mon Sep 17 00:00:00 2001
From: Henson Choi <assam258@gmail.com>
Date: Mon, 29 Dec 2025 12:28:39 +0900
Subject: [PATCH v4 3/3] Fix test_tde tablespace test for cross-platform
compatibility
The test was failing on BSD and Windows due to platform-specific issues:
On BSD, tablespace names must start with the 'regress_' prefix to comply
with the regression test naming convention.
On Windows, the shell command 'mkdir -p' has different syntax and the
/tmp directory path is not standard.
Fix by using allow_in_place_tablespaces, a developer-only option that
allows creating tablespaces with an empty LOCATION string. This creates
the tablespace directly in pg_tblspc without requiring external
directories or shell commands, making the test work consistently across
all platforms.
---
contrib/test_tde/expected/basic.out | 9 ++++-----
contrib/test_tde/sql/basic.sql | 9 ++++-----
2 files changed, 8 insertions(+), 10 deletions(-)
diff --git a/contrib/test_tde/expected/basic.out b/contrib/test_tde/expected/basic.out
index 9932cf43614..d5bcc732d40 100644
--- a/contrib/test_tde/expected/basic.out
+++ b/contrib/test_tde/expected/basic.out
@@ -158,13 +158,13 @@ DROP TABLE test_reindex;
-- Test 5: ALTER TABLE SET TABLESPACE
-- RelFileNumber changes, but data is copied through storage hooks
-- -----------------------------------------------------------------------------
-\! mkdir -p /tmp/test_tde_tablespace
-CREATE TABLESPACE test_tde_tblspc LOCATION '/tmp/test_tde_tablespace';
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_tde_tblspc LOCATION '';
CREATE TABLE test_set_tablespace (id int, data text);
INSERT INTO test_set_tablespace SELECT g, 'data ' || g FROM generate_series(1, 50) g;
CHECKPOINT;
-- Move to different tablespace - data copied through storage hooks
-ALTER TABLE test_set_tablespace SET TABLESPACE test_tde_tblspc;
+ALTER TABLE test_set_tablespace SET TABLESPACE regress_tde_tblspc;
-- Works fine - data was re-encrypted with new RelFileNumber
SELECT COUNT(*) FROM test_set_tablespace;
count
@@ -173,5 +173,4 @@ SELECT COUNT(*) FROM test_set_tablespace;
(1 row)
DROP TABLE test_set_tablespace;
-DROP TABLESPACE test_tde_tblspc;
-\! rm -rf /tmp/test_tde_tablespace
+DROP TABLESPACE regress_tde_tblspc;
diff --git a/contrib/test_tde/sql/basic.sql b/contrib/test_tde/sql/basic.sql
index 9b2651afee8..e52fdea7a5b 100644
--- a/contrib/test_tde/sql/basic.sql
+++ b/contrib/test_tde/sql/basic.sql
@@ -128,19 +128,18 @@ DROP TABLE test_reindex;
-- Test 5: ALTER TABLE SET TABLESPACE
-- RelFileNumber changes, but data is copied through storage hooks
-- -----------------------------------------------------------------------------
-\! mkdir -p /tmp/test_tde_tablespace
-CREATE TABLESPACE test_tde_tblspc LOCATION '/tmp/test_tde_tablespace';
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_tde_tblspc LOCATION '';
CREATE TABLE test_set_tablespace (id int, data text);
INSERT INTO test_set_tablespace SELECT g, 'data ' || g FROM generate_series(1, 50) g;
CHECKPOINT;
-- Move to different tablespace - data copied through storage hooks
-ALTER TABLE test_set_tablespace SET TABLESPACE test_tde_tblspc;
+ALTER TABLE test_set_tablespace SET TABLESPACE regress_tde_tblspc;
-- Works fine - data was re-encrypted with new RelFileNumber
SELECT COUNT(*) FROM test_set_tablespace;
DROP TABLE test_set_tablespace;
-DROP TABLESPACE test_tde_tblspc;
-\! rm -rf /tmp/test_tde_tablespace
+DROP TABLESPACE regress_tde_tblspc;
--
2.50.1 (Apple Git-155)
The content of some WAL records can be predicted almost completely (it
contains no user data, just some Postgres internal data which can be
easily reconstructed). I wonder whether this fact can significantly
simplify the task of breaking the cipher.
AES is designed to resist known-plaintext attacks, so this isn't an issue
as long as the code never reuses the same IV with the same key. The example
code uses a random IV for each WAL record, so reuse is unlikely.
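test_tde initializes EVP_aes_256_ctr(), a counter (CTR) mode cipher in which ciphertext = plaintext XOR keystream(key, IV). The sketch below shows why a predictable WAL record is harmless under a fresh IV but exposes other records when an IV is reused with the same key. It swaps AES for a toy 64-bit mixer. The mixer is not a cipher; it is a stand-in chosen only so the example is self-contained:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy 64-bit mixer (the splitmix64 finalizer) standing in for AES_k(counter).
 * It is NOT a cipher; it only gives each (key, counter) pair a distinct
 * pseudo-random block so the CTR structure can be demonstrated.
 */
static uint64_t
toy_block(uint64_t key, uint64_t counter)
{
	uint64_t	z = key ^ (counter + 0x9E3779B97F4A7C15ULL);

	z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
	z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
	return z ^ (z >> 31);
}

/*
 * CTR-style transform: out = in XOR keystream(key, iv), where byte i uses
 * block (iv + i / 8). The same call both encrypts and decrypts.
 */
static void
toy_ctr_xor(uint64_t key, uint64_t iv, const uint8_t *in, uint8_t *out,
			size_t len)
{
	for (size_t i = 0; i < len; i++)
	{
		uint64_t	block = toy_block(key, iv + i / 8);

		out[i] = in[i] ^ (uint8_t) (block >> (8 * (i % 8)));
	}
}
```

If records c1 and c2 share a key and IV, c1 XOR c2 equals p1 XOR p2. Knowing the predictable p1 then yields p2 directly. With random IVs of full block size (128 bits for AES), two records sharing a keystream is vanishingly unlikely at realistic WAL volumes. A deterministic IV scheme must instead guarantee uniqueness itself, including non-overlapping counter ranges.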
This is quite a nice solution for keeping WAL encryption as parallel
as possible. The downsides are that it increases the size of WAL a bit,
relies on MemoryContextAllowInCriticalSection, and is definitely slower
during recovery than full-page decryption.
On the other hand, per-page WAL encryption can cause performance
issues with some workloads that write huge amounts of WAL with many
parallel clients. Both have pros and cons.
One thing that seems tricky is WAL key rotation. The example code
ignores this, which is fine for a demo, but real extensions should be
able to handle it. We can't simply write a WAL record about changing
the WAL key, because without holding the write lock, things could get
written out of order. The only safe solution I see is to also add the
ID of the WAL key to the additional WAL record data, increasing the
record size even more.
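The suggestion of carrying the WAL key's ID in the per-record data can be sketched as a small prefix in front of each encrypted record. Everything here is hypothetical: the layout (a 4-byte little-endian key ID followed by a 16-byte IV), the TdeWalPrefix and TdeKeyRing names, and the fixed-size key ring:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TDE_IV_LEN 16			/* hypothetical: one AES block */
#define TDE_WAL_PREFIX_LEN (sizeof(uint32_t) + TDE_IV_LEN)

/* Hypothetical per-record prefix stored in front of the encrypted payload. */
typedef struct TdeWalPrefix
{
	uint32_t	key_id;			/* which WAL key encrypted this record */
	uint8_t		iv[TDE_IV_LEN];	/* per-record random IV */
} TdeWalPrefix;

/* Serialize explicitly: no struct padding, fixed little-endian key_id. */
static size_t
tde_wal_prefix_write(const TdeWalPrefix *p, uint8_t *buf)
{
	for (int i = 0; i < 4; i++)
		buf[i] = (uint8_t) (p->key_id >> (8 * i));
	memcpy(buf + 4, p->iv, TDE_IV_LEN);
	return TDE_WAL_PREFIX_LEN;
}

static size_t
tde_wal_prefix_read(const uint8_t *buf, TdeWalPrefix *p)
{
	p->key_id = 0;
	for (int i = 0; i < 4; i++)
		p->key_id |= (uint32_t) buf[i] << (8 * i);
	memcpy(p->iv, buf + 4, TDE_IV_LEN);
	return TDE_WAL_PREFIX_LEN;
}

/* Hypothetical key ring: the decoder picks the key the record names. */
typedef struct TdeKeyRing
{
	uint32_t	ids[8];
	const uint8_t *keys[8];
	int			n;
} TdeKeyRing;

static const uint8_t *
tde_key_lookup(const TdeKeyRing *ring, uint32_t key_id)
{
	for (int i = 0; i < ring->n; i++)
		if (ring->ids[i] == key_id)
			return ring->keys[i];
	return NULL;				/* unknown key: caller must error out, not guess */
}
```

Because each record names its own key, the decoder never depends on where a separate key-change record landed in the WAL stream, which is the ordering problem described above. In this layout the cost is 4 bytes per record beyond the IV.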
The main difference is timing and current availability:
- The hook approach is working today and can be used immediately.
- Your SMGR extensibility work provides a more comprehensive
long-term solution
I disagree with this. The SMGR patch has been available as a patch since
2023/PG16, and it is already used by at least 3 companies I know of (Neon,
Nile, Percona), and probably also by others I don't know of. It is
available immediately.
Compared to that, this proposal is something new, and more limited.
The actual advantage of this proposal is that it includes WAL, but I
still think the two should be separate discussions.
Regarding what to protect (WAL vs heap vs both), there's flexibility depending on the organization and jurisdiction. The hook approach allows extensions to choose - you can implement only the buffer hooks if that satisfies your requirements, or add WAL hooks if needed.
My concern is that these are two separate discussions about two
extensibility points, with different concerns raised by different people.
One part shouldn't stall the other; for some, even getting half of it
into core for PG19 would be useful.
You're absolutely right that extension developers need to understand multiprocess architecture, memory management, critical sections, and so on.
This is precisely why test_tde exists as a reference implementation.
The reference implementation ignores the tricky steps, like key
rotation, caching, configuration, providing a user interface, etc.,
which all require knowledge of Postgres internals.
ARIA and SEED are already implemented in OpenSSL. However, Korean law requires certified implementations. Specifically, companies must use nationally-certified builds and provide the hash codes of those specific library binaries to regulators. You cannot simply use the OpenSSL version, even if the algorithm is identical.
That could still be solved by introducing an abstraction layer in the
encryption code of a TDE extension :) Encryption is only a small part
of an extension; the other parts (user interface, rotation, key
storage integrations, etc.) are a much bigger part. It is still
questionable to reimplement everything because of an encryption
library difference. But I see your point; that is a bit more
difficult.
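One way to read the abstraction-layer suggestion: the extension's key management, rotation and UI code call the cipher only through a small function table. A nationally certified ARIA or SEED library could then be bound in place of OpenSSL without touching the rest of the extension. The CipherProvider struct, its CTR-style xcrypt signature, and the provider names are all hypothetical, and the registered xor_placeholder entry is NOT encryption:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PROVIDER_KEY_LEN 32
#define PROVIDER_IV_LEN  16

/*
 * Hypothetical cipher-provider table: the rest of a TDE extension calls the
 * cipher only through this struct. xcrypt is CTR-style, so one call both
 * encrypts and decrypts; it returns 0 on success.
 */
typedef struct CipherProvider
{
	const char *name;
	int			(*xcrypt) (const uint8_t *key, const uint8_t *iv,
						   const uint8_t *in, uint8_t *out, size_t len);
} CipherProvider;

/* Placeholder provider: XOR with key and IV bytes. NOT encryption. */
static int
xor_placeholder(const uint8_t *key, const uint8_t *iv,
				const uint8_t *in, uint8_t *out, size_t len)
{
	for (size_t i = 0; i < len; i++)
		out[i] = in[i] ^ key[i % PROVIDER_KEY_LEN] ^ iv[i % PROVIDER_IV_LEN];
	return 0;
}

static const CipherProvider providers[] = {
	{"placeholder-xor", xor_placeholder},
	/* {"certified-aria-256", vendor_aria_ctr},  hypothetical vendor binding */
};

/* Select a provider by name (e.g. from a setting); NULL if unknown. */
static const CipherProvider *
cipher_provider_lookup(const char *name)
{
	for (size_t i = 0; i < sizeof(providers) / sizeof(providers[0]); i++)
		if (strcmp(providers[i].name, name) == 0)
			return &providers[i];
	return NULL;
}
```

A certified build would add one table entry wrapping the vendor's API. The indirection does not remove the critical-section requirement discussed in this thread. A library that allocates memory or can throw errors inside xcrypt would still be unsafe on WAL-insertion paths.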
That's a reasonable approach for SMGR-based solutions where you control the storage layer. However, with the hook approach, we don't have the ability to inject custom WAL records for encryption events.
Currently, in a replication environment, the reference implementation requires the same key to be configured in the settings on both primary and replicas (shared key model). For future KMS integration, I'm considering mechanisms to propagate keys to replicas through external channels rather than WAL.
I originally wrote a long answer about how I don't think this is
related to where the hooks are, and then I realized that the problem
is probably completely different - and this also shows why adding a
few bits to the pages is not a good generic solution for all
extensions.
Our extension uses a two-level key architecture, as used by most
database servers (there's a master key, and it encrypts separate
internal keys, one for each database file). The proposed sample code
in your patch uses a single key, with the IV encoding the database
file. That means you want to encode which key is used for each page
instead of for each file.
So we map data/pages to keys completely differently.
But I don't think the page header addition is a good solution, because
it is specific to your implementation, not to encryption solutions in
general.
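The two-level architecture described above (a master key that encrypts one internal key per database file) can be sketched as a map of wrapped per-file keys. The FileKeyEntry type and its functions are hypothetical. The XOR wrap is a placeholder for a real key-wrap algorithm such as AES Key Wrap (RFC 3394):

```c
#include <stdint.h>
#include <string.h>

#define KEY_LEN 32

/* Hypothetical entry in a per-file key map (persisted in wrapped form). */
typedef struct FileKeyEntry
{
	uint32_t	relfilenumber;	/* relation file the internal key belongs to */
	uint8_t		wrapped[KEY_LEN];	/* internal key, encrypted by the master key */
} FileKeyEntry;

/* Placeholder wrap: XOR with the master key. Shows data flow only, NOT secure. */
static void
wrap_key(const uint8_t *master, const uint8_t *in, uint8_t *out)
{
	for (int i = 0; i < KEY_LEN; i++)
		out[i] = in[i] ^ master[i];
}

/* XOR is its own inverse, so unwrapping is the same operation here. */
static void
unwrap_key(const uint8_t *master, const uint8_t *in, uint8_t *out)
{
	wrap_key(master, in, out);
}

/* Master-key rotation: re-wrap each small internal key; data files untouched. */
static void
rotate_master(FileKeyEntry *map, int n,
			  const uint8_t *old_master, const uint8_t *new_master)
{
	uint8_t		plain[KEY_LEN];

	for (int i = 0; i < n; i++)
	{
		unwrap_key(old_master, map[i].wrapped, plain);
		wrap_key(new_master, plain, map[i].wrapped);
	}
	/* best-effort clear; production code would use explicit_bzero() */
	memset(plain, 0, sizeof(plain));
}
```

The design choice this illustrates is that rotating the master key only re-wraps the small per-file keys. The data files encrypted with those internal keys are never re-encrypted.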
(Also, I just noticed that you forgot about timelineid in derive_iv,
you probably want to include that somehow)
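derive_iv() is not part of this excerpt. Assume it builds a CTR IV from a WAL position such as an LSN. The timeline then matters because, after a promotion or point-in-time recovery, the new timeline can write different WAL at LSNs the old timeline already used, so an LSN-only IV could repeat under the same key. Below is a hypothetical derivation that mixes in the timeline ID; the byte layout is an assumption:

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t TimeLineID;	/* same width as PostgreSQL's TimeLineID */
typedef uint64_t XLogRecPtr;	/* same width as PostgreSQL's XLogRecPtr */

/*
 * Hypothetical IV derivation for a WAL page:
 *   bytes 0-3   timeline ID (big-endian)
 *   bytes 4-11  page LSN (big-endian)
 *   bytes 12-15 zero, left as room for the CTR block counter
 */
static void
derive_wal_iv(TimeLineID tli, XLogRecPtr page_lsn, uint8_t iv[16])
{
	memset(iv, 0, 16);
	for (int i = 0; i < 4; i++)
		iv[i] = (uint8_t) (tli >> (24 - 8 * i));
	for (int i = 0; i < 8; i++)
		iv[4 + i] = (uint8_t) (page_lsn >> (56 - 8 * i));
}
```

OpenSSL's CTR mode increments the whole IV as a big-endian 128-bit counter. Leaving the low four bytes at zero therefore gives 2^32 AES blocks (64 GiB) per IV before the counter carries into the LSN bytes, far more than one page needs.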
Hi Zsolt,
Thank you for the detailed feedback.
## SMGR Patch
You're right - I shouldn't argue about SMGR without actually reviewing
the patch. Let me step back from the SMGR discussion for now and focus
this proposal on WAL hooks only.
Could you point me to where I can access the SMGR extensibility patch?
I'd like to review it properly before any further discussion.
For SMGR, I'm also thinking about a different approach that could cover
Bootstrap and Frontend processes as well - but that's a separate
conversation after I understand the current SMGR proposal better.
## Reference Implementation Scope
As mentioned in my earlier messages, the reference implementation
(test_tde) intentionally doesn't cover key rotation and other
production concerns. Its purpose is to demonstrate the
hook API, not to be a production-ready TDE solution.
## Encryption Library Abstraction
I agree in principle that an abstraction layer would be ideal.
Personally, I prefer developing with OpenSSL and getting an OpenSSL
Provider certified at the company level.
However, our CTO (who comes from a cryptography company background)
insists on using their long-maintained proprietary encryption library.
It's a complex C++ implementation that cannot follow critical section
constraints at all. :)
## 2-Level Key Architecture
Our production implementation also uses a 2-level key architecture
(master key → separate smgr/wal keys). The reference implementation
uses a simplified single-key approach just for demonstration purposes.
I've been considering further key granularity (e.g., per-tablespace
or per-database keys), but there are unresolved challenges:
- Key distribution to replicas
- Some DDL operations that complete by simple file copying
Until these are solved, we're keeping the smgr key at a coarser
granularity.
I'm also exploring TPM integration for auto-login master key
protection. How does pg_tde handle master key storage and auto-login
scenarios?
## Timeline ID in IV
Good catch - I hadn't considered that. Including timeline ID would
make the IV more robust. Thank you for sharing this insight.
Best regards,
Henson
On Mon, Dec 29, 2025 at 4:37 PM, Zsolt Parragi <zsolt.parragi@percona.com> wrote:
Please don't top-post. We generally prefer to reply inline, which makes
it easier to follow the discussion. With top-posting I have to search for
what you are responding to.
On 12/29/25 03:35, Henson Choi wrote:
Subject: Re: RFC: PostgreSQL Storage I/O Transformation Hooks
Hi Tomas,
Thank you for this critical feedback. Your concerns go to the heart of
the proposal's viability, and I appreciate your directness.
1. Multiple Extensions and Hook Chaining
You're right to question this. To be honest, I have significant doubts
about allowing multiple transformation extensions simultaneously.
The Transform ID coordination problem is real: without a registry or
protocol between extensions, they cannot cooperate safely. Hook chaining
for read/write operations might work (extension A encrypts, extension B
compresses), but the Transform ID field creates conflicts.
Perhaps I should be more direct: transformation hook chaining is not
realistically possible with the current design. TDE extensions would
need exclusive use of these hooks. This is a fundamental limitation I
should have stated clearly in the RFC.
Isn't that just another argument against using hooks? Chaining is what
hooks do, and there's no protection against a hook being set by multiple
extensions.
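The chaining point can be seen with the save-and-call pattern that test_tde's _PG_init() sets up (prev_* variables saved before installing its own hooks). This stand-alone sketch uses a hypothetical page_write_hook and two hypothetical extensions with placeholder transforms. Nothing stops the second extension from installing itself. Both transforms run, and the bytes that reach disk depend on load order:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook: called on a page just before it is written. */
typedef void (*page_write_hook_type) (uint8_t *page, size_t len);

static page_write_hook_type page_write_hook = NULL;

/* Extension A: placeholder "encryption" (XOR 0x5A), then chain. */
static page_write_hook_type prev_hook_a = NULL;

static void
ext_a_hook(uint8_t *page, size_t len)
{
	for (size_t i = 0; i < len; i++)
		page[i] ^= 0x5A;
	if (prev_hook_a)
		prev_hook_a(page, len);
}

/* Extension B: a second placeholder transform, chaining the same way. */
static page_write_hook_type prev_hook_b = NULL;

static void
ext_b_hook(uint8_t *page, size_t len)
{
	for (size_t i = 0; i < len; i++)
		page[i] = (uint8_t) (page[i] + 1);
	if (prev_hook_b)
		prev_hook_b(page, len);
}

/* What each module's _PG_init() would do: save the old value, install its own. */
static void
load_ext_a(void)
{
	prev_hook_a = page_write_hook;
	page_write_hook = ext_a_hook;
}

static void
load_ext_b(void)
{
	prev_hook_b = page_write_hook;
	page_write_hook = ext_b_hook;
}
```

The last module loaded runs first. Because shared_preload_libraries loads modules in list order, reordering that setting would change the on-disk bytes, and neither extension can detect the other.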
2. pd_flags Reservation - I Hope You'll Consider This
I understand your concern about reserving pd_flags bits for extensions.
However, I'd like to ask you to consider the reasoning behind this choice.
The 5-bit Transform ID serves a critical purpose: it allows the core to
identify the page's transformation state without attempting decryption.
This is important for:
- Error reporting: "This page is encrypted with transform ID 5, but no
extension is loaded to handle it"
- Migration safety: Distinguishing between untransformed pages (ID=0)
and transformed pages during gradual encryption
- Crash recovery: The core can detect transformation state inconsistencies
That said, I recognize pd_flags is precious and limited. Let me propose
an alternative approach that might better align with core principles:
The information may be crucial, but pd_flags is simply not meant to be
used by extensions to store custom data.
Instead of extension-specific Transform IDs, what if we allow extensions
to reserve space at pd_upper (similar to how special space works at
pd_special)?
The core could manage a small flag (2-3 bits) indicating "N bytes at
pd_upper are reserved for transformation metadata". By encoding N as
multiples of 2 or 4 bytes, we maximize the flag's efficiency:
- 2 bits encoding 4-byte multiples: 0-12 bytes (sufficient for most cases)
- 3 bits encoding 4-byte multiples: 0-28 bytes (covers all reasonable needs)
- 3 bits encoding 2-byte multiples: 0-14 bytes (finer granularity)
This approach uses minimal pd_flags bits while providing substantial
metadata space. It would:
- Keep the flag in core control (not extension-specific)
- Allow extensions to store IV, authentication tags, key version, etc.
in a standardized location
- Be self-describing (the flag tells you how much space is reserved)
- Generalize beyond encryption (compression, checksums, etc. could use it)
In our internal implementation, we actually add opaque bytes to
PageHeader for encryption metadata. This pd_upper approach could
formalize that pattern for extensions.
I believe some form of page-level metadata for transformations is
necessary. Would either approach (Transform ID or pd_upper reservation)
be acceptable with the right design, or do you see fundamental issues
with page-level transformation metadata itself?
AFAICS this is pretty much exactly what this patch aimed to do (also to
allow implementing TDE):
https://commitfest.postgresql.org/patch/3986/
Clearly, it's not as simple as it may seem, otherwise the patch would
not be WIP for 3 years.
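Here is the bit arithmetic of the pd_upper reservation proposal made concrete. A field of b bits counting u-byte units covers 0 to (2^b - 1) * u bytes, which gives 0-12 bytes for 2 bits of 4-byte units and 0-28 bytes for 3 bits. The sketch assumes bits 3-5 of pd_flags, since bits 0-2 are PD_HAS_FREE_LINES, PD_PAGE_FULL and PD_ALL_VISIBLE. The PD_XFORM_* names are hypothetical. A real change would also have to widen PD_VALID_FLAG_BITS, because PageIsVerified() treats unknown flag bits as corruption. This only illustrates the proposal's arithmetic:

```c
#include <stdint.h>

/* Hypothetical 3-bit field in pd_flags counting 4-byte units (0-28 bytes). */
#define PD_XFORM_SHIFT	3
#define PD_XFORM_BITS	3
#define PD_XFORM_UNIT	4
#define PD_XFORM_MASK	(((1u << PD_XFORM_BITS) - 1) << PD_XFORM_SHIFT)	/* 0x0038 */
#define PD_XFORM_MAX	(((1u << PD_XFORM_BITS) - 1) * PD_XFORM_UNIT)	/* 28 */

/* Return flags with the reserved size set, or -1 if nbytes is not representable. */
static int32_t
pd_set_xform_bytes(uint16_t flags, unsigned nbytes)
{
	if (nbytes % PD_XFORM_UNIT != 0 || nbytes > PD_XFORM_MAX)
		return -1;
	flags &= (uint16_t) ~PD_XFORM_MASK;	/* keep the existing flag bits */
	flags |= (uint16_t) ((nbytes / PD_XFORM_UNIT) << PD_XFORM_SHIFT);
	return flags;
}

/* Number of bytes reserved at pd_upper according to the flag field. */
static unsigned
pd_get_xform_bytes(uint16_t flags)
{
	return ((flags & PD_XFORM_MASK) >> PD_XFORM_SHIFT) * PD_XFORM_UNIT;
}
```

For example, reserving 12 bytes on a page that already has PD_ALL_VISIBLE (0x0004) set yields flags 0x001C, and the original bit is preserved.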
3. Maintenance Burden and Test Coverage
I deeply appreciate this concern. Having worked across various DBMS
implementations, I've seen solution vendors ship without comprehensive
regression testing - but never a database vendor. DBMS maintenance is
extraordinarily difficult, and storage errors are catastrophic.
This is precisely why test_tde exists as a reference implementation. But
you've identified the real issue: we need much stronger test coverage
for the hooks themselves.
The test cases should:
- Detect when core changes break hook contracts
- Verify hook behavior under all I/O paths (sync, async, error cases)
- Validate critical section safety
- Test interaction with checksums, crash recovery, replication
I agree the current test coverage is insufficient for core inclusion.
Would expanding the test suite to cover these scenarios address your
maintenance concerns, or do you see fundamental fragility beyond what
testing can solve?
I wasn't talking about test coverage. My point is we'd have to keep this
working forever, even if we choose to change how the SMGR works. Which
is not entirely theoretical.
4. Hooks vs Transform Layer - Pragmatic Timeline
You suggested improving SMGR extensibility rather than adding hooks. I
think you're architecturally right about the long-term direction.
However, I want to be pragmatic about timelines:
The hook and pd_flags approach, despite its limitations, can deliver
working TDE in the shortest time. Organizations facing regulatory
deadlines need something that works now, not in 2-3 years.
Others may see it differently, but in my opinion using pd_flags is a
dead end.
I realize users may wish for a solution "soon", but we're not going to
accept a flawed approach because of that. Exchanging short-term benefit
for long-term pain does not seem like a good trade-off.
That said, your feedback has sparked a better idea: what if we think of
this not as "SMGR extension" or "hooks" but as a pluggable Transform
Layer that SMGR and WAL subsystems delegate to?
Conceptually:
   Application Layer
           |
    Buffer Manager
           |
  +------------------+
  | Transform Layer  |  <-- Encryption, etc.
  +------------------+
           |
      SMGR / WAL
           |
       File I/O
This is architecturally cleaner than scattered hooks, and more focused
than full SMGR extensibility. The Transform Layer would:
- Provide a unified interface for data transformation
- Work across backend, frontend tools, and replication
- Handle metadata management in a standardized way
- Support encryption, compression, or other transformations
I think this deserves its own discussion thread rather than conflating
it with the current hook proposal. Would you be interested in starting a
separate conversation about designing a Transform Layer interface for
PostgreSQL?
Maybe. But I'm not convinced it'd be great to have many parallel threads
discussing approaches for the same ultimate end goal.
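Here is one hypothetical shape for the Transform Layer outlined above. SMGR and WAL code would call through a single routine table instead of separate hooks, and registration is exclusive, which sidesteps the silent chaining problem raised earlier in the thread. All names and signatures are illustrative assumptions, not an existing or proposed PostgreSQL API:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical page identity (fields modeled loosely on RelFileLocator). */
typedef struct XformPageLocator
{
	uint32_t	spc_oid;		/* tablespace */
	uint32_t	db_oid;			/* database */
	uint32_t	rel_number;		/* relation file number */
	uint32_t	blkno;			/* block within the file */
} XformPageLocator;

/* Hypothetical routine table that SMGR and WAL code would delegate to. */
typedef struct TransformRoutine
{
	const char *name;
	/* data pages: page_out before write, page_in after read; false = failure */
	bool		(*page_out) (const XformPageLocator *loc, const uint8_t *in,
							 uint8_t *out, size_t len);
	bool		(*page_in) (const XformPageLocator *loc, const uint8_t *in,
							uint8_t *out, size_t len);
	/* WAL: position passed as a 64-bit LSN */
	bool		(*wal_out) (uint64_t lsn, const uint8_t *in, uint8_t *out,
							size_t len);
	bool		(*wal_in) (uint64_t lsn, const uint8_t *in, uint8_t *out,
						   size_t len);
} TransformRoutine;

static const TransformRoutine *active_transform = NULL;

/* Exclusive registration: unlike a chained hook, a second transform is refused. */
static bool
register_transform(const TransformRoutine *routine)
{
	if (active_transform != NULL)
		return false;
	active_transform = routine;
	return true;
}
```

Whether such a layer would live inside the existing SMGR architecture or beside it remains the open question in this discussion.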
In the meantime, the hook approach could serve organizations with
immediate needs, and extensions could migrate to the Transform Layer
once it's stabilized.
It's not like there are no alternatives, though. We have FDE/LUKS,
application-level encryption, etc. Now there's also pg_tde.
FWIW the hypothetical migration would be far from trivial.
5. Frontend Tool Access
Both SMGR and hook approaches face a shared limitation: frontend tools
(pg_checksums, pg_basebackup, etc.) that read files directly.
I'm not a TDE expert, but I don't see why tools like pg_basebackup would
need to be aware of this at all. A basebackup is just a filesystem copy.
I previously suggested allowing initdb to specify a shared library that
both backend and frontend can load for transformation. But as I
reconsider this, it feels like it converges toward the Transform Layer
idea: a well-defined interface that any PostgreSQL component can use.
This might be the real architectural question: not "hooks vs SMGR" but
"how should PostgreSQL provide transformation points that work across
backend, frontend, and replication boundaries?"
Maybe. I was not proposing a new "transformation" layer, though. My
suggestion was entirely within the current SMGR architecture.
regards
--
Tomas Vondra
On Tue, Dec 30, 2025 at 10:19 AM, Tomas Vondra <tomas@vondra.me> wrote:
Please don't top-post. We generally prefer to reply inline, which makes
it easier to follow the discussion. With top-posting I have to search for
what you are responding to.
Apologies for the formatting error. I'll reply inline from now on.
On 12/29/25 03:35, Henson Choi wrote:
Subject: Re: RFC: PostgreSQL Storage I/O Transformation Hooks
Hi Tomas,
Thank you for this critical feedback. Your concerns go to the heart of
the proposal's viability, and I appreciate your directness.
1. Multiple Extensions and Hook Chaining
You're right to question this. To be honest, I have significant doubts
about allowing multiple transformation extensions simultaneously.
The Transform ID coordination problem is real: without a registry or
protocol between extensions, they cannot cooperate safely. Hook chaining
for read/write operations might work (extension A encrypts, extension B
compresses), but the Transform ID field creates conflicts.
Perhaps I should be more direct: transformation hook chaining is not
realistically possible with the current design. TDE extensions would
need exclusive use of these hooks. This is a fundamental limitation I
should have stated clearly in the RFC.
Isn't that just another argument against using hooks? Chaining is what
hooks do, and there's no protection against a hook being set by multiple
extensions.
You're absolutely right. As I mentioned in my reply to Zsolt, I'm stepping
back from the hook approach to study the SMGR extensibility work first.
The chaining limitation you pointed out is fundamental - if TDE requires
exclusive access, then hooks are the wrong mechanism. I should have
reviewed existing SMGR extensibility efforts before proposing hooks.
2. pd_flags Reservation - I Hope You'll Consider This
I understand your concern about reserving pd_flags bits for extensions.
However, I'd like to ask you to consider the reasoning behind this choice.
The 5-bit Transform ID serves a critical purpose: it allows the core to
identify the page's transformation state without attempting decryption.
This is important for:
- Error reporting: "This page is encrypted with transform ID 5, but no
extension is loaded to handle it"
- Migration safety: Distinguishing between untransformed pages (ID=0)
and transformed pages during gradual encryption
- Crash recovery: The core can detect transformation state inconsistencies
That said, I recognize pd_flags is precious and limited. Let me propose
an alternative approach that might better align with core principles:
The information may be crucial, but pd_flags is simply not meant to be
used by extensions to store custom data.
Understood. I see now why this is a non-starter.
Instead of extension-specific Transform IDs, what if we allow extensions
to reserve space at pd_upper (similar to how special space works at
pd_special)?
The core could manage a small flag (2-3 bits) indicating "N bytes at
pd_upper are reserved for transformation metadata". By encoding N as
multiples of 2 or 4 bytes, we maximize the flag's efficiency:
- 2 bits encoding 4-byte multiples: 0-12 bytes (sufficient for most cases)
- 3 bits encoding 4-byte multiples: 0-28 bytes (covers all reasonable needs)
- 3 bits encoding 2-byte multiples: 0-14 bytes (finer granularity)
This approach uses minimal pd_flags bits while providing substantial
metadata space. It would:
- Keep the flag in core control (not extension-specific)
- Allow extensions to store IV, authentication tags, key version, etc.
in a standardized location
- Be self-describing (the flag tells you how much space is reserved)
- Generalize beyond encryption (compression, checksums, etc. could use it)
In our internal implementation, we actually add opaque bytes to
PageHeader for encryption metadata. This pd_upper approach could
formalize that pattern for extensions.
I believe some form of page-level metadata for transformations is
necessary. Would either approach (Transform ID or pd_upper reservation)
be acceptable with the right design, or do you see fundamental issues
with page-level transformation metadata itself?
AFAICS this is pretty much exactly what this patch aimed to do (also to
allow implementing TDE):
https://commitfest.postgresql.org/patch/3986/
Clearly, it's not as simple as it may seem, otherwise the patch would
not be WIP for 3 years.
Thank you - this is exactly what I needed to see. Combined with Zsolt's
pointer to the SMGR patch already in production, I clearly should have
done this research before proposing. I'll study both: the working SMGR
solution and why patch 3986 has been WIP for 3 years. That should give
me proper context.
3. Maintenance Burden and Test Coverage
I deeply appreciate this concern. Having worked across various DBMS
implementations, I've seen solution vendors ship without comprehensive
regression testing - but never a database vendor. DBMS maintenance is
extraordinarily difficult, and storage errors are catastrophic.
This is precisely why test_tde exists as a reference implementation. But
you've identified the real issue: we need much stronger test coverage
for the hooks themselves.
The test cases should:
- Detect when core changes break hook contracts
- Verify hook behavior under all I/O paths (sync, async, error cases)
- Validate critical section safety
- Test interaction with checksums, crash recovery, replication
I agree the current test coverage is insufficient for core inclusion.
Would expanding the test suite to cover these scenarios address your
maintenance concerns, or do you see fundamental fragility beyond what
testing can solve?
I wasn't talking about test coverage. My point is we'd have to keep this
working forever, even if we choose to change how the SMGR works. Which
is not entirely theoretical.
I understand now. The maintenance burden isn't about testing - it's
about constraining future architectural evolution. Once hooks are in
core, they become an API contract that limits PostgreSQL's ability to
refactor SMGR.
This is exactly why SMGR extensibility is the right approach - it makes
the extension points explicit and architectural, rather than scattering
hooks that lock in implementation details.
4. Hooks vs Transform Layer - Pragmatic Timeline
You suggested improving SMGR extensibility rather than adding hooks. I
think you're architecturally right about the long-term direction.
However, I want to be pragmatic about timelines:
The hook and pd_flags approach, despite its limitations, can deliver
working TDE in the shortest time. Organizations facing regulatory
deadlines need something that works now, not in 2-3 years.
Others may see it differently, but in my opinion using pd_flags is a
dead end.
I realize users may wish for a solution "soon", but we're not going to
accept a flawed approach because of that. Exchanging short-term benefit
for long-term pain does not seem like a good trade-off.
Agreed. That said, companies are already using the SMGR patches in
production, which works while the proper upstream solution is developed.
I'll study these approaches.
That said, your feedback has sparked a better idea: what if we think of
this not as "SMGR extension" or "hooks" but as a pluggable Transform
Layer that SMGR and WAL subsystems delegate to?
Conceptually:
   Application Layer
           |
    Buffer Manager
           |
  +------------------+
  | Transform Layer  |  <-- Encryption, etc.
  +------------------+
           |
      SMGR / WAL
           |
       File I/O
This is architecturally cleaner than scattered hooks, and more focused
than full SMGR extensibility. The Transform Layer would:
- Provide a unified interface for data transformation
- Work across backend, frontend tools, and replication
- Handle metadata management in a standardized way
- Support encryption, compression, or other transformations
I think this deserves its own discussion thread rather than conflating
it with the current hook proposal. Would you be interested in starting a
separate conversation about designing a Transform Layer interface for
PostgreSQL?
Maybe. But I'm not convinced it'd be great to have many parallel threads
discussing approaches for the same ultimate end goal.
Understood about avoiding thread fragmentation.
I do wonder where bootstrap and frontend tool encryption should be
discussed - whether that belongs in the 3986 discussion or elsewhere -
but I should study that patch thoroughly first before raising the
question.
In the meantime, the hook approach could serve organizations with
immediate needs, and extensions could migrate to the Transform Layer
once it's stabilized.

It's not like there are no alternatives, though. We have FDE/LUKS,
application-level encryption, etc. Now there's also pg_tde. FWIW, the
hypothetical migration would be far from trivial.
5. Frontend Tool Access
Both SMGR and hook approaches face a shared limitation: frontend tools
(pg_checksums, pg_basebackup, etc.) that read files directly.

I'm not a TDE expert, but I don't see why tools like pg_basebackup would
need to be aware of this at all. A basebackup is just a filesystem copy.
You're right - pg_basebackup itself just copies files. The issue I
mentioned was actually specific to our implementation (key storage
under PGDATA with symlinks), not a general TDE concern.
However, tools like pg_checksums that read data pages directly from disk,
or tools that read WAL pages, do present a broader question: SMGR
extensibility handles backend I/O, but these frontend tools operate
outside that architecture.
This makes me wonder if a more comprehensive layer might be needed
to cover both backend (SMGR) and frontend tools. But I should study
the existing SMGR work first to see how this is currently addressed.
I previously suggested allowing initdb to specify a shared library that
both backend and frontend can load for transformation. But as I
reconsider this, it feels like it converges toward the Transform Layer
idea: a well-defined interface that any PostgreSQL component can use.

This might be the real architectural question: not "hooks vs SMGR" but
"how should PostgreSQL provide transformation points that work across
backend, frontend, and replication boundaries?"

Maybe. I was not proposing a new "transformation" layer, though. My
suggestion was entirely within the current SMGR architecture.
Understood.
Though I wonder if WAL encryption should be part of the same
discussion, or separate. SMGR handles pages, but WAL has different
characteristics.
Should this be in patch 3986, or separate?
regards
--
Tomas Vondra
Could you point me to where I can access the SMGR extensibility patch?
I'd like to review it properly before any further discussion.
There's the hackers discussion thread [1], the PG18 branch of our fork
[2] (... are easy to find), and of course you can look at how pg_tde uses
it [3].
But please note that none of them is 100% up to date. The hackers
thread is for PG17 (no AIO part yet). And we also had some in person
discussions about the patch during PgConf.Eu, which is not yet
reflected even in our fork. We plan to update the mailing list thread
in January.
Though I wonder if WAL encryption should be part of the same
discussion, or separate. SMGR handles pages, but WAL has different
characteristics.
I think we should keep it separate; the SMGR question is much simpler than WAL.
Do you think this is a reasonable direction? Or would you prefer a
different approach?
I have no preferred approach for WAL yet. Our solution in pg_tde has
some good and bad points, and the approach you used here similarly has
some good and bad. The main reason why we kept delaying opening a
"let's add WAL hooks" discussion on the mailing list is because we
weren't confident enough in our current approach. Is it good for a
fork? Definitely. Is it good enough for getting it accepted into the
core? Probably not.
Personally, I tried to come up with an approach that could be useful
for something other than TDE, including a proof-of-concept
implementation of that something (for example WAL compression, or
enabling an extension to split WAL into separate streams for each
database). But that's not easy to do, I haven't spent much time on
it so far, and maybe it's not even necessary; maybe simpler is better in
this case.
Starting a discussion about it is definitely a good idea, but maybe
the focus should be on debating and trying out different approaches
instead of proposing specific solutions to be included in PostgreSQL? From
that perspective, it is great that your implementation is different, because
we can talk about pros and cons and maybe figure out something even better.
[1]: /messages/by-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=PCNWEMrA@mail.gmail.com
[2]: https://github.com/percona/postgres/commits/PSP_REL_18_STABLE/
[3]: https://github.com/percona/pg_tde/blob/main/src/smgr/pg_tde_smgr.c