Direct I/O

Started by Thomas Munro about 3 years ago, 113 messages
#1 Thomas Munro
thomas.munro@gmail.com
3 attachment(s)

Hi,

Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the
AIO patch-set[1]. It adds three new settings, defaulting to off:

io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation

O_DIRECT asks the kernel to avoid caching file data as much as
possible. Here's a fun quote about it[2]:

"The exact semantics of Direct I/O (O_DIRECT) are not well specified.
It is not a part of POSIX, or SUS, or any other formal standards
specification. The exact meaning of O_DIRECT has historically been
negotiated in non-public discussions between powerful enterprise
database companies and proprietary Unix systems, and its behaviour has
generally been passed down as oral lore rather than as a formal set of
requirements and specifications."

It gives the kernel the opportunity to move data directly between
PostgreSQL's user space buffers and the storage hardware using DMA
hardware, that is, without CPU involvement or copying. Not all
storage stacks can do that, for various reasons, but even if not, the
caching policy should ideally still use temporary buffers and avoid
polluting the page cache.
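
To make that concrete, here is a minimal user-space sketch (not code
from the patches) of reading one block with direct I/O. It assumes
Linux-style O_DIRECT, 4KB buffer alignment and an 8KB block;
read_block(), alloc_io_buffer(), IO_ALIGN and BLOCK_SIZE are names made
up for illustration. Since some filesystems reject the flag, the sketch
falls back to buffered I/O when that happens.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define IO_ALIGN   4096         /* assumed buffer alignment */
#define BLOCK_SIZE 8192         /* assumed block size */

/* Allocate a zeroed, block-sized buffer aligned for direct I/O. */
static void *
alloc_io_buffer(void)
{
    void *p = NULL;

    if (posix_memalign(&p, IO_ALIGN, BLOCK_SIZE) != 0)
        return NULL;
    assert(((uintptr_t) p % IO_ALIGN) == 0);
    memset(p, 0, BLOCK_SIZE);
    return p;                   /* release with free() */
}

/*
 * Read one block at a block-aligned offset.  Direct I/O is requested when
 * the platform defines O_DIRECT; if the open or the read is rejected, the
 * request is retried with ordinary buffered I/O.  Returns the number of
 * bytes read or -1, and reports in *was_direct which path was used.
 */
static ssize_t
read_block(const char *path, off_t offset, void *buf, int *was_direct)
{
    int     fd = -1;
    ssize_t n;

    *was_direct = 0;
#ifdef O_DIRECT
    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd >= 0)
        *was_direct = 1;
#endif
    if (fd < 0)
        fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    n = pread(fd, buf, BLOCK_SIZE, offset);
    if (n < 0 && *was_direct && errno == EINVAL)
    {
        /* flag accepted at open time, but this request was refused */
        close(fd);
        *was_direct = 0;
        fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        n = pread(fd, buf, BLOCK_SIZE, offset);
    }
    close(fd);
    return n;
}
```

A caller would get a buffer from alloc_io_buffer(), pass it to
read_block() with an offset that is a multiple of BLOCK_SIZE, and
free() it afterwards.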

These settings currently destroy performance, and are not intended to
be used by end-users, yet! That's why we filed them under
DEVELOPER_OPTIONS. You don't get automatic read-ahead, concurrency,
clustering or (of course) buffering from the kernel. The idea is that
later parts of the AIO patch-set will introduce mechanisms to replace
what the kernel is doing for us today, and then more, since we ought
to be better at predicting our own future I/O than the kernel is, so
that we finish up ahead. Even with all that, you wouldn't want to turn
it on by default because the default shared_buffers would be
insufficient for any real system, and there are portability problems.

Examples of slowness:

* every 8KB sequential read or write becomes a full round trip to the
storage, one at a time

* data that is written to WAL and then read back in by WAL sender will
incur a full I/O round trip (that's probably not really an AIO problem,
that's something we should probably address by using shared memory
instead of files, as noted as a TODO item in the source code)

Memory alignment patches:

Direct I/O generally needs to be done to/from VM page-aligned
addresses, but only "standard" 4KB pages, even when larger VM pages
are in use (if there is an exotic system where that isn't true, it
won't work). We need to deal with buffers on the stack, the heap and
in shmem. For the stack, see patch 0001. For the heap and shared
memory, see patch 0002, but David Rowley is going to propose that part
separately, as MemoryContext API adjustments are a specialised enough
topic to deserve another thread; here I include a copy as a
dependency. The main direct I/O patch is 0003.
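
The heap half of this can be pictured with plain malloc(). The
following is only a sketch of the general over-allocate-and-round-up
technique, not the code in patch 0002: the patch places a redirect
chunk header in front of the aligned pointer, whereas this toy version
stores the raw malloc() pointer there. alloc_aligned() and
free_aligned() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Return a pointer aligned to 'alignto' (a power of two, at least
 * sizeof(void *)).  The raw malloc() result is hidden in the pointer-sized
 * slot just before the aligned address so that free_aligned() can find it.
 */
static void *
alloc_aligned(size_t size, size_t alignto)
{
    void     *raw;
    uintptr_t start;
    uintptr_t aligned;

    assert(alignto >= sizeof(void *));
    assert((alignto & (alignto - 1)) == 0);

    /* room for the payload, worst-case padding and the hidden pointer */
    raw = malloc(size + alignto + sizeof(void *));
    if (raw == NULL)
        return NULL;

    /* leave space for the hidden pointer, then round up */
    start = (uintptr_t) raw + sizeof(void *);
    aligned = (start + alignto - 1) & ~(uintptr_t) (alignto - 1);

    ((void **) aligned)[-1] = raw;
    return (void *) aligned;
}

/* Free a pointer obtained from alloc_aligned(). */
static void
free_aligned(void *p)
{
    if (p != NULL)
        free(((void **) p)[-1]);
}
```

The cost is up to one extra alignment unit of memory per allocation,
which is why this only makes sense for buffers that really are passed
to I/O functions.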

Assorted portability notes:

I expect this to "work" (that is, successfully destroy performance) on
typical developer systems running at least Linux, macOS, Windows and
FreeBSD. By work, I mean: not be rejected by PostgreSQL, not be
rejected by the kernel, and influence kernel cache behaviour on common
filesystems. It might be rejected with ENOTSUP, EINVAL etc on some
more exotic filesystems and OSes. Of currently supported OSes, only
OpenBSD and Solaris don't have O_DIRECT at all, and we'll reject the
GUCs. For macOS and Windows we internally translate our own
PG_O_DIRECT flag to the correct flags/calls (committed a while
back[3]).

On Windows, scatter/gather is available only with direct I/O, so a
true pwritev would in theory be possible, but that has some more
complications and is left for later patches (probably using native
interfaces rather than disguising them as POSIX).

There may be systems on which 8KB offset alignment will not work at
all or not work well, and that's expected. For example, BTRFS, ZFS,
JFS "big file", UFS etc allow larger-than-8KB blocks/records, and an
8KB write will have to trigger a read-before-write. Note that
offset/length alignment requirements (blocks) are independent of
buffer alignment requirements (memory pages, 4KB).
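
The two independent requirements can be spelled out as a small check.
This is an illustrative sketch only: direct_io_request_ok() and its
parameters are invented here, and the real limits depend on the
filesystem and device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Check a single I/O request against both kinds of alignment:
 *   - the buffer address against the memory alignment (e.g. 4096), and
 *   - the file offset and length against the block size (e.g. 8192).
 * Passing this check does not guarantee that the kernel accepts the
 * request; it only captures the two rules described in the text.
 */
static bool
direct_io_request_ok(const void *buf, uint64_t offset, size_t len,
                     size_t mem_align, size_t block_size)
{
    bool buffer_ok = ((uintptr_t) buf % mem_align) == 0;
    bool range_ok = (offset % block_size) == 0 &&
                    (len % block_size) == 0;

    return buffer_ok && range_ok;
}
```

Note that a buffer starting 4KB into an 8KB-aligned region still
passes with mem_align = 4096, while a file offset of 4096 fails with
block_size = 8192: the two rules do not imply each other.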

The behaviour and cache coherency of files that have open descriptors
using both direct and non-direct flags may be complicated and vary
between systems. The patch currently lets you change the GUCs at
runtime so backends can disagree: that should probably not be allowed,
but is like that now for experimentation. More study is required.

If someone has a compiler that we don't know how to do
pg_attribute_aligned() for, then we can't make correctly aligned stack
buffers, so in that case direct I/O is disabled, but I don't know of
such a system (maybe aCC, but we dropped it). That's why smgr code
can only assert that pointers are IO-aligned if PG_O_DIRECT != 0, and
why PG_O_DIRECT is forced to 0 if there is no pg_attribute_aligned()
macro, disabling the GUCs.
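
For illustration, the fallback described above can be sketched outside
PostgreSQL roughly like this. ATTR_ALIGNED, DIRECT_IO_USABLE and
AlignedBlock are stand-ins invented for this example (the real names
are pg_attribute_aligned(), PG_O_DIRECT and PGAlignedBlock), and the
compiler detection is simplified.

```c
#include <assert.h>
#include <stdint.h>

#define IO_ALIGN_SIZE 4096      /* assumed, mirrors PG_IO_ALIGN_SIZE */
#define BLOCK_SIZE    8192      /* assumed, mirrors the default BLCKSZ */

/* Define the alignment attribute only for compilers known to have one. */
#if defined(__GNUC__) || defined(__clang__)
#define ATTR_ALIGNED(a) __attribute__((aligned(a)))
#elif defined(_MSC_VER)
#define ATTR_ALIGNED(a) __declspec(align(a))
#endif

/* Without the attribute, stack buffers can't be trusted for direct I/O. */
#ifdef ATTR_ALIGNED
#define DIRECT_IO_USABLE 1
#else
#define DIRECT_IO_USABLE 0
#endif

typedef union AlignedBlock
{
#ifdef ATTR_ALIGNED
    ATTR_ALIGNED(IO_ALIGN_SIZE)
#endif
    char    data[BLOCK_SIZE];
    double  force_align_d;
    int64_t force_align_i64;
} AlignedBlock;

/* Report whether a block declared on the stack came out I/O-aligned. */
static int
stack_block_is_io_aligned(void)
{
    AlignedBlock block;

    return ((uintptr_t) block.data % IO_ALIGN_SIZE) == 0;
}
```

Code that asserts alignment would then do so only when
DIRECT_IO_USABLE is nonzero, which is the same shape as the
PG_O_DIRECT != 0 guards in the smgr code.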

This seems to be an independent enough piece to get into the tree on
its own, with the proviso that it's not actually useful yet other than
for experimentation. Thoughts?

These patches have been hacked on at various times by Andres Freund,
David Rowley and me.

[1]: https://wiki.postgresql.org/wiki/AIO
[2]: https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics
[3]: https://www.postgresql.org/message-id/CA+hUKG+ADiyyHe0cun2wfT+SVnFVqNYPxoO6J9zcZkVO7+NGig@mail.gmail.com

Attachments:

0001-Align-PGAlignedBlock-to-expected-page-size.patch (text/x-patch; charset=US-ASCII)
From 87a0c14600506d2a33a5a6bedc6e58d70ff7acc7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 24 Jun 2020 16:35:49 -0700
Subject: [PATCH 1/3] Align PGAlignedBlock to expected page size.

In order to be allowed to use O_DIRECT, we need to align buffers to the
page or sector size.

Author: Andres Freund <andres@anarazel.de>
Author: Thomas Munro <thomas.munro@gmail.com>
---
 src/include/c.h                | 20 ++++++++++++--------
 src/include/pg_config_manual.h |  8 ++++++++
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/src/include/c.h b/src/include/c.h
index d70ed84ac5..0deaca0414 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -1070,17 +1070,18 @@ extern void ExceptionalCondition(const char *conditionName,
 
 /*
  * Use this, not "char buf[BLCKSZ]", to declare a field or local variable
- * holding a page buffer, if that page might be accessed as a page and not
- * just a string of bytes.  Otherwise the variable might be under-aligned,
- * causing problems on alignment-picky hardware.  (In some places, we use
- * this to declare buffers even though we only pass them to read() and
- * write(), because copying to/from aligned buffers is usually faster than
- * using unaligned buffers.)  We include both "double" and "int64" in the
- * union to ensure that the compiler knows the value must be MAXALIGN'ed
- * (cf. configure's computation of MAXIMUM_ALIGNOF).
+ * holding a page buffer, if that page might be accessed as a page or passed to
+ * an I/O function and not just a string of bytes.  Otherwise the variable
+ * might be under-aligned, causing problems on alignment-picky hardware, or if
+ * PG_O_DIRECT is used.  We include both "double" and "int64" in the union to
+ * ensure that the compiler knows the value must be MAXALIGN'ed (cf.
+ * configure's computation of MAXIMUM_ALIGNOF).
  */
 typedef union PGAlignedBlock
 {
+#ifdef pg_attribute_aligned
+	pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#endif
 	char		data[BLCKSZ];
 	double		force_align_d;
 	int64		force_align_i64;
@@ -1089,6 +1090,9 @@ typedef union PGAlignedBlock
 /* Same, but for an XLOG_BLCKSZ-sized buffer */
 typedef union PGAlignedXLogBlock
 {
+#ifdef pg_attribute_aligned
+	pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#endif
 	char		data[XLOG_BLCKSZ];
 	double		force_align_d;
 	int64		force_align_i64;
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index f2a106f983..a2ad08a110 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -227,6 +227,14 @@
  */
 #define PG_CACHE_LINE_SIZE		128
 
+/*
+ * Assumed memory alignment requirement for direct I/O.  The real requirement
+ * may be based on sectors or pages.  The default is the typical modern sector
+ * size and virtual memory page size, which is enough for currently known
+ * systems.
+ */
+#define PG_IO_ALIGN_SIZE		4096
+
 /*
  *------------------------------------------------------------------------
  * The following symbols are for enabling debugging code, not for
-- 
2.35.1

0002-XXX-palloc_io_aligned-not-for-review-here.patch (text/x-patch; charset=US-ASCII)
From 7a1521dcafbc42b2482d16e8dd0781dfbd5ef2b4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 18 Oct 2022 09:47:45 -0700
Subject: [PATCH 2/3] XXX palloc_io_aligned() -- not for review here

This patch will be posted for review by David Rowley in its own thread,
but a copy is included here as a dependency.
---
 contrib/bloom/blinsert.c                   |  2 +-
 src/backend/access/gist/gistbuild.c        |  8 +-
 src/backend/access/gist/gistbuildbuffers.c |  5 +-
 src/backend/access/heap/rewriteheap.c      |  2 +-
 src/backend/access/nbtree/nbtree.c         |  2 +-
 src/backend/access/nbtree/nbtsort.c        |  8 +-
 src/backend/access/spgist/spginsert.c      |  2 +-
 src/backend/nodes/gen_node_support.pl      |  2 +-
 src/backend/storage/buffer/buf_init.c      |  7 +-
 src/backend/storage/buffer/localbuf.c      |  4 +-
 src/backend/storage/page/bufpage.c         |  2 +-
 src/backend/storage/smgr/md.c              | 14 ++-
 src/backend/utils/mmgr/mcxt.c              | 99 ++++++++++++++++++++--
 src/include/nodes/memnodes.h               |  5 +-
 src/include/utils/memutils_internal.h      |  4 +-
 src/include/utils/palloc.h                 |  5 ++
 16 files changed, 141 insertions(+), 30 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index dd26d6ac29..b0da3ac529 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -166,7 +166,7 @@ blbuildempty(Relation index)
 	Page		metapage;
 
 	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
+	metapage = (Page) palloc_io_aligned(BLCKSZ, 0);
 	BloomFillMetapage(index, metapage);
 
 	/*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index fb0f466708..2daa9b2e10 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -415,7 +415,7 @@ gist_indexsortbuild(GISTBuildState *state)
 	 * Write an empty page as a placeholder for the root page. It will be
 	 * replaced with the real root page at the end.
 	 */
-	page = palloc0(BLCKSZ);
+	page = palloc_io_aligned(BLCKSZ, MCXT_ALLOC_ZERO);
 	smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
 			   page, true);
 	state->pages_allocated++;
@@ -509,7 +509,7 @@ gist_indexsortbuild_levelstate_add(GISTBuildState *state,
 			levelstate->current_page++;
 
 		if (levelstate->pages[levelstate->current_page] == NULL)
-			levelstate->pages[levelstate->current_page] = palloc(BLCKSZ);
+			levelstate->pages[levelstate->current_page] = palloc_io_aligned(BLCKSZ, 0);
 
 		newPage = levelstate->pages[levelstate->current_page];
 		gistinitpage(newPage, old_page_flags);
@@ -579,7 +579,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 
 		/* Create page and copy data */
 		data = (char *) (dist->list);
-		target = palloc0(BLCKSZ);
+		target = (Page) palloc_io_aligned(BLCKSZ, 0);
 		gistinitpage(target, isleaf ? F_LEAF : 0);
 		for (int i = 0; i < dist->block.num; i++)
 		{
@@ -630,7 +630,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		if (parent == NULL)
 		{
 			parent = palloc0(sizeof(GistSortedBuildLevelState));
-			parent->pages[0] = (Page) palloc(BLCKSZ);
+			parent->pages[0] = (Page) palloc_io_aligned(BLCKSZ, 0);
 			parent->parent = NULL;
 			gistinitpage(parent->pages[0], 0);
 
diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 538e3880c9..9e188633ae 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -186,8 +186,9 @@ gistAllocateNewPageBuffer(GISTBuildBuffers *gfbb)
 {
 	GISTNodeBufferPage *pageBuffer;
 
-	pageBuffer = (GISTNodeBufferPage *) MemoryContextAllocZero(gfbb->context,
-															   BLCKSZ);
+	pageBuffer = (GISTNodeBufferPage *)
+		MemoryContextAllocIOAligned(gfbb->context,
+									BLCKSZ, MCXT_ALLOC_ZERO);
 	pageBuffer->prev = InvalidBlockNumber;
 
 	/* Set page free space */
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index b01b39b008..6fe7f1aed4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -257,7 +257,7 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 
 	state->rs_old_rel = old_heap;
 	state->rs_new_rel = new_heap;
-	state->rs_buffer = (Page) palloc(BLCKSZ);
+	state->rs_buffer = (Page) palloc_io_aligned(BLCKSZ, 0);
 	/* new_heap needn't be empty, just locked */
 	state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
 	state->rs_buffer_valid = false;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b52eca8f38..924da953aa 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -153,7 +153,7 @@ btbuildempty(Relation index)
 	Page		metapage;
 
 	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
+	metapage = (Page) palloc_io_aligned(BLCKSZ, 0);
 	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
 
 	/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 501e011ce1..563e6cce1f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -619,7 +619,7 @@ _bt_blnewpage(uint32 level)
 	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc(BLCKSZ);
+	page = (Page) palloc_io_aligned(BLCKSZ, 0);
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -660,7 +660,9 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	while (blkno > wstate->btws_pages_written)
 	{
 		if (!wstate->btws_zeropage)
-			wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
+			wstate->btws_zeropage =
+				(Page) palloc_io_aligned(BLCKSZ, MCXT_ALLOC_ZERO);
+
 		/* don't set checksum for all-zero page */
 		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
 				   wstate->btws_pages_written++,
@@ -1170,7 +1172,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
 	 */
-	metapage = (Page) palloc(BLCKSZ);
+	metapage = (Page) palloc_io_aligned(BLCKSZ, 0);
 	_bt_initmetapage(metapage, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index c6821b5952..d5b83710e4 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -158,7 +158,7 @@ spgbuildempty(Relation index)
 	Page		page;
 
 	/* Construct metapage. */
-	page = (Page) palloc(BLCKSZ);
+	page = (Page) palloc_io_aligned(BLCKSZ, 0);
 	SpGistInitMetapage(page);
 
 	/*
diff --git a/src/backend/nodes/gen_node_support.pl b/src/backend/nodes/gen_node_support.pl
index 81b8c184a9..9598056821 100644
--- a/src/backend/nodes/gen_node_support.pl
+++ b/src/backend/nodes/gen_node_support.pl
@@ -142,7 +142,7 @@ my @abstract_types = qw(Node);
 # they otherwise don't participate in node support.
 my @extra_tags = qw(
   IntList OidList XidList
-  AllocSetContext GenerationContext SlabContext
+  AllocSetContext GenerationContext SlabContext AlignedAllocRedirectContext
   TIDBitmap
   WindowObjectData
 );
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 6b6264854e..edd9bd48c3 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -79,8 +79,9 @@ InitBufferPool(void)
 						&foundDescs);
 
 	BufferBlocks = (char *)
-		ShmemInitStruct("Buffer Blocks",
-						NBuffers * (Size) BLCKSZ, &foundBufs);
+		TYPEALIGN(BLCKSZ,
+				  ShmemInitStruct("Buffer Blocks",
+								  (NBuffers + 1) * (Size) BLCKSZ, &foundBufs));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
@@ -164,6 +165,8 @@ BufferShmemSize(void)
 	size = add_size(size, PG_CACHE_LINE_SIZE);
 
 	/* size of data pages */
+	/* to allow aligning buffer blocks */
+	size = add_size(size, BLCKSZ);
 	size = add_size(size, mul_size(NBuffers, BLCKSZ));
 
 	/* size of stuff controlled by freelist.c */
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..f51d3527f6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -546,8 +546,8 @@ GetLocalBufferStorage(void)
 		/* And don't overflow MaxAllocSize, either */
 		num_bufs = Min(num_bufs, MaxAllocSize / BLCKSZ);
 
-		cur_block = (char *) MemoryContextAlloc(LocalBufferContext,
-												num_bufs * BLCKSZ);
+		cur_block = (char *) MemoryContextAllocIOAligned(LocalBufferContext,
+														 num_bufs * BLCKSZ, 0);
 		next_buf_in_block = 0;
 		num_bufs_in_block = num_bufs;
 	}
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 8b617c7e79..42f6f1782a 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1522,7 +1522,7 @@ PageSetChecksumCopy(Page page, BlockNumber blkno)
 	 * and second to avoid wasting space in processes that never call this.
 	 */
 	if (pageCopy == NULL)
-		pageCopy = MemoryContextAlloc(TopMemoryContext, BLCKSZ);
+		pageCopy = MemoryContextAllocIOAligned(TopMemoryContext, BLCKSZ, 0);
 
 	memcpy(pageCopy, (char *) page, BLCKSZ);
 	((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index a515bb36ac..719721a894 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -439,6 +439,10 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	int			nbytes;
 	MdfdVec    *v;
 
+#if PG_O_DIRECT != 0
+	AssertPointerAlignment(buffer, PG_IO_ALIGN_SIZE);
+#endif
+
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert(blocknum >= mdnblocks(reln, forknum));
@@ -661,6 +665,10 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	int			nbytes;
 	MdfdVec    *v;
 
+#if PG_O_DIRECT != 0
+	AssertPointerAlignment(buffer, PG_IO_ALIGN_SIZE);
+#endif
+
 	TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
 										reln->smgr_rlocator.locator.spcOid,
 										reln->smgr_rlocator.locator.dbOid,
@@ -726,6 +734,10 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	int			nbytes;
 	MdfdVec    *v;
 
+#if PG_O_DIRECT != 0
+	AssertPointerAlignment(buffer, PG_IO_ALIGN_SIZE);
+#endif
+
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert(blocknum < mdnblocks(reln, forknum));
@@ -1280,7 +1292,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 			 */
 			if (nblocks < ((BlockNumber) RELSEG_SIZE))
 			{
-				char	   *zerobuf = palloc0(BLCKSZ);
+				char	   *zerobuf = palloc_io_aligned(BLCKSZ, MCXT_ALLOC_ZERO);
 
 				mdextend(reln, forknum,
 						 nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index f526ca82c1..807c0f3af3 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -36,6 +36,9 @@ static void BogusFree(void *pointer);
 static void *BogusRealloc(void *pointer, Size size);
 static MemoryContext BogusGetChunkContext(void *pointer);
 static Size BogusGetChunkSpace(void *pointer);
+static void AlignedAllocFree(void *pointer);
+static MemoryContext AlignedAllocGetChunkContext(void *pointer);
+
 
 /*****************************************************************************
  *	  GLOBAL MEMORY															 *
@@ -84,6 +87,10 @@ static const MemoryContextMethods mcxt_methods[] = {
 	[MCTX_SLAB_ID].check = SlabCheck,
 #endif
 
+	/* in here */
+	[MCTX_ALIGNED_REDIRECT_ID].get_chunk_context = AlignedAllocGetChunkContext,
+	[MCTX_ALIGNED_REDIRECT_ID].free_p = AlignedAllocFree,
+
 	/*
 	 * Unused (as yet) IDs should have dummy entries here.  This allows us to
 	 * fail cleanly if a bogus pointer is passed to pfree or the like.  It
@@ -110,11 +117,6 @@ static const MemoryContextMethods mcxt_methods[] = {
 	[MCTX_UNUSED4_ID].realloc = BogusRealloc,
 	[MCTX_UNUSED4_ID].get_chunk_context = BogusGetChunkContext,
 	[MCTX_UNUSED4_ID].get_chunk_space = BogusGetChunkSpace,
-
-	[MCTX_UNUSED5_ID].free_p = BogusFree,
-	[MCTX_UNUSED5_ID].realloc = BogusRealloc,
-	[MCTX_UNUSED5_ID].get_chunk_context = BogusGetChunkContext,
-	[MCTX_UNUSED5_ID].get_chunk_space = BogusGetChunkSpace,
 };
 
 /*
@@ -1306,11 +1308,16 @@ void
 pfree(void *pointer)
 {
 #ifdef USE_VALGRIND
+	MemoryContextMethodID method = GetMemoryChunkMethodID(pointer);
 	MemoryContext context = GetMemoryChunkContext(pointer);
 #endif
 
 	MCXT_METHOD(pointer, free_p) (pointer);
-	VALGRIND_MEMPOOL_FREE(context, pointer);
+
+#ifdef USE_VALGRIND
+	if (method != MCTX_ALIGNED_REDIRECT_ID)
+		VALGRIND_MEMPOOL_FREE(context, pointer);
+#endif
 }
 
 /*
@@ -1497,3 +1504,83 @@ pchomp(const char *in)
 		n--;
 	return pnstrdup(in, n);
 }
+
+/*
+ * pointer to fake memory context + pointer to actual allocation
+ */
+#define ALIGNED_ALLOC_CHUNK_SIZE (sizeof(uintptr_t) + sizeof(uintptr_t))
+
+#include "utils/memutils_memorychunk.h"
+
+static void
+AlignedAllocFree(void *pointer)
+{
+	MemoryChunk *chunk = PointerGetMemoryChunk(pointer);
+	void *unaligned;
+
+	Assert(!MemoryChunkIsExternal(chunk));
+
+	unaligned = MemoryChunkGetBlock(chunk);
+
+	pfree(unaligned);
+}
+
+MemoryContext
+AlignedAllocGetChunkContext(void *pointer)
+{
+	MemoryChunk *chunk = PointerGetMemoryChunk(pointer);
+
+	Assert(!MemoryChunkIsExternal(chunk));
+
+	return GetMemoryChunkContext(MemoryChunkGetBlock(chunk));
+}
+
+void *
+MemoryContextAllocAligned(MemoryContext context,
+						  Size size, Size alignto, int flags)
+{
+	Size		alloc_size;
+	void	   *unaligned;
+	void	   *aligned;
+
+	/* wouldn't make much sense to waste that much space */
+	Assert(alignto < (128 * 1024 * 1024));
+
+	if (alignto < MAXIMUM_ALIGNOF)
+		return MemoryContextAllocExtended(context, size, flags);
+
+	/* allocate enough space for alignment padding */
+	alloc_size = size + alignto + sizeof(MemoryChunk);
+
+	unaligned = MemoryContextAllocExtended(context, alloc_size, flags);
+
+	aligned = (char *) unaligned + sizeof(MemoryChunk);
+	aligned = (void *) (TYPEALIGN(alignto, aligned) - sizeof(MemoryChunk));
+
+	MemoryChunkSetHdrMask(aligned, unaligned, 0, MCTX_ALIGNED_REDIRECT_ID);
+
+	/* XXX: should we adjust valgrind state here? */
+
+	Assert((char *) TYPEALIGN(alignto, MemoryChunkGetPointer(aligned)) == MemoryChunkGetPointer(aligned));
+
+	return MemoryChunkGetPointer(aligned);
+}
+
+void *
+MemoryContextAllocIOAligned(MemoryContext context, Size size, int flags)
+{
+	// FIXME: don't hardcode page size
+	return MemoryContextAllocAligned(context, size, 4096, flags);
+}
+
+void *
+palloc_aligned(Size size, Size alignto, int flags)
+{
+	return MemoryContextAllocAligned(CurrentMemoryContext, size, alignto, flags);
+}
+
+void *
+palloc_io_aligned(Size size, int flags)
+{
+	return MemoryContextAllocIOAligned(CurrentMemoryContext, size, flags);
+}
diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index 63d07358cd..dcfe41806a 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -104,10 +104,11 @@ typedef struct MemoryContextData
  *
  * Add new context types to the set accepted by this macro.
  */
-#define MemoryContextIsValid(context) \
+#define MemoryContextIsValid(context)                                         \
 	((context) != NULL && \
 	 (IsA((context), AllocSetContext) || \
 	  IsA((context), SlabContext) || \
-	  IsA((context), GenerationContext)))
+	  IsA((context), GenerationContext) || \
+	  IsA((context), AlignedAllocRedirectContext)))
 
 #endif							/* MEMNODES_H */
diff --git a/src/include/utils/memutils_internal.h b/src/include/utils/memutils_internal.h
index bc2cbdd506..9611a192a2 100644
--- a/src/include/utils/memutils_internal.h
+++ b/src/include/utils/memutils_internal.h
@@ -92,8 +92,8 @@ typedef enum MemoryContextMethodID
 	MCTX_ASET_ID,
 	MCTX_GENERATION_ID,
 	MCTX_SLAB_ID,
-	MCTX_UNUSED4_ID,			/* available */
-	MCTX_UNUSED5_ID				/* 111 occurs in wipe_mem'd memory */
+	MCTX_ALIGNED_REDIRECT_ID,
+	MCTX_UNUSED4_ID				/* 111 occurs in wipe_mem'd memory */
 } MemoryContextMethodID;
 
 /*
diff --git a/src/include/utils/palloc.h b/src/include/utils/palloc.h
index 8eee0e2938..0b0ba2a953 100644
--- a/src/include/utils/palloc.h
+++ b/src/include/utils/palloc.h
@@ -73,10 +73,15 @@ extern void *MemoryContextAllocZero(MemoryContext context, Size size);
 extern void *MemoryContextAllocZeroAligned(MemoryContext context, Size size);
 extern void *MemoryContextAllocExtended(MemoryContext context,
 										Size size, int flags);
+extern void *MemoryContextAllocAligned(MemoryContext context,
+									   Size size, Size alignto, int flags);
+extern void *MemoryContextAllocIOAligned(MemoryContext context, Size size, int flags);
 
 extern void *palloc(Size size);
 extern void *palloc0(Size size);
 extern void *palloc_extended(Size size, int flags);
+extern void *palloc_aligned(Size size, Size alignto, int flags);
+extern void *palloc_io_aligned(Size size, int flags);
 extern pg_nodiscard void *repalloc(void *pointer, Size size);
 extern pg_nodiscard void *repalloc_extended(void *pointer,
 											Size size, int flags);
-- 
2.35.1

0003-Add-direct-I-O-settings-developer-only.patch (text/x-patch; charset=US-ASCII)
From 819a406f029b04ab6a500f63fe9c154332b65d8e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 3 Oct 2022 21:58:22 -0700
Subject: [PATCH 3/3] Add direct I/O settings (developer-only).

Provide a way to ask the kernel to use O_DIRECT (or local equivalent)
for data and WAL files.  This hurts performance currently and is not
intended for end-users yet.  Later proposed work will introduce our own
I/O clustering, read-ahead, etc to replace the kernel features that are
disabled with this option.

This replaces the previous logic that would use O_DIRECT for the WAL in
limited and obscure cases, now that there is an explicit setting.

Discussion: https://postgr.es/m/
Author: Andres Freund <andres@anarazel.de>
Author: Thomas Munro <thomas.munro@gmail.com>
---
 doc/src/sgml/config.sgml                    | 51 ++++++++++++++++++++
 src/backend/access/transam/xlog.c           | 53 +++++++++++++--------
 src/backend/access/transam/xlogprefetcher.c |  2 +-
 src/backend/storage/buffer/bufmgr.c         | 13 +++--
 src/backend/storage/buffer/localbuf.c       |  4 +-
 src/backend/storage/file/fd.c               |  5 ++
 src/backend/storage/smgr/md.c               | 29 +++++++++--
 src/backend/storage/smgr/smgr.c             | 20 ++++++++
 src/backend/utils/misc/guc_tables.c         | 33 +++++++++++++
 src/include/access/xlog.h                   |  2 +
 src/include/storage/fd.h                    |  6 ++-
 src/include/storage/smgr.h                  |  5 ++
 src/include/utils/guc_hooks.h               |  2 +
 13 files changed, 190 insertions(+), 35 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 559eb898a9..2d860dd900 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11011,6 +11011,57 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-io-data-direct" xreflabel="io_data_direct">
+      <term><varname>io_data_direct</varname> (<type>boolean</type>)
+      <indexterm>
+        <primary><varname>io_data_direct</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Ask the kernel to minimize caching effects for relation data files
+        using <literal>O_DIRECT</literal> (most Unix-like systems),
+        <literal>F_NOCACHE</literal> (macOS) or
+        <literal>FILE_FLAG_NO_BUFFERING</literal> (Windows).  Currently this
+        hurts performance, and is intended for developer testing only.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-io-wal-direct" xreflabel="io_wal_direct">
+      <term><varname>io_wal_direct</varname> (<type>boolean</type>)
+      <indexterm>
+        <primary><varname>io_wal_direct</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Ask the kernel to minimize caching effects while writing WAL files
+        using <literal>O_DIRECT</literal> (most Unix-like systems),
+        <literal>F_NOCACHE</literal> (macOS) or
+        <literal>FILE_FLAG_NO_BUFFERING</literal> (Windows).  Currently this
+        hurts performance, and is intended for developer testing only.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-io-wal-init-direct" xreflabel="io_wal_init_direct">
+      <term><varname>io_wal_init_direct</varname> (<type>boolean</type>)
+      <indexterm>
+        <primary><varname>io_wal_init_direct</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Ask the kernel to minimize caching effects while initializing WAL files
+        using <literal>O_DIRECT</literal> (most Unix-like systems),
+        <literal>F_NOCACHE</literal> (macOS) or
+        <literal>FILE_FLAG_NO_BUFFERING</literal> (Windows).  Currently this
+        hurts performance, and is intended for developer testing only.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-post-auth-delay" xreflabel="post_auth_delay">
       <term><varname>post_auth_delay</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f10effe3a..5663bdf856 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -138,6 +138,8 @@ int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
 int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
+bool		io_wal_direct = false;
+bool		io_wal_init_direct = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2926,6 +2928,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	XLogSegNo	max_segno;
 	int			fd;
 	int			save_errno;
+	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
 
 	Assert(logtli != 0);
 
@@ -2958,8 +2961,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 
 	unlink(tmppath);
 
+	if (io_wal_init_direct)
+		open_flags |= PG_O_DIRECT;
+
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	fd = BasicOpenFile(tmppath, open_flags);
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -3373,7 +3379,7 @@ XLogFileClose(void)
 	 * use the cache to read the WAL segment.
 	 */
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-	if (!XLogIsNeeded())
+	if (!XLogIsNeeded() && !io_wal_direct)
 		(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
 #endif
 
@@ -4473,6 +4479,21 @@ show_in_hot_standby(void)
 	return RecoveryInProgress() ? "on" : "off";
 }
 
+/*
+ * GUC check for direct I/O support.
+ */
+bool
+check_io_wal_direct(bool *newval, void **extra, GucSource source)
+{
+#if PG_O_DIRECT == 0
+	if (*newval)
+	{
+		GUC_check_errdetail("io_wal_direct and io_wal_init_direct are not supported on this platform.");
+		return false;
+	}
+#endif
+	return true;
+}
 
 /*
  * Read the control file, set respective GUCs.
@@ -8056,35 +8077,27 @@ xlog_redo(XLogReaderState *record)
 }
 
 /*
- * Return the (possible) sync flag used for opening a file, depending on the
- * value of the GUC wal_sync_method.
+ * Return the extra open flags used for opening a file, depending on the
+ * value of the GUCs wal_sync_method, fsync and io_wal_direct.
  */
 static int
 get_sync_bit(int method)
 {
 	int			o_direct_flag = 0;
 
-	/* If fsync is disabled, never open in sync mode */
-	if (!enableFsync)
-		return 0;
-
 	/*
-	 * Optimize writes by bypassing kernel cache with O_DIRECT when using
-	 * O_SYNC and O_DSYNC.  But only if archiving and streaming are disabled,
-	 * otherwise the archive command or walsender process will read the WAL
-	 * soon after writing it, which is guaranteed to cause a physical read if
-	 * we bypassed the kernel cache. We also skip the
-	 * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
-	 * reason.
-	 *
-	 * Never use O_DIRECT in walreceiver process for similar reasons; the WAL
+	 * Use O_DIRECT if requested, except in walreceiver process.  The WAL
 	 * written by walreceiver is normally read by the startup process soon
-	 * after it's written. Also, walreceiver performs unaligned writes, which
+	 * after it's written.  Also, walreceiver performs unaligned writes, which
 	 * don't work with O_DIRECT, so it is required for correctness too.
 	 */
-	if (!XLogIsNeeded() && !AmWalReceiverProcess())
+	if (io_wal_direct && !AmWalReceiverProcess())
 		o_direct_flag = PG_O_DIRECT;
 
+	/* If fsync is disabled, never open in sync mode */
+	if (!enableFsync)
+		return o_direct_flag;
+
 	switch (method)
 	{
 			/*
@@ -8096,7 +8109,7 @@ get_sync_bit(int method)
 		case SYNC_METHOD_FSYNC:
 		case SYNC_METHOD_FSYNC_WRITETHROUGH:
 		case SYNC_METHOD_FDATASYNC:
-			return 0;
+			return o_direct_flag;
 #ifdef O_SYNC
 		case SYNC_METHOD_OPEN:
 			return O_SYNC | o_direct_flag;
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 0cf03945ee..d840078afc 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -785,7 +785,7 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 				block->prefetch_buffer = InvalidBuffer;
 				return LRQ_NEXT_IO;
 			}
-			else
+			else if (!io_data_direct)
 			{
 				/*
 				 * This shouldn't be possible, because we already determined
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6b95381481..9918855f37 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -535,7 +535,7 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
 		 * Try to initiate an asynchronous read.  This returns false in
 		 * recovery if the relation file doesn't exist.
 		 */
-		if (smgrprefetch(smgr_reln, forkNum, blockNum))
+		if (!io_data_direct && smgrprefetch(smgr_reln, forkNum, blockNum))
 			result.initiated_io = true;
 #endif							/* USE_PREFETCH */
 	}
@@ -582,11 +582,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
  * the kernel and therefore didn't really initiate I/O, and no way to know when
  * the I/O completes other than using synchronous ReadBuffer().
  *
- * 3.  Otherwise, the buffer wasn't already cached by PostgreSQL, and either
+ * 3.  Otherwise, the buffer wasn't already cached by PostgreSQL, and
  * USE_PREFETCH is not defined (this build doesn't support prefetching due to
- * lack of a kernel facility), or the underlying relation file wasn't found and
- * we are in recovery.  (If the relation file wasn't found and we are not in
- * recovery, an error is raised).
+ * lack of a kernel facility), io_data_direct is enabled, or the underlying
+ * relation file wasn't found and we are in recovery.  (If the relation file
+ * wasn't found and we are not in recovery, an error is raised).
  */
 PrefetchBufferResult
 PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
@@ -4908,6 +4908,9 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
 {
 	PendingWriteback *pending;
 
+	if (io_data_direct)
+		return;
+
 	/*
 	 * Add buffer to the pending writeback array, unless writeback control is
 	 * disabled.
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index f51d3527f6..f9c82a789e 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -87,8 +87,8 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 	{
 #ifdef USE_PREFETCH
 		/* Not in buffers, so initiate prefetch */
-		smgrprefetch(smgr, forkNum, blockNum);
-		result.initiated_io = true;
+		if (!io_data_direct && smgrprefetch(smgr, forkNum, blockNum))
+			result.initiated_io = true;
 #endif							/* USE_PREFETCH */
 	}
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 4151cafec5..aa720952f8 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2021,6 +2021,11 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 	if (nbytes <= 0)
 		return;
 
+#ifdef PG_O_DIRECT
+	if (VfdCache[file].fileFlags & PG_O_DIRECT)
+		return;
+#endif
+
 	returnCode = FileAccess(file);
 	if (returnCode < 0)
 		return;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 719721a894..20ec37c310 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -142,6 +142,21 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
 static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
 							  MdfdVec *seg);
 
+static inline int
+_mdfd_open_flags(ForkNumber forkNum)
+{
+	int		flags = O_RDWR | PG_BINARY;
+
+	/*
+	 * XXX: not clear if direct IO ever is interesting for other forks?  The
+	 * FSM fork currently often ends up very fragmented when using direct IO,
+	 * for example.
+	 */
+	if (io_data_direct /* && forkNum == MAIN_FORKNUM */)
+		flags |= PG_O_DIRECT;
+
+	return flags;
+}
 
 /*
  *	mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -205,14 +220,14 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 
 	path = relpath(reln->smgr_rlocator, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	fd = PathNameOpenFile(path, _mdfd_open_flags(forknum) | O_CREAT | O_EXCL);
 
 	if (fd < 0)
 	{
 		int			save_errno = errno;
 
 		if (isRedo)
-			fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+			fd = PathNameOpenFile(path, _mdfd_open_flags(forknum));
 		if (fd < 0)
 		{
 			/* be sure to report the error reported by create, not open */
@@ -513,7 +528,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 
 	path = relpath(reln->smgr_rlocator, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+	fd = PathNameOpenFile(path, _mdfd_open_flags(forknum));
 
 	if (fd < 0)
 	{
@@ -584,6 +599,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 	off_t		seekpos;
 	MdfdVec    *v;
 
+	Assert(!io_data_direct);
+
 	v = _mdfd_getseg(reln, forknum, blocknum, false,
 					 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
 	if (v == NULL)
@@ -609,6 +626,8 @@ void
 mdwriteback(SMgrRelation reln, ForkNumber forknum,
 			BlockNumber blocknum, BlockNumber nblocks)
 {
+	Assert(!io_data_direct);
+
 	/*
 	 * Issue flush requests in as few requests as possible; have to split at
 	 * segment boundaries though, since those are actually separate files.
@@ -1186,7 +1205,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 	fullpath = _mdfd_segpath(reln, forknum, segno);
 
 	/* open the file */
-	fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | oflags);
+	fd = PathNameOpenFile(fullpath, _mdfd_open_flags(forknum) | oflags);
 
 	pfree(fullpath);
 
@@ -1395,7 +1414,7 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		strlcpy(path, p, MAXPGPATH);
 		pfree(p);
 
-		file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+		file = PathNameOpenFile(path, _mdfd_open_flags(ftag->forknum));
 		if (file < 0)
 			return -1;
 		need_to_close = true;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index c1a5febcbf..706a52b9f1 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -20,6 +20,7 @@
 #include "access/xlogutils.h"
 #include "lib/ilist.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/md.h"
 #include "storage/smgr.h"
@@ -27,6 +28,9 @@
 #include "utils/inval.h"
 
 
+/* GUCs */
+bool		io_data_direct = false;
+
 /*
  * This struct of function pointers defines the API between smgr.c and
  * any individual storage manager module.  Note that smgr subfunctions are
@@ -735,3 +739,19 @@ ProcessBarrierSmgrRelease(void)
 	smgrreleaseall();
 	return true;
 }
+
+/*
+ * Check if this build allows smgr implementations to enable direct I/O.
+ */
+bool
+check_io_data_direct(bool *newval, void **extra, GucSource source)
+{
+#if PG_O_DIRECT == 0
+	if (*newval)
+	{
+		GUC_check_errdetail("io_data_direct is not supported on this platform.");
+		return false;
+	}
+#endif
+	return true;
+}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 05ab087934..e324378ad4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1925,6 +1925,39 @@ struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"io_data_direct", PGC_SUSET, DEVELOPER_OPTIONS,
+			gettext_noop("Access data files with direct I/O."),
+			NULL,
+			GUC_NOT_IN_SAMPLE
+		},
+		&io_data_direct,
+		false,
+		check_io_data_direct, NULL, NULL
+	},
+
+	{
+		{"io_wal_direct", PGC_SUSET, DEVELOPER_OPTIONS,
+			gettext_noop("Write WAL files with direct I/O."),
+			NULL,
+			GUC_NOT_IN_SAMPLE
+		},
+		&io_wal_direct,
+		false,
+		check_io_wal_direct, NULL, NULL
+	},
+
+	{
+		{"io_wal_init_direct", PGC_SUSET, DEVELOPER_OPTIONS,
+			gettext_noop("Initialize WAL files with direct I/O."),
+			NULL,
+			GUC_NOT_IN_SAMPLE
+		},
+		&io_wal_init_direct,
+		false,
+		check_io_wal_direct, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1fbd48fbda..6220370036 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -51,6 +51,8 @@ extern PGDLLIMPORT char *wal_consistency_checking_string;
 extern PGDLLIMPORT bool log_checkpoints;
 extern PGDLLIMPORT bool track_wal_io_timing;
 extern PGDLLIMPORT int wal_decode_buffer_size;
+extern PGDLLIMPORT bool io_wal_direct;
+extern PGDLLIMPORT bool io_wal_init_direct;
 
 extern PGDLLIMPORT int CheckPointSegments;
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index c0a212487d..283ff21e31 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -44,6 +44,7 @@
 #define FD_H
 
 #include <dirent.h>
+#include <fcntl.h>
 
 typedef enum RecoveryInitSyncMethod
 {
@@ -82,9 +83,10 @@ extern PGDLLIMPORT int max_safe_fds;
  * to the appropriate Windows flag in src/port/open.c.  We simulate it with
  * fcntl(F_NOCACHE) on macOS inside fd.c's open() wrapper.  We use the name
  * PG_O_DIRECT rather than defining O_DIRECT in that case (probably not a good
- * idea on a Unix).
+ * idea on a Unix).  We can only use it if the compiler will correctly align
+ * PGAlignedBlock for us, though.
  */
-#if defined(O_DIRECT)
+#if defined(O_DIRECT) && defined(pg_attribute_aligned)
 #define		PG_O_DIRECT O_DIRECT
 #elif defined(F_NOCACHE)
 #define		PG_O_DIRECT 0x80000000
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a07715356b..ef75934a16 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -17,6 +17,10 @@
 #include "lib/ilist.h"
 #include "storage/block.h"
 #include "storage/relfilelocator.h"
+#include "utils/guc.h"
+
+/* GUCs */
+extern PGDLLIMPORT bool io_data_direct;
 
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
@@ -107,5 +111,6 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void AtEOXact_SMgr(void);
 extern bool ProcessBarrierSmgrRelease(void);
+extern bool check_io_data_direct(bool *newval, void **extra, GucSource source);
 
 #endif							/* SMGR_H */
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index f1a9a183b4..a9748f6b34 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -59,6 +59,8 @@ extern bool check_effective_io_concurrency(int *newval, void **extra,
 										   GucSource source);
 extern bool check_huge_page_size(int *newval, void **extra, GucSource source);
 extern const char *show_in_hot_standby(void);
+extern bool check_io_data_direct(bool *newval, void **extra, GucSource source);
+extern bool check_io_wal_direct(bool *newval, void **extra, GucSource source);
 extern bool check_locale_messages(char **newval, void **extra, GucSource source);
 extern void assign_locale_messages(const char *newval, void *extra);
 extern bool check_locale_monetary(char **newval, void **extra, GucSource source);
-- 
2.35.1

#2Justin Pryzby
pryzby@telsasoft.com
In reply to: Thomas Munro (#1)
Re: Direct I/O

On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:

Hi,

Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the
AIO patch-set[1]. It adds three new settings, defaulting to off:

io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation

You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings. (By "better", I mean
reducing the number of new GUCs - which is less important for developer
GUCs anyway.)

DIO is slower, but not so much that it can't run under CI. I suggest to
add an 099 commit to enable the feature during development.

Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment

--
Justin

#3Thomas Munro
thomas.munro@gmail.com
In reply to: Justin Pryzby (#2)
Re: Direct I/O

On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:

io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation

You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings. (By "better", I mean
reducing the number of new GUCs - which is less important for developer
GUCs anyway.)

Interesting idea. So "direct_io = data, wal, wal_init", or maybe that
should be spelled io_direct. ("Direct I/O" is a common term of art,
but we also have some more io_XXX GUCs in later patches, so it's hard
to choose...)

DIO is slower, but not so much that it can't run under CI. I suggest to
add an 099 commit to enable the feature during development.

Good idea, will do.

Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment

Oh, so BufFile is palloc'd and contains one of these. BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd. I will think about what to do about that...
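
The sanitizer report follows from how C propagates alignment: a struct that embeds a 4096-byte-aligned member itself requires 4096-byte alignment, which a general-purpose allocator does not promise. The following is a minimal sketch of that effect, assuming a C11 compiler. The names IoAlignedBlock and FakeBufFile are hypothetical stand-ins, not PostgreSQL's actual definitions.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ   8192    /* assumed block size */
#define SKETCH_IO_ALIGN 4096    /* assumed direct I/O alignment */

/* Hypothetical stand-in for a block type that demands I/O alignment. */
typedef struct IoAlignedBlock
{
    alignas(SKETCH_IO_ALIGN) char data[SKETCH_BLCKSZ];
} IoAlignedBlock;

/* Hypothetical stand-in for a struct that embeds such a block. */
typedef struct FakeBufFile
{
    int            nfiles;
    IoAlignedBlock buffer;
} FakeBufFile;

/* The embedded member raises the alignment requirement of the whole struct. */
static_assert(alignof(FakeBufFile) == SKETCH_IO_ALIGN,
              "struct inherits the alignment of its most-aligned member");

/* Alignment that any pointer to a FakeBufFile must satisfy. */
static size_t
fake_buffile_alignment(void)
{
    return alignof(FakeBufFile);
}

/* Returns 1 if ptr is aligned strictly enough to hold a FakeBufFile. */
static int
is_suitably_aligned(const void *ptr)
{
    return ((uintptr_t) ptr % alignof(FakeBufFile)) == 0;
}
```

An address such as the 0x561a4a8e40f8 from the report ends in 0x0f8, so it cannot be 4096-byte aligned, and any member access through it is flagged.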

#4Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#3)
Re: Direct I/O

Hi,

On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:

On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:

io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation

You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings.

In the past more complicated GUCs have not been well received, but it does
seem like a nice way to reduce the amount of redundant stuff.

Perhaps we could use the guc assignment hook to transform the input value into
a bitmask?

(By "better", I mean reducing the number of new GUCs - which is less
important for developer GUCs anyway.)

FWIW, if / once we get to actual AIO, at least some of these would stop being
developer-only GUCs. There are substantial performance benefits in using DIO
with AIO. Buffered IO requires the CPU to copy the data between userspace and
kernelspace. But DIO can use DMA for that, freeing the CPU to do more
useful work. Buffered IO tops out much much earlier than AIO + DIO, and
unfortunately tops out at much lower speeds on server CPUs.

DIO is slower, but not so much that it can't run under CI. I suggest to
add an 099 commit to enable the feature during development.

Good idea, will do.

Might be worth to additionally have a short tap test that does some basic
stuff with DIO and leave that enabled? I think it'd be good to have
check-world exercise DIO on dev machines, to reduce the likelihood of finding
problems only in CI, which is somewhat painful.

Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment

Oh, so BufFile is palloc'd and contains one of these. BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd. I will think about what to do about that...

It might be worth having two different versions of the struct, so we don't
impose unnecessarily high alignment everywhere?

Greetings,

Andres Freund

#5Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#4)
Re: Direct I/O

Hi,

On 2022-11-01 15:54:02 -0700, Andres Freund wrote:

On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:

Oh, so BufFile is palloc'd and contains one of these. BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd. I will think about what to do about that...

It might be worth having two different versions of the struct, so we don't
impose unnecessarily high alignment everywhere?

Although it might actually be worth aligning fully everywhere - there's a
noticeable performance difference for buffered read IO.

I benchmarked this on my workstation and laptop.

I mmap'ed a buffer with 2 MiB alignment, MAP_ANONYMOUS | MAP_HUGETLB, and then
measured performance of reading 8192 bytes into the buffer at different
offsets. Each time I copied 16GiB in total. Within a program invocation I
benchmarked each offset 4 times, threw away the worst measurement, and
averaged the rest. Then used the best of three program invocations.

workstation with dual Xeon Gold 5215:

          turbo on   turbo off
offset    GiB/s      GiB/s
0         18.358     13.528
8         15.361     11.472
9         15.330     11.418
32        17.583     13.097
512       17.707     13.229
513       15.890     11.852
4096      18.176     13.568
8192      18.088     13.566
2MiB      18.658     13.496

laptop with i9-9880H:

          turbo on   turbo off
offset    GiB/s      GiB/s
0         33.589     17.160
8         28.045     14.301
9         27.582     14.318
32        31.797     16.711
512       32.215     16.810
513       28.864     14.932
4096      32.503     17.266
8192      32.871     17.277
2MiB      32.657     17.262

Seems pretty clear that using 4096 byte alignment is worth it.
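
For readers who want to try something similar, the measurement loop described above might look roughly like the sketch below. It is not the benchmark program used for these numbers: it assumes the reads are pread() calls against a small, fully cached temporary file, and the helper names (sketch_make_source_fd, sketch_read_throughput) are made up for illustration. The caller is expected to supply a base pointer with the desired alignment (for example from mmap or posix_memalign) and enough room for offset + 8192 bytes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define SKETCH_READ_SIZE 8192   /* bytes per read, as in the description */

/*
 * Create an anonymous temporary file holding one zeroed block.
 * Returns a file descriptor, or -1 on failure.  The FILE is deliberately
 * left open so that the descriptor stays valid.
 */
static int
sketch_make_source_fd(void)
{
    static const char zeros[SKETCH_READ_SIZE];
    FILE       *f = tmpfile();

    if (f == NULL)
        return -1;
    if (fwrite(zeros, 1, sizeof(zeros), f) != sizeof(zeros) || fflush(f) != 0)
    {
        fclose(f);
        return -1;
    }
    return fileno(f);
}

/*
 * Read SKETCH_READ_SIZE bytes from the start of fd into base + offset,
 * "iterations" times, and return the observed rate in GiB/s.
 * Returns -1.0 if a read fails or is short, 0.0 if the clock did not advance.
 */
static double
sketch_read_throughput(int fd, char *base, size_t offset, long iterations)
{
    struct timespec start;
    struct timespec end;
    double      elapsed;
    long        i;

    assert(base != NULL && iterations > 0);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < iterations; i++)
    {
        if (pread(fd, base + offset, SKETCH_READ_SIZE, 0) != SKETCH_READ_SIZE)
            return -1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed = (double) (end.tv_sec - start.tv_sec) +
        (double) (end.tv_nsec - start.tv_nsec) / 1e9;
    if (elapsed <= 0.0)
        return 0.0;

    return ((double) iterations * SKETCH_READ_SIZE) / elapsed /
        (1024.0 * 1024.0 * 1024.0);
}
```

Calling sketch_read_throughput() once per offset (0, 8, 9, 32, ...) with a large iteration count, and repeating each measurement a few times, would approximate the procedure described in the message. Absolute numbers will depend heavily on the machine.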

Greetings,

Andres Freund

#6Jim Nasby
nasbyj@amazon.com
In reply to: Thomas Munro (#1)
Re: Direct I/O

On 11/1/22 2:36 AM, Thomas Munro wrote:

Hi,

Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the

This is exciting to see! There's two other items to add to the TODO list
before this would be ready for production:

1) work_mem. This is a significant impediment to scaling shared buffers
the way you'd want to.

2) Clock sweep. Specifically, currently the only thing that drives
usage_count is individual backends running the clock hand. On large
systems with 75% of memory going to shared_buffers, that becomes a very
significant problem, especially when the backend running the clock sweep
is doing so in order to perform an operation like a b-tree page split. I
suspect it shouldn't be too hard to deal with this issue by just having
bgwriter or another bgworker proactively ensuring some reasonable number
of buffers with usage_count=0 exist.
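
As background for point 2, a clock sweep can be sketched as below. This is a deliberately simplified, single-threaded illustration with hypothetical names (SketchBuffer, sketch_clock_sweep), not PostgreSQL's buffer manager code. It only shows why the caller that needs a victim buffer pays for every hand movement itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified buffer header. */
typedef struct SketchBuffer
{
    int         usage_count;    /* decremented each time the hand passes */
    int         pinned;         /* nonzero: in use, may not be evicted */
} SketchBuffer;

/*
 * Advance the clock hand until an unpinned buffer with usage_count == 0 is
 * found, decrementing the usage count of every unpinned buffer passed on the
 * way.  Returns the victim's index, or -1 if every buffer is pinned.
 */
static int
sketch_clock_sweep(SketchBuffer *bufs, int nbufs, int *hand)
{
    int         skipped_pinned = 0;

    assert(bufs != NULL && hand != NULL && nbufs > 0);

    for (;;)
    {
        int           idx = *hand;
        SketchBuffer *buf = &bufs[idx];

        *hand = (*hand + 1) % nbufs;

        if (buf->pinned)
        {
            /* A full lap of nothing but pinned buffers: give up. */
            if (++skipped_pinned >= nbufs)
                return -1;
            continue;
        }
        skipped_pinned = 0;

        if (buf->usage_count == 0)
            return idx;
        buf->usage_count--;
    }
}
```

With many buffers at a high usage count, a single call may need several laps before it finds a victim, which is the cost the proposal would move into a background process.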

One other thing to be aware of: overflowing an SLRU becomes a massive
problem if there isn't a filesystem backing the SLRU. Obviously only an
issue if you try and apply DIO to SLRU files.

#7Andres Freund
andres@anarazel.de
In reply to: Jim Nasby (#6)
Re: Direct I/O

Hi,

On 2022-11-04 14:47:31 -0500, Jim Nasby wrote:

On 11/1/22 2:36 AM, Thomas Munro wrote:

Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the

This is exciting to see! There's two other items to add to the TODO list
before this would be ready for production:

1) work_mem. This is a significant impediment to scaling shared buffers the
way you'd want to.

I don't really think that's closely enough related to tackle together. Yes,
it'd be easier to set a large s_b if we had better work_mem management, but
it's a completely distinct problem, and in a lot of cases you could use DIO
without tackling the work_mem issue.

2) Clock sweep. Specifically, currently the only thing that drives
usage_count is individual backends running the clock hand. On large systems
with 75% of memory going to shared_buffers, that becomes a very significant
problem, especially when the backend running the clock sweep is doing so in
order to perform an operation like a b-tree page split. I suspect it
shouldn't be too hard to deal with this issue by just having bgwriter or
another bgworker proactively ensuring some reasonable number of buffers with
usage_count=0 exist.

I agree this isn't great, but I don't think the replacement efficiency is that
big a problem. Replacing the wrong buffers is a bigger issue.

I've run tests with s_b=768GB (IIRC) without it showing up as a major
issue. If you have an extreme replacement rate at such a large s_b you have a
lot of other problems.

I don't want to discourage anybody from tackling the clock replacement issues,
quite the contrary, but AIO+DIO can show significant wins without those
changes. It's already a humongous project...

One other thing to be aware of: overflowing an SLRU becomes a massive
problem if there isn't a filesystem backing the SLRU. Obviously only an
issue if you try and apply DIO to SLRU files.

Which would be a very bad idea for now.... Thomas does have a patch for moving
them into the main buffer pool.

Greetings,

Andres Freund

#8John Naylor
john.naylor@enterprisedb.com
In reply to: Thomas Munro (#1)
Re: Direct I/O

On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Memory alignment patches:

Direct I/O generally needs to be done to/from VM page-aligned
addresses, but only "standard" 4KB pages, even when larger VM pages
are in use (if there is an exotic system where that isn't true, it
won't work). We need to deal with buffers on the stack, the heap and
in shmem. For the stack, see patch 0001. For the heap and shared
memory, see patch 0002, but David Rowley is going to propose that part
separately, as MemoryContext API adjustments are a specialised enough
topic to deserve another thread; here I include a copy as a
dependency. The main direct I/O patch is 0003.

One thing to note: Currently, a request to aset above 8kB must go into a
dedicated block. Not sure if it's a coincidence that that matches the
default PG page size, but if allocating pages on the heap is hot enough,
maybe we should consider raising that limit. Although then, aligned-to-4kB
requests would result in 16kB chunks requested unless a different allocator
was used.
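
The 16kB figure can be reproduced with a little arithmetic, under two assumptions that are simplifications rather than a description of the real allocator: an aligned request over-allocates by the alignment so that an aligned address can be found inside the chunk, and the underlying allocator rounds the result up to the next power of two. Chunk headers are ignored. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round n up to the next power of two (n must be > 0 and representable). */
static size_t
sketch_next_pow2(size_t n)
{
    size_t      p = 1;

    assert(n > 0 && n <= SIZE_MAX / 2 + 1);
    while (p < n)
        p <<= 1;
    return p;
}

/*
 * Assumed model: request size + align bytes, then round up to a power of
 * two.  Header overhead is not modelled.
 */
static size_t
sketch_aligned_chunk_size(size_t size, size_t align)
{
    /* alignment must itself be a power of two */
    assert(align > 0 && (align & (align - 1)) == 0);
    return sketch_next_pow2(size + align);
}
```

For an 8192-byte page aligned to 4096 bytes this gives 8192 + 4096 = 12288, which rounds up to 16384.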

--
John Naylor
EDB: http://www.enterprisedb.com

#9Andres Freund
andres@anarazel.de
In reply to: John Naylor (#8)
Re: Direct I/O

Hi,

On 2022-11-10 14:26:20 +0700, John Naylor wrote:

On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Memory alignment patches:

Direct I/O generally needs to be done to/from VM page-aligned
addresses, but only "standard" 4KB pages, even when larger VM pages
are in use (if there is an exotic system where that isn't true, it
won't work). We need to deal with buffers on the stack, the heap and
in shmem. For the stack, see patch 0001. For the heap and shared
memory, see patch 0002, but David Rowley is going to propose that part
separately, as MemoryContext API adjustments are a specialised enough
topic to deserve another thread; here I include a copy as a
dependency. The main direct I/O patch is 0003.

One thing to note: Currently, a request to aset above 8kB must go into a
dedicated block. Not sure if it's a coincidence that that matches the
default PG page size, but if allocating pages on the heap is hot enough,
maybe we should consider raising that limit. Although then, aligned-to-4kB
requests would result in 16kB chunks requested unless a different allocator
was used.

With one exception, there's only a small number of places that allocate pages
dynamically and we only do it for a small number of buffers. So I don't think
we should worry too much about this for now.

The one exception to this: GetLocalBufferStorage(). But it already batches
memory allocations by increasing sizes, so I think we're good as well.

Greetings,

Andres Freund

#10Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#4)
4 attachment(s)
Re: Direct I/O

On Wed, Nov 2, 2022 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:

On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:

io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation

You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings.

Done as io_direct=data,wal,wal_init. Thanks Justin, this is better.
I resisted the urge to invent a meaning for 'on' and 'off', mainly
because it's not clear what values 'on' should enable and it'd be
strange to have off without on, so for now an empty string means off.
I suppose the meaning of this string could evolve over time: the names
of forks, etc.

Perhaps we could use the guc assignment hook to transform the input value into
a bitmask?

Makes sense. The only tricky question was where to store the GUC. I
went for fd.c for now, but it doesn't seem quite right...
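
To illustrate the idea of turning the comma-separated value into a bitmask, here is a stand-alone sketch. It is not the code from the patch: the flag names and sketch_parse_io_direct() are hypothetical, and a real GUC check or assign hook has a different signature and would use PostgreSQL's own list-splitting helpers. An empty string yields 0, matching "an empty string means off".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits, one per accepted keyword. */
#define SKETCH_IO_DIRECT_DATA     0x01
#define SKETCH_IO_DIRECT_WAL      0x02
#define SKETCH_IO_DIRECT_WAL_INIT 0x04

/*
 * Parse a comma-separated list such as "data, wal_init" into *flags.
 * Returns false (leaving *flags untouched) on an unrecognized name.
 */
static bool
sketch_parse_io_direct(const char *value, int *flags)
{
    int         result = 0;
    const char *p = value;

    assert(value != NULL && flags != NULL);

    while (*p != '\0')
    {
        size_t      len;

        /* skip separators and spaces between items */
        while (*p == ',' || *p == ' ')
            p++;
        if (*p == '\0')
            break;

        /* length of the current item, up to the next separator */
        len = strcspn(p, ", ");
        if (len == 4 && strncmp(p, "data", 4) == 0)
            result |= SKETCH_IO_DIRECT_DATA;
        else if (len == 3 && strncmp(p, "wal", 3) == 0)
            result |= SKETCH_IO_DIRECT_WAL;
        else if (len == 8 && strncmp(p, "wal_init", 8) == 0)
            result |= SKETCH_IO_DIRECT_WAL_INIT;
        else
            return false;
        p += len;
    }

    *flags = result;
    return true;
}
```

Callers would then test individual bits, for example (flags & SKETCH_IO_DIRECT_WAL), instead of consulting three separate booleans.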

DIO is slower, but not so much that it can't run under CI. I suggest to
add an 099 commit to enable the feature during development.

Good idea, will do.

Done. The tests take 2-3x as long depending on the OS.

Might be worth to additionally have a short tap test that does some basic
stuff with DIO and leave that enabled? I think it'd be good to have
check-world exercise DIO on dev machines, to reduce the likelihood of finding
problems only in CI, which is somewhat painful.

Done.

Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment

Oh, so BufFile is palloc'd and contains one of these. BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd. I will think about what to do about that...

It might be worth having two different versions of the struct, so we don't
impose unnecessarily high alignment everywhere?

Done. I now have PGAlignedBlock (unchanged) and PGIOAlignedBlock.
You have to use the latter for SMgr, because I added alignment
assertions there. We might as well use it for any other I/O such as
frontend code too for a chance of a small performance boost as you
showed. For now I have not used PGIOAlignedBlock for BufFile, even
though it would be a great candidate for a potential speedup, only
because I am afraid of adding padding to every BufFile in scenarios
where we allocate many (could be avoided, a subject for separate
research).
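
The padding concern can be made concrete with a small sketch. The two unions below are hypothetical look-alikes written for a C11 compiler, not the definitions from the patch, and the header structs stand in for something like BufFile with a small field in front of the block.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ   8192    /* assumed block size */
#define SKETCH_IO_ALIGN 4096    /* assumed direct I/O alignment */

/* Ordinary variant: aligned only as strictly as double / int64. */
typedef union SketchAlignedBlock
{
    char        data[SKETCH_BLCKSZ];
    double      force_align_d;
    int64_t     force_align_i64;
} SketchAlignedBlock;

/* I/O variant: additionally forced to 4096-byte alignment. */
typedef union SketchIOAlignedBlock
{
    alignas(SKETCH_IO_ALIGN) char data[SKETCH_BLCKSZ];
    double      force_align_d;
    int64_t     force_align_i64;
} SketchIOAlignedBlock;

/* A small header followed by a block, in both flavours. */
typedef struct SketchFilePlain
{
    int                 nfiles;
    SketchAlignedBlock  buffer;
} SketchFilePlain;

typedef struct SketchFileIO
{
    int                  nfiles;
    SketchIOAlignedBlock buffer;
} SketchFileIO;

static_assert(alignof(SketchIOAlignedBlock) == SKETCH_IO_ALIGN,
              "I/O variant carries the stricter alignment");

/* Extra bytes per object caused by switching to the I/O-aligned block. */
static size_t
sketch_padding_cost(void)
{
    return sizeof(SketchFileIO) - sizeof(SketchFilePlain);
}
```

On a typical 64-bit platform the plain struct is a little over 8192 bytes while the I/O-aligned one grows to 12288, so each object carries close to 4kB of padding. That is the per-BufFile cost being weighed here.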

V2 comprises:

0001 -- David's palloc_aligned() patch
https://commitfest.postgresql.org/41/3999/
0002 -- I/O-align almost all buffers used for I/O
0003 -- Add the GUCs
0004 -- Throwaway hack to make cfbot turn the GUCs on

Attachments:

v2-0001-Add-allocator-support-for-larger-allocation-align.patch (text/x-patch; charset=US-ASCII)
From 9af1dcc3ce36ce18e011183d5f2a97cdc07fe396 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 18 Oct 2022 09:47:45 -0700
Subject: [PATCH v2 1/4] Add allocator support for larger allocation alignment
 & use for IO

---
 src/backend/utils/cache/catcache.c       |   5 +-
 src/backend/utils/mmgr/Makefile          |   1 +
 src/backend/utils/mmgr/alignedalloc.c    | 110 ++++++++++++++++++
 src/backend/utils/mmgr/mcxt.c            | 141 +++++++++++++++++++++--
 src/backend/utils/mmgr/meson.build       |   1 +
 src/include/utils/memutils_internal.h    |  13 ++-
 src/include/utils/memutils_memorychunk.h |   2 +-
 src/include/utils/palloc.h               |   3 +
 8 files changed, 263 insertions(+), 13 deletions(-)
 create mode 100644 src/backend/utils/mmgr/alignedalloc.c

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 30ef0ba39c..9e635177c8 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -763,7 +763,6 @@ InitCatCache(int id,
 {
 	CatCache   *cp;
 	MemoryContext oldcxt;
-	size_t		sz;
 	int			i;
 
 	/*
@@ -807,8 +806,8 @@ InitCatCache(int id,
 	 *
 	 * Note: we rely on zeroing to initialize all the dlist headers correctly
 	 */
-	sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE;
-	cp = (CatCache *) CACHELINEALIGN(palloc0(sz));
+	cp = (CatCache *) palloc_aligned(sizeof(CatCache), PG_CACHE_LINE_SIZE,
+									 MCXT_ALLOC_ZERO);
 	cp->cc_bucket = palloc0(nbuckets * sizeof(dlist_head));
 
 	/*
diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index 3b4cfdbd52..dae3432c98 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -13,6 +13,7 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
+	alignedalloc.o \
 	aset.o \
 	dsa.o \
 	freepage.o \
diff --git a/src/backend/utils/mmgr/alignedalloc.c b/src/backend/utils/mmgr/alignedalloc.c
new file mode 100644
index 0000000000..97cb1d2b0d
--- /dev/null
+++ b/src/backend/utils/mmgr/alignedalloc.c
@@ -0,0 +1,110 @@
+/*-------------------------------------------------------------------------
+ *
+ * alignedalloc.c
+ *	  Allocator functions to implement palloc_aligned
+ *
+ * This is not a fully fledged MemoryContext type as there is no means to
+ * create a MemoryContext of this type.  The code here only serves to allow
+ * operations such as pfree() and repalloc() to work correctly on a memory
+ * chunk that was allocated by palloc_aligned().
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/mmgr/alignedalloc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/memdebug.h"
+#include "utils/memutils_memorychunk.h"
+
+void
+AlignedAllocFree(void *pointer)
+{
+	MemoryChunk *chunk = PointerGetMemoryChunk(pointer);
+	void *unaligned;
+
+#ifdef MEMORY_CONTEXT_CHECKING
+	/*
+	 * Test for someone scribbling on unused space in chunk.  We don't have
+	 * the ability to include the context name here, so just mention that it's
+	 * an aligned chunk.
+	 */
+	if (!sentinel_ok(pointer, chunk->requested_size))
+		elog(WARNING, "detected write past %zu-byte aligned chunk end at %p",
+			 MemoryChunkGetValue(chunk), chunk);
+#endif
+
+	Assert(!MemoryChunkIsExternal(chunk));
+
+	/* obtain the original (unaligned) allocated pointer */
+	unaligned = MemoryChunkGetBlock(chunk);
+
+	pfree(unaligned);
+}
+
+void *
+AlignedAllocRealloc(void *pointer, Size size)
+{
+	MemoryChunk	   *redirchunk = PointerGetMemoryChunk(pointer);
+	Size		alignto = MemoryChunkGetValue(redirchunk);
+	void	   *unaligned = MemoryChunkGetBlock(redirchunk);
+	MemoryContext	ctx;
+	Size			old_size;
+	void		   *newptr;
+
+	/* sanity check this is a power of 2 value */
+	Assert((alignto & (alignto - 1)) == 0);
+
+	/*
+	 * Determine the size of the original allocation.  We can't determine this
+	 * exactly as GetMemoryChunkSpace() returns the total space used for the
+	 * allocation, which for contexts like aset includes rounding up to the
+	 * next power of 2.  However, this value is just used to memcpy() the old
+	 * data into the new allocation, so we only need to concern ourselves with
+	 * not reading beyond the end of the original allocation's memory.  The
+	 * drawback here is that we may copy more bytes than we need to, which
+	 * amounts only to wasted effort.
+	 */
+#ifndef MEMORY_CONTEXT_CHECKING
+	old_size = GetMemoryChunkSpace(unaligned) -
+		((char *) pointer - (char *) PointerGetMemoryChunk(unaligned));
+#else
+	old_size = redirchunk->requested_size;
+#endif
+
+	ctx = GetMemoryChunkContext(unaligned);
+	newptr = MemoryContextAllocAligned(ctx, size, alignto, 0);
+
+	/*
+	 * We may memcpy beyond the end of the original allocation request size, so
+	 * we must mark the entire allocation as defined.
+	 */
+	VALGRIND_MAKE_MEM_DEFINED(pointer, old_size);
+	memcpy(newptr, pointer, Min(size, old_size));
+	pfree(unaligned);
+
+	return newptr;
+}
+
+MemoryContext
+AlignedAllocGetChunkContext(void *pointer)
+{
+	MemoryChunk *chunk = PointerGetMemoryChunk(pointer);
+
+	Assert(!MemoryChunkIsExternal(chunk));
+
+	return GetMemoryChunkContext(MemoryChunkGetBlock(chunk));
+}
+
+Size
+AlignedGetChunkSpace(void *pointer)
+{
+	MemoryChunk	   *redirchunk = PointerGetMemoryChunk(pointer);
+	void	   *unaligned = MemoryChunkGetBlock(redirchunk);
+
+	return GetMemoryChunkSpace(unaligned);
+}
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 57bd6690ca..c1e3e88b49 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -30,6 +30,7 @@
 #include "utils/memdebug.h"
 #include "utils/memutils.h"
 #include "utils/memutils_internal.h"
+#include "utils/memutils_memorychunk.h"
 
 
 static void BogusFree(void *pointer);
@@ -84,6 +85,21 @@ static const MemoryContextMethods mcxt_methods[] = {
 	[MCTX_SLAB_ID].check = SlabCheck,
 #endif
 
+	/* alignedalloc.c */
+	[MCTX_ALIGNED_REDIRECT_ID].alloc = NULL,	/* not required */
+	[MCTX_ALIGNED_REDIRECT_ID].free_p = AlignedAllocFree,
+	[MCTX_ALIGNED_REDIRECT_ID].realloc = AlignedAllocRealloc,
+	[MCTX_ALIGNED_REDIRECT_ID].reset = NULL,	/* not required */
+	[MCTX_ALIGNED_REDIRECT_ID].delete_context = NULL,	/* not required */
+	[MCTX_ALIGNED_REDIRECT_ID].get_chunk_context = AlignedAllocGetChunkContext,
+	[MCTX_ALIGNED_REDIRECT_ID].get_chunk_space = AlignedGetChunkSpace,
+	[MCTX_ALIGNED_REDIRECT_ID].is_empty = NULL, /* not required */
+	[MCTX_ALIGNED_REDIRECT_ID].stats = NULL,	/* not required */
+#ifdef MEMORY_CONTEXT_CHECKING
+	[MCTX_ALIGNED_REDIRECT_ID].check = NULL,	/* not required */
+#endif
+
+
 	/*
 	 * Unused (as yet) IDs should have dummy entries here.  This allows us to
 	 * fail cleanly if a bogus pointer is passed to pfree or the like.  It
@@ -110,11 +126,6 @@ static const MemoryContextMethods mcxt_methods[] = {
 	[MCTX_UNUSED4_ID].realloc = BogusRealloc,
 	[MCTX_UNUSED4_ID].get_chunk_context = BogusGetChunkContext,
 	[MCTX_UNUSED4_ID].get_chunk_space = BogusGetChunkSpace,
-
-	[MCTX_UNUSED5_ID].free_p = BogusFree,
-	[MCTX_UNUSED5_ID].realloc = BogusRealloc,
-	[MCTX_UNUSED5_ID].get_chunk_context = BogusGetChunkContext,
-	[MCTX_UNUSED5_ID].get_chunk_space = BogusGetChunkSpace,
 };
 
 /*
@@ -1298,6 +1309,111 @@ palloc_extended(Size size, int flags)
 	return ret;
 }
 
+/*
+ * MemoryContextAllocAligned
+ *		Allocate 'size' bytes of memory in 'context' aligned to 'alignto'
+ *		bytes.
+ *
+ * 'alignto' must be a power of 2.
+ * 'flags' may be 0 or set the same as MemoryContextAllocExtended().
+ */
+void *
+MemoryContextAllocAligned(MemoryContext context,
+						  Size size, Size alignto, int flags)
+{
+	MemoryChunk *alignedchunk;
+	Size		alloc_size;
+	void	   *unaligned;
+	void	   *aligned;
+
+	/* wouldn't make much sense to waste that much space */
+	Assert(alignto < (128 * 1024 * 1024));
+
+	/* ensure alignto is a power of 2 */
+	Assert((alignto & (alignto - 1)) == 0);
+
+	/*
+	 * If the alignment requirements are less than what we already guarantee
+	 * then just use the standard allocation function.
+	 */
+	if (unlikely(alignto <= MAXIMUM_ALIGNOF))
+		return MemoryContextAllocExtended(context, size, flags);
+
+	/*
+	 * We implement aligned pointers by simply allocating enough memory for
+	 * the requested size plus the alignment and an additional "redirection"
+	 * MemoryChunk.  This additional MemoryChunk is required for operations
+	 * such as pfree when used on the pointer returned by this function.  We
+	 * use this redirection MemoryChunk in order to find the pointer to the
+	 * memory that was returned by the MemoryContextAllocExtended call below.
+	 * We do that by "borrowing" the block offset field and instead of using
+	 * that to find the offset into the owning block, we use it to find the
+	 * original allocated address.
+	 *
+	 * Here we must allocate enough extra memory so that we can still align
+	 * the pointer returned by MemoryContextAllocExtended and also have enough
+	 * space for the redirection MemoryChunk.  Since allocations will already
+	 * be at least aligned by MAXIMUM_ALIGNOF, we can subtract that amount
+	 * from the allocation size to save a little memory.
+	 */
+	alloc_size = size + alignto + sizeof(MemoryChunk) - MAXIMUM_ALIGNOF;
+
+#ifdef MEMORY_CONTEXT_CHECKING
+	/* ensure there's space for a sentinel byte */
+	alloc_size += 1;
+#endif
+
+	/* perform the actual allocation */
+	unaligned = MemoryContextAllocExtended(context, alloc_size, flags);
+
+	/* set the aligned pointer */
+	aligned = (void *) TYPEALIGN(alignto, (char *) unaligned +
+								 sizeof(MemoryChunk));
+
+	alignedchunk = PointerGetMemoryChunk(aligned);
+
+	/*
+	 * We set the redirect MemoryChunk so that the block offset calculation is
+	 * used to point back to the 'unaligned' allocated chunk.  This allows us
+	 * to use MemoryChunkGetBlock() to find the unaligned chunk when we need
+	 * to perform operations such as pfree() and repalloc().
+	 *
+	 * We store 'alignto' in the MemoryChunk's 'value' so that we know what
+	 * the alignment was set to should we ever be asked to realloc this
+	 * pointer.
+	 */
+	MemoryChunkSetHdrMask(alignedchunk, unaligned, alignto,
+						  MCTX_ALIGNED_REDIRECT_ID);
+
+	/* double check we produced a correctly aligned pointer */
+	Assert((char *) TYPEALIGN(alignto, aligned) == aligned);
+
+#ifdef MEMORY_CONTEXT_CHECKING
+	alignedchunk->requested_size = size;
+	/* set mark to catch clobber of "unused" space */
+	set_sentinel(aligned, size);
+#endif
+
+	/* Mark the bytes before the redirection header as noaccess */
+	VALGRIND_MAKE_MEM_NOACCESS(unaligned,
+							   (char *) alignedchunk - (char *) unaligned);
+	return aligned;
+}
+
+/*
+ * palloc_aligned
+ *		Allocate 'size' bytes returning a pointer that's aligned to the
+ *		'alignto' boundary.
+ *
+ * 'alignto' must be a power of 2.
+ * 'flags' may be 0 or set the same as MemoryContextAllocExtended().
+ */
+void *
+palloc_aligned(Size size, Size alignto, int flags)
+{
+	return MemoryContextAllocAligned(CurrentMemoryContext, size, alignto, flags);
+}
+
 /*
  * pfree
  *		Release an allocated chunk.
@@ -1306,11 +1422,16 @@ void
 pfree(void *pointer)
 {
 #ifdef USE_VALGRIND
+	MemoryContextMethodID method = GetMemoryChunkMethodID(pointer);
 	MemoryContext context = GetMemoryChunkContext(pointer);
 #endif
 
 	MCXT_METHOD(pointer, free_p) (pointer);
-	VALGRIND_MEMPOOL_FREE(context, pointer);
+
+#ifdef USE_VALGRIND
+	if (method != MCTX_ALIGNED_REDIRECT_ID)
+		VALGRIND_MEMPOOL_FREE(context, pointer);
+#endif
 }
 
 /*
@@ -1320,6 +1441,9 @@ pfree(void *pointer)
 void *
 repalloc(void *pointer, Size size)
 {
+#ifdef USE_VALGRIND
+	MemoryContextMethodID method = GetMemoryChunkMethodID(pointer);
+#endif
 #if defined(USE_ASSERT_CHECKING) || defined(USE_VALGRIND)
 	MemoryContext context = GetMemoryChunkContext(pointer);
 #endif
@@ -1346,7 +1470,10 @@ repalloc(void *pointer, Size size)
 						   size, cxt->name)));
 	}
 
-	VALGRIND_MEMPOOL_CHANGE(context, pointer, ret, size);
+#ifdef USE_VALGRIND
+	if (method != MCTX_ALIGNED_REDIRECT_ID)
+		VALGRIND_MEMPOOL_CHANGE(context, pointer, ret, size);
+#endif
 
 	return ret;
 }
diff --git a/src/backend/utils/mmgr/meson.build b/src/backend/utils/mmgr/meson.build
index 641bb181ba..7cf4d6dcc8 100644
--- a/src/backend/utils/mmgr/meson.build
+++ b/src/backend/utils/mmgr/meson.build
@@ -1,4 +1,5 @@
 backend_sources += files(
+  'alignedalloc.c',
   'aset.c',
   'dsa.c',
   'freepage.c',
diff --git a/src/include/utils/memutils_internal.h b/src/include/utils/memutils_internal.h
index bc2cbdd506..450bcba3ed 100644
--- a/src/include/utils/memutils_internal.h
+++ b/src/include/utils/memutils_internal.h
@@ -70,6 +70,15 @@ extern void SlabStats(MemoryContext context,
 extern void SlabCheck(MemoryContext context);
 #endif
 
+/*
+ * These functions support the implementation of palloc_aligned() and are not
+ * part of a fully-fledged MemoryContext type.
+ */
+extern void AlignedAllocFree(void *pointer);
+extern void *AlignedAllocRealloc(void *pointer, Size size);
+extern MemoryContext AlignedAllocGetChunkContext(void *pointer);
+extern Size AlignedGetChunkSpace(void *pointer);
+
 /*
  * MemoryContextMethodID
  *		A unique identifier for each MemoryContext implementation which
@@ -92,8 +101,8 @@ typedef enum MemoryContextMethodID
 	MCTX_ASET_ID,
 	MCTX_GENERATION_ID,
 	MCTX_SLAB_ID,
-	MCTX_UNUSED4_ID,			/* available */
-	MCTX_UNUSED5_ID				/* 111 occurs in wipe_mem'd memory */
+	MCTX_ALIGNED_REDIRECT_ID,
+	MCTX_UNUSED4_ID				/* 111 occurs in wipe_mem'd memory */
 } MemoryContextMethodID;
 
 /*
diff --git a/src/include/utils/memutils_memorychunk.h b/src/include/utils/memutils_memorychunk.h
index 2eefc138e3..38702efc58 100644
--- a/src/include/utils/memutils_memorychunk.h
+++ b/src/include/utils/memutils_memorychunk.h
@@ -156,7 +156,7 @@ MemoryChunkSetHdrMask(MemoryChunk *chunk, void *block,
 {
 	Size		blockoffset = (char *) chunk - (char *) block;
 
-	Assert((char *) chunk > (char *) block);
+	Assert((char *) chunk >= (char *) block);
 	Assert(blockoffset <= MEMORYCHUNK_MAX_BLOCKOFFSET);
 	Assert(value <= MEMORYCHUNK_MAX_VALUE);
 	Assert((int) methodid <= MEMORY_CONTEXT_METHODID_MASK);
diff --git a/src/include/utils/palloc.h b/src/include/utils/palloc.h
index 72d4e70dc6..b1ac63b2ee 100644
--- a/src/include/utils/palloc.h
+++ b/src/include/utils/palloc.h
@@ -73,10 +73,13 @@ extern void *MemoryContextAllocZero(MemoryContext context, Size size);
 extern void *MemoryContextAllocZeroAligned(MemoryContext context, Size size);
 extern void *MemoryContextAllocExtended(MemoryContext context,
 										Size size, int flags);
+extern void *MemoryContextAllocAligned(MemoryContext context,
+									   Size size, Size alignto, int flags);
 
 extern void *palloc(Size size);
 extern void *palloc0(Size size);
 extern void *palloc_extended(Size size, int flags);
+extern void *palloc_aligned(Size size, Size alignto, int flags);
 extern pg_nodiscard void *repalloc(void *pointer, Size size);
 extern pg_nodiscard void *repalloc_extended(void *pointer,
 											Size size, int flags);
-- 
2.35.1
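
For readers following the 0001 patch above: the core idea in MemoryContextAllocAligned() is to over-allocate, round the pointer up to the requested boundary, and leave a small header just before the aligned address that leads back to the original allocation, so that a later free can find it. Below is a minimal standalone sketch of that idea in plain C on top of malloc(). It is not the patch's code: SketchRedirect, sketch_alloc_aligned() and sketch_free_aligned() are hypothetical names, and the header here stores a raw pointer, whereas the patch borrows the block-offset field of a MemoryChunk.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header stored immediately before the aligned pointer. */
typedef struct SketchRedirect
{
	void	   *unaligned;		/* pointer originally returned by malloc() */
	size_t		alignto;		/* alignment requested, kept for a realloc */
} SketchRedirect;

/* Round 'ptr' up to the next multiple of 'alignto' (a power of 2). */
static uintptr_t
sketch_align_up(uintptr_t ptr, size_t alignto)
{
	return (ptr + (alignto - 1)) & ~((uintptr_t) alignto - 1);
}

/* Allocate 'size' bytes at an address that is a multiple of 'alignto'. */
static void *
sketch_alloc_aligned(size_t size, size_t alignto)
{
	void	   *unaligned;
	uintptr_t	aligned;
	SketchRedirect *hdr;

	/* alignto must be a power of 2, and large enough to align the header */
	assert(alignto != 0 && (alignto & (alignto - 1)) == 0);
	assert(alignto >= _Alignof(SketchRedirect));

	/*
	 * Reserve room for the payload, the worst-case alignment padding and
	 * the header.  (The patch trims this by MAXIMUM_ALIGNOF; this sketch
	 * does not bother.)
	 */
	unaligned = malloc(size + alignto + sizeof(SketchRedirect));
	if (unaligned == NULL)
		return NULL;

	/* Skip past the header first, then round up to the boundary. */
	aligned = sketch_align_up((uintptr_t) unaligned + sizeof(SketchRedirect),
							  alignto);

	/* The header sits directly in front of the aligned address. */
	hdr = (SketchRedirect *) (aligned - sizeof(SketchRedirect));
	hdr->unaligned = unaligned;
	hdr->alignto = alignto;

	return (void *) aligned;
}

/* Free a pointer obtained from sketch_alloc_aligned(). */
static void
sketch_free_aligned(void *pointer)
{
	SketchRedirect *hdr;

	if (pointer == NULL)
		return;

	/* Step back over the header to recover the original allocation. */
	hdr = (SketchRedirect *) ((char *) pointer - sizeof(SketchRedirect));
	free(hdr->unaligned);
}
```

A caller would pair sketch_alloc_aligned(8192, 4096) with sketch_free_aligned(). The patch goes further by registering the header under MCTX_ALIGNED_REDIRECT_ID, so that plain pfree() and repalloc() work on the aligned pointer.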

v2-0002-Introduce-PG_IO_ALIGN_SIZE-and-align-all-I-O-buff.patch (text/x-patch; charset=US-ASCII)
From caa6cbeb3b3f86c48c90513ee184aca500b1f703 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:25:59 +1300
Subject: [PATCH v2 2/4] Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.

In order to be allowed to use O_DIRECT in a later commit, we need to
align buffers to the virtual memory page size.  O_DIRECT would either
fail to work or fail to work efficiently without that on various
platforms.  Even without O_DIRECT, aligning on memory pages improves
traditional buffered I/O performance.

The alignment size is set to 4096, which is enough for currently known
systems.  There is no standard governing the requirements for O_DIRECT, so
we might have to reconsider this approach, or it might fail to work on
some exotic system.  For now this simplistic approach works, and the
value can be changed at compile time.

Adjust all call sites that allocate heap memory for file I/O to use the
new palloc_aligned() or MemoryContextAllocAligned() functions.  For
stack-allocated buffers, introduce PGIOAlignedBlock to respect
PG_IO_ALIGN_SIZE, if possible with this compiler.  Also align the main
buffer pool in shared memory.

If arbitrary alignment of stack objects is not possible with this
compiler, then completely disable the use of O_DIRECT by setting
PG_O_DIRECT to 0.  (This avoids the need to consider systems that have
O_DIRECT but lack a compiler extension that can align stack objects the
way we want.  Supporting them would be possible, but we don't currently
know of any such system, so it's easier to pretend there is no O_DIRECT
support instead: systems without O_DIRECT are an existing and tested
class.)

Add assertions that all buffers passed into smgrread(), smgrwrite() and
smgrextend() are correctly aligned, unless PG_O_DIRECT is 0.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
---
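
As a reading aid for the commit message above, here is a rough sketch of the two pieces it describes: a block-sized buffer type that carries I/O alignment even when declared on the stack, and the kind of alignment check the new assertions perform. This is an illustration under stated assumptions, not the definition the patch adds to c.h: SketchIOAlignedBlock, SKETCH_IO_ALIGN_SIZE, SKETCH_BLCKSZ and sketch_is_io_aligned() are hypothetical names, the block size of 8192 is assumed, and C11 alignas stands in for whatever compiler extension the patch relies on.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values: 4096 comes from the commit message, 8192 is a guess. */
#define SKETCH_IO_ALIGN_SIZE 4096
#define SKETCH_BLCKSZ 8192

/*
 * Hypothetical stand-in for an I/O-aligned block.  The alignas on 'data'
 * raises the alignment of the whole union, so a variable of this type is
 * placed on a SKETCH_IO_ALIGN_SIZE boundary wherever it is declared.
 */
typedef union SketchIOAlignedBlock
{
	alignas(SKETCH_IO_ALIGN_SIZE) char data[SKETCH_BLCKSZ];
	double		force_align_d;
	int64_t		force_align_i64;
} SketchIOAlignedBlock;

/*
 * Return true if 'buffer' starts on an I/O alignment boundary.  Relies on
 * SKETCH_IO_ALIGN_SIZE being a power of 2.
 */
static bool
sketch_is_io_aligned(const void *buffer)
{
	return ((uintptr_t) buffer & (SKETCH_IO_ALIGN_SIZE - 1)) == 0;
}
```

Declaring "SketchIOAlignedBlock buf;" in a function and passing buf.data to a read or write routine would then satisfy a check like sketch_is_io_aligned(), which is the property the assertions in md.c test for.
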
 contrib/bloom/blinsert.c                  |  2 +-
 contrib/pg_prewarm/pg_prewarm.c           |  2 +-
 src/backend/access/gist/gistbuild.c       |  9 +++---
 src/backend/access/hash/hashpage.c        |  2 +-
 src/backend/access/heap/rewriteheap.c     |  2 +-
 src/backend/access/heap/visibilitymap.c   |  2 +-
 src/backend/access/nbtree/nbtree.c        |  2 +-
 src/backend/access/nbtree/nbtsort.c       |  8 ++++--
 src/backend/access/spgist/spginsert.c     |  2 +-
 src/backend/access/transam/generic_xlog.c | 13 ++++++---
 src/backend/access/transam/xlog.c         |  9 +++---
 src/backend/catalog/storage.c             |  2 +-
 src/backend/storage/buffer/buf_init.c     | 10 +++++--
 src/backend/storage/buffer/bufmgr.c       |  2 +-
 src/backend/storage/buffer/localbuf.c     |  7 +++--
 src/backend/storage/file/buffile.c        |  6 ++++
 src/backend/storage/freespace/freespace.c |  2 +-
 src/backend/storage/page/bufpage.c        |  5 +++-
 src/backend/storage/smgr/md.c             | 15 +++++++++-
 src/backend/utils/sort/logtape.c          |  2 +-
 src/bin/pg_checksums/pg_checksums.c       |  2 +-
 src/bin/pg_rewind/local_source.c          |  4 +--
 src/bin/pg_upgrade/file.c                 |  4 +--
 src/common/file_utils.c                   |  2 +-
 src/include/c.h                           | 34 +++++++++++++++++------
 src/include/pg_config_manual.h            |  7 +++++
 src/include/storage/fd.h                  |  5 ++--
 src/tools/pgindent/typedefs.list          |  1 +
 28 files changed, 114 insertions(+), 49 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index dd26d6ac29..53cc617a66 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -166,7 +166,7 @@ blbuildempty(Relation index)
 	Page		metapage;
 
 	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
+	metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 	BloomFillMetapage(index, metapage);
 
 	/*
diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index caff5c4a80..f50aa69eb2 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -36,7 +36,7 @@ typedef enum
 	PREWARM_BUFFER
 } PrewarmType;
 
-static PGAlignedBlock blockbuffer;
+static PGIOAlignedBlock blockbuffer;
 
 /*
  * pg_prewarm(regclass, mode text, fork text,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index fb0f466708..d3d7d836e9 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -415,7 +415,7 @@ gist_indexsortbuild(GISTBuildState *state)
 	 * Write an empty page as a placeholder for the root page. It will be
 	 * replaced with the real root page at the end.
 	 */
-	page = palloc0(BLCKSZ);
+	page = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
 	smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
 			   page, true);
 	state->pages_allocated++;
@@ -509,7 +509,8 @@ gist_indexsortbuild_levelstate_add(GISTBuildState *state,
 			levelstate->current_page++;
 
 		if (levelstate->pages[levelstate->current_page] == NULL)
-			levelstate->pages[levelstate->current_page] = palloc(BLCKSZ);
+			levelstate->pages[levelstate->current_page] =
+				palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 
 		newPage = levelstate->pages[levelstate->current_page];
 		gistinitpage(newPage, old_page_flags);
@@ -579,7 +580,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 
 		/* Create page and copy data */
 		data = (char *) (dist->list);
-		target = palloc0(BLCKSZ);
+		target = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
 		gistinitpage(target, isleaf ? F_LEAF : 0);
 		for (int i = 0; i < dist->block.num; i++)
 		{
@@ -630,7 +631,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		if (parent == NULL)
 		{
 			parent = palloc0(sizeof(GistSortedBuildLevelState));
-			parent->pages[0] = (Page) palloc(BLCKSZ);
+			parent->pages[0] = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 			parent->parent = NULL;
 			gistinitpage(parent->pages[0], 0);
 
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 55b2929ad5..147af95e92 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -992,7 +992,7 @@ static bool
 _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 {
 	BlockNumber lastblock;
-	PGAlignedBlock zerobuf;
+	PGIOAlignedBlock zerobuf;
 	Page		page;
 	HashPageOpaque ovflopaque;
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2fe9e48e50..23d966940e 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -255,7 +255,7 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 
 	state->rs_old_rel = old_heap;
 	state->rs_new_rel = new_heap;
-	state->rs_buffer = (Page) palloc(BLCKSZ);
+	state->rs_buffer = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 	/* new_heap needn't be empty, just locked */
 	state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
 	state->rs_buffer_valid = false;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4ed70275e2..3bd65b275b 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -620,7 +620,7 @@ static void
 vm_extend(Relation rel, BlockNumber vm_nblocks)
 {
 	BlockNumber vm_nblocks_now;
-	PGAlignedBlock pg;
+	PGIOAlignedBlock pg;
 	SMgrRelation reln;
 
 	PageInit((Page) pg.data, BLCKSZ, 0);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b52eca8f38..e8ac7390ae 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -153,7 +153,7 @@ btbuildempty(Relation index)
 	Page		metapage;
 
 	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
+	metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
 
 	/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 501e011ce1..5e3c461f6f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -619,7 +619,7 @@ _bt_blnewpage(uint32 level)
 	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc(BLCKSZ);
+	page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -660,7 +660,9 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	while (blkno > wstate->btws_pages_written)
 	{
 		if (!wstate->btws_zeropage)
-			wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
+			wstate->btws_zeropage = (Page) palloc_aligned(BLCKSZ,
+														  PG_IO_ALIGN_SIZE,
+														  MCXT_ALLOC_ZERO);
 		/* don't set checksum for all-zero page */
 		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
 				   wstate->btws_pages_written++,
@@ -1170,7 +1172,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
 	 */
-	metapage = (Page) palloc(BLCKSZ);
+	metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 	_bt_initmetapage(metapage, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index c6821b5952..6f14b41329 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -158,7 +158,7 @@ spgbuildempty(Relation index)
 	Page		page;
 
 	/* Construct metapage. */
-	page = (Page) palloc(BLCKSZ);
+	page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 	SpGistInitMetapage(page);
 
 	/*
diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 6db9a1fca1..458e270d55 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -58,14 +58,17 @@ typedef struct
 	char		delta[MAX_DELTA_SIZE];	/* delta between page images */
 } PageData;
 
-/* State of generic xlog record construction */
+/*
+ * State of generic xlog record construction.  Must be allocated at an I/O
+ * aligned address.
+ */
 struct GenericXLogState
 {
+	/* Page images (properly aligned, must be first) */
+	PGIOAlignedBlock images[MAX_GENERIC_XLOG_PAGES];
 	/* Info about each page, see above */
 	PageData	pages[MAX_GENERIC_XLOG_PAGES];
 	bool		isLogged;
-	/* Page images (properly aligned) */
-	PGAlignedBlock images[MAX_GENERIC_XLOG_PAGES];
 };
 
 static void writeFragment(PageData *pageData, OffsetNumber offset,
@@ -269,7 +272,9 @@ GenericXLogStart(Relation relation)
 	GenericXLogState *state;
 	int			i;
 
-	state = (GenericXLogState *) palloc(sizeof(GenericXLogState));
+	state = (GenericXLogState *) palloc_aligned(sizeof(GenericXLogState),
+												PG_IO_ALIGN_SIZE,
+												0);
 	state->isLogged = RelationNeedsWAL(relation);
 
 	for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a31fbbff78..a75c6813a4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4511,7 +4511,7 @@ XLOGShmemSize(void)
 	/* xlblocks array */
 	size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
 	/* extra alignment padding for XLOG I/O buffers */
-	size = add_size(size, XLOG_BLCKSZ);
+	size = add_size(size, Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE));
 	/* and the buffers themselves */
 	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
 
@@ -4608,10 +4608,11 @@ XLOGShmemInit(void)
 
 	/*
 	 * Align the start of the page buffers to a full xlog block size boundary.
-	 * This simplifies some calculations in XLOG insertion. It is also
-	 * required for O_DIRECT.
+	 * This simplifies some calculations in XLOG insertion.  We also need I/O
+	 * alignment for O_DIRECT, but that's also a power of two and usually
+	 * smaller.  Take the larger of the two alignment requirements.
 	 */
-	allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
+	allocptr = (char *) TYPEALIGN(Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE), allocptr);
 	XLogCtl->pages = allocptr;
 	memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
 
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d708af19ed..0c5ac1f94b 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -451,7 +451,7 @@ void
 RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 					ForkNumber forkNum, char relpersistence)
 {
-	PGAlignedBlock buf;
+	PGIOAlignedBlock buf;
 	Page		page;
 	bool		use_wal;
 	bool		copying_initfork;
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 6b6264854e..76a30d44b7 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -78,9 +78,12 @@ InitBufferPool(void)
 						NBuffers * sizeof(BufferDescPadded),
 						&foundDescs);
 
+	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
-		ShmemInitStruct("Buffer Blocks",
-						NBuffers * (Size) BLCKSZ, &foundBufs);
+		TYPEALIGN(PG_IO_ALIGN_SIZE,
+				  ShmemInitStruct("Buffer Blocks",
+								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+								  &foundBufs));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
@@ -163,7 +166,8 @@ BufferShmemSize(void)
 	/* to allow aligning buffer descriptors */
 	size = add_size(size, PG_CACHE_LINE_SIZE);
 
-	/* size of data pages */
+	/* size of data pages, plus alignment padding */
+	size = add_size(size, PG_IO_ALIGN_SIZE);
 	size = add_size(size, mul_size(NBuffers, BLCKSZ));
 
 	/* size of stuff controlled by freelist.c */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 73d30bf619..aba07e94c9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3717,7 +3717,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	bool		use_wal;
 	BlockNumber nblocks;
 	BlockNumber blkno;
-	PGAlignedBlock buf;
+	PGIOAlignedBlock buf;
 	BufferAccessStrategy bstrategy_src;
 	BufferAccessStrategy bstrategy_dst;
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..735f7c6018 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -546,8 +546,11 @@ GetLocalBufferStorage(void)
 		/* And don't overflow MaxAllocSize, either */
 		num_bufs = Min(num_bufs, MaxAllocSize / BLCKSZ);
 
-		cur_block = (char *) MemoryContextAlloc(LocalBufferContext,
-												num_bufs * BLCKSZ);
+		/* Buffers should be I/O aligned. */
+		cur_block = (char *)
+			TYPEALIGN(PG_IO_ALIGN_SIZE,
+					  MemoryContextAlloc(LocalBufferContext,
+										 num_bufs * BLCKSZ + PG_IO_ALIGN_SIZE));
 		next_buf_in_block = 0;
 		num_bufs_in_block = num_bufs;
 	}
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index b0b4eeb3bd..2261c3ebe3 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -95,6 +95,12 @@ struct BufFile
 	off_t		curOffset;		/* offset part of current pos */
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
+
+	/*
+	 * XXX Should ideally use PGIOAlignedBlock, but might need a way to avoid
+	 * wasting per-file alignment padding when some users create many
+	 * files.
+	 */
 	PGAlignedBlock buffer;
 };
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index a6b0533103..7230d538fd 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -608,7 +608,7 @@ static void
 fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 {
 	BlockNumber fsm_nblocks_now;
-	PGAlignedBlock pg;
+	PGIOAlignedBlock pg;
 	SMgrRelation reln;
 
 	PageInit((Page) pg.data, BLCKSZ, 0);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 8b617c7e79..0728ce30c0 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1522,7 +1522,10 @@ PageSetChecksumCopy(Page page, BlockNumber blkno)
 	 * and second to avoid wasting space in processes that never call this.
 	 */
 	if (pageCopy == NULL)
-		pageCopy = MemoryContextAlloc(TopMemoryContext, BLCKSZ);
+		pageCopy = MemoryContextAllocAligned(TopMemoryContext,
+											 BLCKSZ,
+											 PG_IO_ALIGN_SIZE,
+											 0);
 
 	memcpy(pageCopy, (char *) page, BLCKSZ);
 	((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 14b6fa0fd9..3e034afdf1 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -453,6 +453,10 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	int			nbytes;
 	MdfdVec    *v;
 
+	/* If this build supports direct I/O, the buffer must be I/O aligned. */
+	if (PG_O_DIRECT != 0)
+		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert(blocknum >= mdnblocks(reln, forknum));
@@ -675,6 +679,10 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	int			nbytes;
 	MdfdVec    *v;
 
+	/* If this build supports direct I/O, the buffer must be I/O aligned. */
+	if (PG_O_DIRECT != 0)
+		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
 	TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
 										reln->smgr_rlocator.locator.spcOid,
 										reln->smgr_rlocator.locator.dbOid,
@@ -740,6 +748,10 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	int			nbytes;
 	MdfdVec    *v;
 
+	/* If this build supports direct I/O, the buffer must be I/O aligned. */
+	if (PG_O_DIRECT != 0)
+		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert(blocknum < mdnblocks(reln, forknum));
@@ -1294,7 +1306,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 			 */
 			if (nblocks < ((BlockNumber) RELSEG_SIZE))
 			{
-				char	   *zerobuf = palloc0(BLCKSZ);
+				char	   *zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE,
+													 MCXT_ALLOC_ZERO);
 
 				mdextend(reln, forknum,
 						 nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index c384f98e13..6ba5030a5f 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -252,7 +252,7 @@ ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
 	 */
 	while (blocknum > lts->nBlocksWritten)
 	{
-		PGAlignedBlock zerobuf;
+		PGIOAlignedBlock zerobuf;
 
 		MemSet(zerobuf.data, 0, sizeof(zerobuf));
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 7f3d5fc040..be6bae2b78 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -183,7 +183,7 @@ skipfile(const char *fn)
 static void
 scan_file(const char *fn, int segmentno)
 {
-	PGAlignedBlock buf;
+	PGIOAlignedBlock buf;
 	PageHeader	header = (PageHeader) buf.data;
 	int			f;
 	BlockNumber blockno;
diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c
index 2e50485c39..83b37a1e91 100644
--- a/src/bin/pg_rewind/local_source.c
+++ b/src/bin/pg_rewind/local_source.c
@@ -77,7 +77,7 @@ static void
 local_queue_fetch_file(rewind_source *source, const char *path, size_t len)
 {
 	const char *datadir = ((local_source *) source)->datadir;
-	PGAlignedBlock buf;
+	PGIOAlignedBlock buf;
 	char		srcpath[MAXPGPATH];
 	int			srcfd;
 	size_t		written_len;
@@ -129,7 +129,7 @@ local_queue_fetch_range(rewind_source *source, const char *path, off_t off,
 						size_t len)
 {
 	const char *datadir = ((local_source *) source)->datadir;
-	PGAlignedBlock buf;
+	PGIOAlignedBlock buf;
 	char		srcpath[MAXPGPATH];
 	int			srcfd;
 	off_t		begin = off;
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 079fbda838..b5809236f6 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -178,8 +178,8 @@ rewriteVisibilityMap(const char *fromfile, const char *tofile,
 {
 	int			src_fd;
 	int			dst_fd;
-	PGAlignedBlock buffer;
-	PGAlignedBlock new_vmbuf;
+	PGIOAlignedBlock buffer;
+	PGIOAlignedBlock new_vmbuf;
 	ssize_t		totalBytesRead = 0;
 	ssize_t		src_filesize;
 	int			rewriteVmBytesPerPage;
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index d8507d88a5..83ef0609a2 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -539,7 +539,7 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 ssize_t
 pg_pwrite_zeros(int fd, size_t size)
 {
-	PGAlignedBlock zbuffer;		/* worth BLCKSZ */
+	PGIOAlignedBlock zbuffer;		/* worth BLCKSZ */
 	size_t		zbuffer_sz;
 	struct iovec iov[PG_IOV_MAX];
 	int			blocks;
diff --git a/src/include/c.h b/src/include/c.h
index 98cdd285dd..9df92fb40e 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -1068,14 +1068,11 @@ extern void ExceptionalCondition(const char *conditionName,
 
 /*
  * Use this, not "char buf[BLCKSZ]", to declare a field or local variable
- * holding a page buffer, if that page might be accessed as a page and not
- * just a string of bytes.  Otherwise the variable might be under-aligned,
- * causing problems on alignment-picky hardware.  (In some places, we use
- * this to declare buffers even though we only pass them to read() and
- * write(), because copying to/from aligned buffers is usually faster than
- * using unaligned buffers.)  We include both "double" and "int64" in the
- * union to ensure that the compiler knows the value must be MAXALIGN'ed
- * (cf. configure's computation of MAXIMUM_ALIGNOF).
+ * holding a page buffer, if that page might be accessed as a page.  Otherwise
+ * the variable might be under-aligned, causing problems on alignment-picky
+ * hardware.  We include both "double" and "int64" in the union to ensure that
+ * the compiler knows the value must be MAXALIGN'ed (cf.  configure's
+ * computation of MAXIMUM_ALIGNOF).
  */
 typedef union PGAlignedBlock
 {
@@ -1084,9 +1081,30 @@ typedef union PGAlignedBlock
 	int64		force_align_i64;
 } PGAlignedBlock;
 
+/*
+ * Use this to declare a field or local variable holding a page buffer, if that
+ * page might be accessed as a page or passed to an SMgr I/O function.  If
+ * allocating using the MemoryContext API, the aligned allocation functions
+ * should be used with PG_IO_ALIGN_SIZE.  This alignment may be more efficient
+ * for I/O in general, but may be strictly required on some platforms when
+ * using direct I/O.
+ */
+typedef union PGIOAlignedBlock
+{
+#ifdef pg_attribute_aligned
+	pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#endif
+	char		data[BLCKSZ];
+	double		force_align_d;
+	int64		force_align_i64;
+} PGIOAlignedBlock;
+
 /* Same, but for an XLOG_BLCKSZ-sized buffer */
 typedef union PGAlignedXLogBlock
 {
+#ifdef pg_attribute_aligned
+	pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#endif
 	char		data[XLOG_BLCKSZ];
 	double		force_align_d;
 	int64		force_align_i64;
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index f2a106f983..323a4cfb4f 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -227,6 +227,13 @@
  */
 #define PG_CACHE_LINE_SIZE		128
 
+/*
+ * Assumed memory alignment requirement for direct I/O.  On currently known
+ * systems this size applies, even for memory that is backed by larger virtual
+ * memory pages.
+ */
+#define PG_IO_ALIGN_SIZE		4096
+
 /*
  *------------------------------------------------------------------------
  * The following symbols are for enabling debugging code, not for
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7144fc9f60..85ef12c440 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -82,9 +82,10 @@ extern PGDLLIMPORT int max_safe_fds;
  * to the appropriate Windows flag in src/port/open.c.  We simulate it with
  * fcntl(F_NOCACHE) on macOS inside fd.c's open() wrapper.  We use the name
  * PG_O_DIRECT rather than defining O_DIRECT in that case (probably not a good
- * idea on a Unix).
+ * idea on a Unix).  We can only use it if the compiler will correctly align
+ * PGIOAlignedBlock for us, though.
  */
-#if defined(O_DIRECT)
+#if defined(O_DIRECT) && defined(pg_attribute_aligned)
 #define		PG_O_DIRECT O_DIRECT
 #elif defined(F_NOCACHE)
 #define		PG_O_DIRECT 0x80000000
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 60c71d05fe..9a77664154 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1687,6 +1687,7 @@ PGEventResultDestroy
 PGFInfoFunction
 PGFileType
 PGFunction
+PGIOAlignedBlock
 PGLZ_HistEntry
 PGLZ_Strategy
 PGMessageField
-- 
2.35.1
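
For anyone following the alignment changes in the patch above, here is a minimal, self-contained sketch of the two ideas it relies on: a union whose data member is forced to an I/O-friendly alignment, and a check that a buffer address is already a multiple of that alignment. This is not the patch's code. It uses C11 alignas where the patch uses pg_attribute_aligned, and the names IoAlignedBlock, align_up, is_io_aligned, IO_ALIGN_SIZE and BLOCK_SIZE are hypothetical stand-ins chosen for illustration. The block size of 8192 is the usual default and is an assumption here.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IO_ALIGN_SIZE 4096		/* mirrors PG_IO_ALIGN_SIZE in the patch */
#define BLOCK_SIZE    8192		/* assumed stand-in for BLCKSZ */

/* Hypothetical analogue of PGIOAlignedBlock, using C11 alignas. */
typedef union IoAlignedBlock
{
	alignas(IO_ALIGN_SIZE) char data[BLOCK_SIZE];
	double		force_align_d;
	int64_t		force_align_i64;
} IoAlignedBlock;

/* Round ptr up to the next multiple of align, which must be a power of two. */
static uintptr_t
align_up(uintptr_t ptr, uintptr_t align)
{
	return (ptr + (align - 1)) & ~(align - 1);
}

/* True if the buffer address is already a multiple of IO_ALIGN_SIZE. */
static bool
is_io_aligned(const void *buffer)
{
	uintptr_t	p = (uintptr_t) buffer;

	return p == align_up(p, IO_ALIGN_SIZE);
}
```

The comparison in is_io_aligned has the same shape as the assertions added to mdextend(), mdread() and mdwrite(): a pointer is aligned exactly when rounding it up leaves it unchanged.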

v2-0003-Add-io_direct-setting-developer-only.patch (text/x-patch; charset=US-ASCII)
From e6692d744a7d041519e9c0998ce9f34aabc63c1e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:54:18 +1300
Subject: [PATCH v2 3/4] Add io_direct setting (developer-only).

Provide a way to ask the kernel to use O_DIRECT (or local equivalent)
for data and WAL files.  This hurts performance currently and is not
intended for end-users yet.  Later proposed work would introduce our own
I/O clustering, read-ahead, etc to replace the kernel features that are
disabled with this option.

This replaces the previous logic that would use O_DIRECT for the WAL in
limited and obscure cases, now that there is an explicit setting.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      | 33 +++++++
 src/backend/access/transam/xlog.c             | 37 ++++----
 src/backend/access/transam/xlogprefetcher.c   |  2 +-
 src/backend/storage/buffer/bufmgr.c           | 16 ++--
 src/backend/storage/buffer/localbuf.c         |  7 +-
 src/backend/storage/file/fd.c                 | 88 +++++++++++++++++++
 src/backend/storage/smgr/md.c                 | 24 +++--
 src/backend/storage/smgr/smgr.c               |  1 +
 src/backend/utils/misc/guc_tables.c           | 12 +++
 src/include/storage/fd.h                      |  8 ++
 src/include/storage/smgr.h                    |  1 +
 src/include/utils/guc_hooks.h                 |  3 +
 src/test/modules/test_misc/meson.build        |  1 +
 src/test/modules/test_misc/t/004_io_direct.pl | 40 +++++++++
 14 files changed, 239 insertions(+), 34 deletions(-)
 create mode 100644 src/test/modules/test_misc/t/004_io_direct.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8e4145979d..766d20f2ea 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11033,6 +11033,39 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-io-direct" xreflabel="io_direct">
+      <term><varname>io_direct</varname> (<type>string</type>)
+      <indexterm>
+        <primary><varname>io_direct</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Ask the kernel to minimize caching effects for relation data and WAL
+        files using <literal>O_DIRECT</literal> (most Unix-like systems),
+        <literal>F_NOCACHE</literal> (macOS) or
+        <literal>FILE_FLAG_NO_BUFFERING</literal> (Windows).
+       </para>
+       <para>
+        May be set to an empty string (the default) to disable use of direct
+        I/O, or a comma-separated list of types of files for which direct I/O
+        is enabled.  The valid types of file are <literal>data</literal> for
+        main data files, <literal>wal</literal> for WAL files, and
+        <literal>wal_init</literal> for WAL files when being initially
+        allocated.
+       </para>
+       <para>
+        Some operating systems and file systems do not support direct I/O, so
+        non-default settings may be rejected at startup, or produce I/O errors
+        at runtime.
+       </para>
+       <para>
+        Currently this feature reduces performance, and is intended for
+        developer testing only.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-post-auth-delay" xreflabel="post_auth_delay">
       <term><varname>post_auth_delay</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a75c6813a4..08a2f7a558 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2925,6 +2925,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	XLogSegNo	max_segno;
 	int			fd;
 	int			save_errno;
+	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
 
 	Assert(logtli != 0);
 
@@ -2957,8 +2958,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 
 	unlink(tmppath);
 
+	if (io_direct_flags & IO_DIRECT_WAL_INIT)
+		open_flags |= PG_O_DIRECT;
+
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	fd = BasicOpenFile(tmppath, open_flags);
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -3350,7 +3354,7 @@ XLogFileClose(void)
 	 * use the cache to read the WAL segment.
 	 */
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-	if (!XLogIsNeeded())
+	if (!XLogIsNeeded() && (io_direct_flags & IO_DIRECT_WAL) == 0)
 		(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
 #endif
 
@@ -4450,7 +4454,6 @@ show_in_hot_standby(void)
 	return RecoveryInProgress() ? "on" : "off";
 }
 
-
 /*
  * Read the control file, set respective GUCs.
  *
@@ -8034,35 +8037,27 @@ xlog_redo(XLogReaderState *record)
 }
 
 /*
- * Return the (possible) sync flag used for opening a file, depending on the
- * value of the GUC wal_sync_method.
+ * Return the extra open flags used for opening a file, depending on the
+ * value of the GUCs wal_sync_method, fsync and io_direct.
  */
 static int
 get_sync_bit(int method)
 {
 	int			o_direct_flag = 0;
 
-	/* If fsync is disabled, never open in sync mode */
-	if (!enableFsync)
-		return 0;
-
 	/*
-	 * Optimize writes by bypassing kernel cache with O_DIRECT when using
-	 * O_SYNC and O_DSYNC.  But only if archiving and streaming are disabled,
-	 * otherwise the archive command or walsender process will read the WAL
-	 * soon after writing it, which is guaranteed to cause a physical read if
-	 * we bypassed the kernel cache. We also skip the
-	 * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
-	 * reason.
-	 *
-	 * Never use O_DIRECT in walreceiver process for similar reasons; the WAL
+	 * Use O_DIRECT if requested, except in walreceiver process.  The WAL
 	 * written by walreceiver is normally read by the startup process soon
-	 * after it's written. Also, walreceiver performs unaligned writes, which
+	 * after it's written.  Also, walreceiver performs unaligned writes, which
 	 * don't work with O_DIRECT, so it is required for correctness too.
 	 */
-	if (!XLogIsNeeded() && !AmWalReceiverProcess())
+	if ((io_direct_flags & IO_DIRECT_WAL) && !AmWalReceiverProcess())
 		o_direct_flag = PG_O_DIRECT;
 
+	/* If fsync is disabled, never open in sync mode */
+	if (!enableFsync)
+		return o_direct_flag;
+
 	switch (method)
 	{
 			/*
@@ -8074,7 +8069,7 @@ get_sync_bit(int method)
 		case SYNC_METHOD_FSYNC:
 		case SYNC_METHOD_FSYNC_WRITETHROUGH:
 		case SYNC_METHOD_FDATASYNC:
-			return 0;
+			return o_direct_flag;
 #ifdef O_SYNC
 		case SYNC_METHOD_OPEN:
 			return O_SYNC | o_direct_flag;
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 0cf03945ee..992256dd09 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -785,7 +785,7 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 				block->prefetch_buffer = InvalidBuffer;
 				return LRQ_NEXT_IO;
 			}
-			else
+			else if ((io_direct_flags & IO_DIRECT_DATA) == 0)
 			{
 				/*
 				 * This shouldn't be possible, because we already determined
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index aba07e94c9..11c8187a55 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -535,8 +535,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
 		 * Try to initiate an asynchronous read.  This returns false in
 		 * recovery if the relation file doesn't exist.
 		 */
-		if (smgrprefetch(smgr_reln, forkNum, blockNum))
+		if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+			smgrprefetch(smgr_reln, forkNum, blockNum))
+		{
 			result.initiated_io = true;
+		}
 #endif							/* USE_PREFETCH */
 	}
 	else
@@ -582,11 +585,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
  * the kernel and therefore didn't really initiate I/O, and no way to know when
  * the I/O completes other than using synchronous ReadBuffer().
  *
- * 3.  Otherwise, the buffer wasn't already cached by PostgreSQL, and either
+ * 3.  Otherwise, the buffer wasn't already cached by PostgreSQL, and
  * USE_PREFETCH is not defined (this build doesn't support prefetching due to
- * lack of a kernel facility), or the underlying relation file wasn't found and
- * we are in recovery.  (If the relation file wasn't found and we are not in
- * recovery, an error is raised).
+ * lack of a kernel facility), direct I/O is enabled, or the underlying
+ * relation file wasn't found and we are in recovery.  (If the relation file
+ * wasn't found and we are not in recovery, an error is raised).
  */
 PrefetchBufferResult
 PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
@@ -4908,6 +4911,9 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
 {
 	PendingWriteback *pending;
 
+	if (io_direct_flags & IO_DIRECT_DATA)
+		return;
+
 	/*
 	 * Add buffer to the pending writeback array, unless writeback control is
 	 * disabled.
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 735f7c6018..b01e319641 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -87,8 +87,11 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 	{
 #ifdef USE_PREFETCH
 		/* Not in buffers, so initiate prefetch */
-		smgrprefetch(smgr, forkNum, blockNum);
-		result.initiated_io = true;
+		if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+			smgrprefetch(smgr, forkNum, blockNum))
+		{
+			result.initiated_io = true;
+		}
 #endif							/* USE_PREFETCH */
 	}
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f6c9382023..0829e9b8df 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -98,7 +98,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "utils/guc.h"
+#include "utils/guc_hooks.h"
 #include "utils/resowner_private.h"
+#include "utils/varlena.h"
 
 /* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
 #if defined(HAVE_SYNC_FILE_RANGE)
@@ -162,6 +164,9 @@ bool		data_sync_retry = false;
 /* How SyncDataDirectory() should do its job. */
 int			recovery_init_sync_method = RECOVERY_INIT_SYNC_METHOD_FSYNC;
 
+/* Which kinds of files should be opened with PG_O_DIRECT. */
+int			io_direct_flags;
+
 /* Debugging.... */
 
 #ifdef FDDEBUG
@@ -2021,6 +2026,11 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 	if (nbytes <= 0)
 		return;
 
+#ifdef PG_O_DIRECT
+	if (VfdCache[file].fileFlags & PG_O_DIRECT)
+		return;
+#endif
+
 	returnCode = FileAccess(file);
 	if (returnCode < 0)
 		return;
@@ -3737,3 +3747,81 @@ data_sync_elevel(int elevel)
 {
 	return data_sync_retry ? elevel : PANIC;
 }
+
+bool
+check_io_direct(char **newval, void **extra, GucSource source)
+{
+#if PG_O_DIRECT == 0
+	if (*newval)
+	{
+		GUC_check_errdetail("io_direct is not supported on this platform.");
+		return false;
+	}
+#else
+	List	   *list;
+	ListCell   *l;
+	int		   *flags;
+
+	if (!SplitGUCList(*newval, ',', &list))
+	{
+		GUC_check_errdetail("invalid list syntax in parameter \"%s\"",
+							"io_direct");
+		return false;
+	}
+
+	flags = guc_malloc(ERROR, sizeof(*flags));
+	*flags = 0;
+	foreach (l, list)
+	{
+		char	   *item = (char *) lfirst(l);
+
+		if (pg_strcasecmp(item, "data") == 0)
+			*flags |= IO_DIRECT_DATA;
+		else if (pg_strcasecmp(item, "wal") == 0)
+			*flags |= IO_DIRECT_WAL;
+		else if (pg_strcasecmp(item, "wal_init") == 0)
+			*flags |= IO_DIRECT_WAL_INIT;
+		else
+		{
+			GUC_check_errdetail("invalid option \"%s\"", item);
+			return false;
+		}
+	}
+
+	*extra = flags;
+
+	return true;
+#endif
+}
+
+extern void
+assign_io_direct(const char *newval, void *extra)
+{
+	int	   *flags = (int *) extra;
+
+	io_direct_flags = *flags;
+}
+
+extern const char *
+show_io_direct(void)
+{
+	static char result[80];
+
+	result[0] = 0;
+	if (io_direct_flags & IO_DIRECT_DATA)
+		strcat(result, "data");
+	if (io_direct_flags & IO_DIRECT_WAL)
+	{
+		if (result[0])
+			strcat(result, ", ");
+		strcat(result, "wal");
+	}
+	if (io_direct_flags & IO_DIRECT_WAL_INIT)
+	{
+		if (result[0])
+			strcat(result, ", ");
+		strcat(result, "wal_init");
+	}
+
+	return result;
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 3e034afdf1..38263f3d0f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -142,6 +142,16 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
 static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
 							  MdfdVec *seg);
 
+static inline int
+_mdfd_open_flags(ForkNumber forkNum)
+{
+	int		flags = O_RDWR | PG_BINARY;
+
+	if (io_direct_flags & IO_DIRECT_DATA)
+		flags |= PG_O_DIRECT;
+
+	return flags;
+}
 
 /*
  *	mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -205,14 +215,14 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 
 	path = relpath(reln->smgr_rlocator, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	fd = PathNameOpenFile(path, _mdfd_open_flags(forknum) | O_CREAT | O_EXCL);
 
 	if (fd < 0)
 	{
 		int			save_errno = errno;
 
 		if (isRedo)
-			fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+			fd = PathNameOpenFile(path, _mdfd_open_flags(forknum));
 		if (fd < 0)
 		{
 			/* be sure to report the error reported by create, not open */
@@ -527,7 +537,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 
 	path = relpath(reln->smgr_rlocator, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+	fd = PathNameOpenFile(path, _mdfd_open_flags(forknum));
 
 	if (fd < 0)
 	{
@@ -598,6 +608,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 	off_t		seekpos;
 	MdfdVec    *v;
 
+	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
+
 	v = _mdfd_getseg(reln, forknum, blocknum, false,
 					 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
 	if (v == NULL)
@@ -623,6 +635,8 @@ void
 mdwriteback(SMgrRelation reln, ForkNumber forknum,
 			BlockNumber blocknum, BlockNumber nblocks)
 {
+	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
+
 	/*
 	 * Issue flush requests in as few requests as possible; have to split at
 	 * segment boundaries though, since those are actually separate files.
@@ -1200,7 +1214,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 	fullpath = _mdfd_segpath(reln, forknum, segno);
 
 	/* open the file */
-	fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | oflags);
+	fd = PathNameOpenFile(fullpath, _mdfd_open_flags(forknum) | oflags);
 
 	pfree(fullpath);
 
@@ -1410,7 +1424,7 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		strlcpy(path, p, MAXPGPATH);
 		pfree(p);
 
-		file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+		file = PathNameOpenFile(path, _mdfd_open_flags(ftag->forknum));
 		if (file < 0)
 			return -1;
 		need_to_close = true;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index c1a5febcbf..4892920812 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -20,6 +20,7 @@
 #include "access/xlogutils.h"
 #include "lib/ilist.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/md.h"
 #include "storage/smgr.h"
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1bf14eec66..1de30ebbf1 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -543,6 +543,7 @@ static char *locale_ctype;
 static char *server_encoding_string;
 static char *server_version_string;
 static int	server_version_num;
+static char *io_direct_string;
 
 #ifdef HAVE_SYSLOG
 #define	DEFAULT_SYSLOG_FACILITY LOG_LOCAL0
@@ -4468,6 +4469,17 @@ struct config_string ConfigureNamesString[] =
 		check_backtrace_functions, assign_backtrace_functions, NULL
 	},
 
+	{
+		{"io_direct", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+			gettext_noop("Use direct I/O for file access."),
+			NULL,
+			GUC_NOT_IN_SAMPLE
+		},
+		&io_direct_string,
+		"",
+		check_io_direct, assign_io_direct, show_io_direct
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 85ef12c440..0d65cb3c80 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -44,6 +44,8 @@
 #define FD_H
 
 #include <dirent.h>
+#include <fcntl.h>
+
 
 typedef enum RecoveryInitSyncMethod
 {
@@ -54,10 +56,16 @@ typedef enum RecoveryInitSyncMethod
 typedef int File;
 
 
+#define IO_DIRECT_DATA			0x01
+#define IO_DIRECT_WAL			0x02
+#define IO_DIRECT_WAL_INIT		0x04
+
+
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
 extern PGDLLIMPORT int recovery_init_sync_method;
+extern PGDLLIMPORT int io_direct_flags;
 
 /*
  * This is private to fd.c, but exported for save/restore_backend_variables()
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a07715356b..ea7b3ff8dd 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -17,6 +17,7 @@
 #include "lib/ilist.h"
 #include "storage/block.h"
 #include "storage/relfilelocator.h"
+#include "utils/guc.h"
 
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index f1a9a183b4..61a7fd77b8 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -154,5 +154,8 @@ extern bool check_wal_consistency_checking(char **newval, void **extra,
 										   GucSource source);
 extern void assign_wal_consistency_checking(const char *newval, void *extra);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+extern bool check_io_direct(char **newval, void **extra, GucSource source);
+extern void assign_io_direct(const char *newval, void *extra);
+extern const char *show_io_direct(void);
 
 #endif							/* GUC_HOOKS_H */
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index cfc830ff39..97162d2b8f 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -7,6 +7,7 @@ tests += {
       't/001_constraint_validation.pl',
       't/002_tablespace.pl',
       't/003_check_guc.pl',
+      't/004_io_direct.pl',
     ],
   },
 }
diff --git a/src/test/modules/test_misc/t/004_io_direct.pl b/src/test/modules/test_misc/t/004_io_direct.pl
new file mode 100644
index 0000000000..9a79fc8f9d
--- /dev/null
+++ b/src/test/modules/test_misc/t/004_io_direct.pl
@@ -0,0 +1,40 @@
+# Very simple exercise of direct I/O GUC.
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Systems that we know to have direct I/O support, and whose typical local
+# filesystems support it or at least won't fail with an error.  (illumos should
+# probably be in this list, but perl reports it as solaris.  Solaris should not
+# be in the list because we don't support its way of turning on direct I/O, and
+# even if we did, its version of ZFS rejects it, and OpenBSD just doesn't have
+# it.)
+if (!grep { $^O eq $_ } qw(aix darwin dragonfly freebsd linux MSWin32 netbsd))
+{
+	plan skip_all => "no direct I/O support";
+}
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->append_conf('io_direct', 'data,wal,wal_init');
+$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O
+$node->start;
+
+# Do some work that is bound to generate shared and local writes and reads as a
+# simple exercise.
+$node->safe_psql('postgres', 'create table t1 as select 1 as i from generate_series(1, 10000)');
+$node->safe_psql('postgres', 'create table t2count (i int)');
+$node->safe_psql('postgres', 'begin; create temporary table t2 as select 1 as i from generate_series(1, 10000); update t2 set i = i; insert into t2count select count(*) from t2; commit;');
+$node->safe_psql('postgres', 'update t1 set i = i');
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'), "read back from shared");
+is('10000', $node->safe_psql('postgres', 'select * from t2count'), "read back from local");
+$node->stop('immediate');
+
+$node->start;
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'), "read back from shared after crash recovery");
+$node->stop;
+
+done_testing();
-- 
2.35.1
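
The GUC handling in the patch above turns a comma-separated string into a bitmask and back again. A rough standalone sketch of that round trip follows, assuming nothing from the PostgreSQL tree: parse_io_direct, format_io_direct and append_item are hypothetical helpers, and only the three IO_DIRECT_* values are copied from the patch. Unlike the patch, which uses SplitGUCList() and pg_strcasecmp(), this version matches names case-sensitively and silently skips empty items, so treat it as an illustration of the mapping rather than of the exact parsing rules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define IO_DIRECT_DATA		0x01
#define IO_DIRECT_WAL		0x02
#define IO_DIRECT_WAL_INIT	0x04

/* Parse e.g. "data, wal" into flag bits; false if an item is unknown. */
static bool
parse_io_direct(const char *value, int *flags)
{
	static const struct { const char *name; int bit; } kinds[] = {
		{"data", IO_DIRECT_DATA},
		{"wal", IO_DIRECT_WAL},
		{"wal_init", IO_DIRECT_WAL_INIT},
	};
	const char *p = value;
	int			result = 0;

	for (;;)
	{
		size_t		len;
		size_t		i;
		bool		found = false;

		p += strspn(p, ", \t");		/* skip separators and blanks */
		if (*p == '\0')
			break;
		len = strcspn(p, ", \t");	/* length of the current item */

		for (i = 0; i < sizeof(kinds) / sizeof(kinds[0]); i++)
		{
			if (strlen(kinds[i].name) == len &&
				strncmp(p, kinds[i].name, len) == 0)
			{
				result |= kinds[i].bit;
				found = true;
				break;
			}
		}
		if (!found)
			return false;			/* leave *flags untouched on error */
		p += len;
	}

	*flags = result;
	return true;
}

/* Append one name to out, preceded by ", " if out is not empty. */
static void
append_item(char *out, size_t outsize, const char *name)
{
	if (out[0] != '\0')
		strncat(out, ", ", outsize - strlen(out) - 1);
	strncat(out, name, outsize - strlen(out) - 1);
}

/* Render flag bits in the same order that show_io_direct() uses. */
static void
format_io_direct(int flags, char *out, size_t outsize)
{
	out[0] = '\0';
	if (flags & IO_DIRECT_DATA)
		append_item(out, outsize, "data");
	if (flags & IO_DIRECT_WAL)
		append_item(out, outsize, "wal");
	if (flags & IO_DIRECT_WAL_INIT)
		append_item(out, outsize, "wal_init");
}
```

Keeping the parsed result in a single int is what lets call sites test one file type cheaply, as in (io_direct_flags & IO_DIRECT_DATA).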

v2-0004-XXX-turn-on-direct-I-O-by-default-just-for-CI.patch (text/x-patch; charset=US-ASCII)
From 75dec1b3ffa91ca1279267092187191ca99fb713 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:55:09 +1300
Subject: [PATCH v2 4/4] XXX turn on direct I/O by default, just for CI

---
 src/backend/utils/misc/guc_tables.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1de30ebbf1..0fc2185568 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -4476,7 +4476,7 @@ struct config_string ConfigureNamesString[] =
 			GUC_NOT_IN_SAMPLE
 		},
 		&io_direct_string,
-		"",
+		"data,wal,wal_init",
 		check_io_direct, assign_io_direct, show_io_direct
 	},
 
-- 
2.35.1

#11Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#10)
3 attachment(s)
Re: Direct I/O

On Wed, Dec 14, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:

0001 -- David's palloc_aligned() patch https://commitfest.postgresql.org/41/3999/
0002 -- I/O-align almost all buffers used for I/O
0003 -- Add the GUCs
0004 -- Throwaway hack to make cfbot turn the GUCs on

David pushed the first as commit 439f6175, so here is a rebase of the
rest. I also fixed a couple of thinkos in the handling of systems
where we don't know how to do direct I/O. In one place I had #ifdef
PG_O_DIRECT, but that macro is always defined; it is simply 0 on
Solaris and OpenBSD. The check to reject the GUC also wasn't quite
right on such systems.
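
To make the #ifdef thinko concrete, here is a tiny sketch of the difference between testing whether a macro is defined and testing its value. MY_O_DIRECT, guarded_by_ifdef and guarded_by_value are hypothetical names, and the macro is set to 0 to imitate a platform where direct I/O is unavailable.

```c
#include <assert.h>
#include <stdbool.h>

/* Imitates a platform without direct I/O: defined, but with value 0. */
#define MY_O_DIRECT 0

/* Wrong guard: the branch is compiled in whenever the macro exists. */
static bool
guarded_by_ifdef(void)
{
#ifdef MY_O_DIRECT
	return true;				/* taken even though the value is 0 */
#else
	return false;
#endif
}

/* Right guard: the branch is compiled in only for a nonzero value. */
static bool
guarded_by_value(void)
{
#if MY_O_DIRECT != 0
	return true;
#else
	return false;				/* correctly reports "not supported" */
#endif
}
```

With the macro defined as 0, guarded_by_ifdef() returns true and guarded_by_value() returns false, which is why a value test is the safer guard when the macro is always defined.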

Attachments:

v3-0001-Introduce-PG_IO_ALIGN_SIZE-and-align-all-I-O-buff.patch (text/x-patch; charset=US-ASCII)
From f6adf05ffa5bdf43cd3ca2bcc4dba39d1474ce09 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:25:59 +1300
Subject: [PATCH v3 1/3] Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.

In order to be allowed to use O_DIRECT in a later commit, we need to
align buffers to the virtual memory page size.  O_DIRECT would either
fail to work or fail to work efficiently without that on various
platforms.  Even without O_DIRECT, aligning on memory pages improves
traditional buffered I/O performance.

The alignment size is set to 4096, which is enough for currently known
systems.  There is no standard governing the requirements for O_DIRECT so
it's possible we might have to reconsider this approach or fail to work
on some exotic system, but for now this simplistic approach works and
it can be changed at compile time.

Adjust all call sites that allocate heap memory for file I/O to use the
new palloc_aligned() or MemoryContextAllocAligned() functions.  For
stack-allocated buffers, introduce PGIOAlignedBlock to respect
PG_IO_ALIGN_SIZE, if possible with this compiler.  Also align the main
buffer pool in shared memory.

If arbitrary alignment of stack objects is not possible with this
compiler, then completely disable the use of O_DIRECT by setting
PG_O_DIRECT to 0.  (This avoids the need to consider systems that have
O_DIRECT but don't have a compiler with an extension that can align
stack objects the way we want; that could be done but we don't currently
know of any such system, so it's easier to pretend there is no O_DIRECT
support instead: that's an existing and tested class of system.)

Add assertions that all buffers passed into smgrread(), smgrwrite(),
smgrextend() are correctly aligned, if PG_O_DIRECT isn't 0.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
---
 contrib/bloom/blinsert.c                  |  2 +-
 contrib/pg_prewarm/pg_prewarm.c           |  2 +-
 src/backend/access/gist/gistbuild.c       |  9 +++---
 src/backend/access/hash/hashpage.c        |  2 +-
 src/backend/access/heap/rewriteheap.c     |  2 +-
 src/backend/access/heap/visibilitymap.c   |  2 +-
 src/backend/access/nbtree/nbtree.c        |  2 +-
 src/backend/access/nbtree/nbtsort.c       |  8 ++++--
 src/backend/access/spgist/spginsert.c     |  2 +-
 src/backend/access/transam/generic_xlog.c | 13 ++++++---
 src/backend/access/transam/xlog.c         |  9 +++---
 src/backend/catalog/storage.c             |  2 +-
 src/backend/storage/buffer/buf_init.c     | 10 +++++--
 src/backend/storage/buffer/bufmgr.c       |  2 +-
 src/backend/storage/buffer/localbuf.c     |  7 +++--
 src/backend/storage/file/buffile.c        |  6 ++++
 src/backend/storage/freespace/freespace.c |  2 +-
 src/backend/storage/page/bufpage.c        |  5 +++-
 src/backend/storage/smgr/md.c             | 15 +++++++++-
 src/backend/utils/sort/logtape.c          |  2 +-
 src/bin/pg_checksums/pg_checksums.c       |  2 +-
 src/bin/pg_rewind/local_source.c          |  4 +--
 src/bin/pg_upgrade/file.c                 |  4 +--
 src/common/file_utils.c                   |  2 +-
 src/include/c.h                           | 34 +++++++++++++++++------
 src/include/pg_config_manual.h            |  7 +++++
 src/include/storage/fd.h                  |  5 ++--
 src/tools/pgindent/typedefs.list          |  1 +
 28 files changed, 114 insertions(+), 49 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index dd26d6ac29..53cc617a66 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -166,7 +166,7 @@ blbuildempty(Relation index)
 	Page		metapage;
 
 	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
+	metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 	BloomFillMetapage(index, metapage);
 
 	/*
diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index caff5c4a80..f50aa69eb2 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -36,7 +36,7 @@ typedef enum
 	PREWARM_BUFFER
 } PrewarmType;
 
-static PGAlignedBlock blockbuffer;
+static PGIOAlignedBlock blockbuffer;
 
 /*
  * pg_prewarm(regclass, mode text, fork text,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index fb0f466708..d3d7d836e9 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -415,7 +415,7 @@ gist_indexsortbuild(GISTBuildState *state)
 	 * Write an empty page as a placeholder for the root page. It will be
 	 * replaced with the real root page at the end.
 	 */
-	page = palloc0(BLCKSZ);
+	page = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
 	smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
 			   page, true);
 	state->pages_allocated++;
@@ -509,7 +509,8 @@ gist_indexsortbuild_levelstate_add(GISTBuildState *state,
 			levelstate->current_page++;
 
 		if (levelstate->pages[levelstate->current_page] == NULL)
-			levelstate->pages[levelstate->current_page] = palloc(BLCKSZ);
+			levelstate->pages[levelstate->current_page] =
+				palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 
 		newPage = levelstate->pages[levelstate->current_page];
 		gistinitpage(newPage, old_page_flags);
@@ -579,7 +580,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 
 		/* Create page and copy data */
 		data = (char *) (dist->list);
-		target = palloc0(BLCKSZ);
+		target = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
 		gistinitpage(target, isleaf ? F_LEAF : 0);
 		for (int i = 0; i < dist->block.num; i++)
 		{
@@ -630,7 +631,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		if (parent == NULL)
 		{
 			parent = palloc0(sizeof(GistSortedBuildLevelState));
-			parent->pages[0] = (Page) palloc(BLCKSZ);
+			parent->pages[0] = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 			parent->parent = NULL;
 			gistinitpage(parent->pages[0], 0);
 
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 55b2929ad5..147af95e92 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -992,7 +992,7 @@ static bool
 _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 {
 	BlockNumber lastblock;
-	PGAlignedBlock zerobuf;
+	PGIOAlignedBlock zerobuf;
 	Page		page;
 	HashPageOpaque ovflopaque;
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2fe9e48e50..23d966940e 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -255,7 +255,7 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 
 	state->rs_old_rel = old_heap;
 	state->rs_new_rel = new_heap;
-	state->rs_buffer = (Page) palloc(BLCKSZ);
+	state->rs_buffer = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 	/* new_heap needn't be empty, just locked */
 	state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
 	state->rs_buffer_valid = false;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4ed70275e2..3bd65b275b 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -620,7 +620,7 @@ static void
 vm_extend(Relation rel, BlockNumber vm_nblocks)
 {
 	BlockNumber vm_nblocks_now;
-	PGAlignedBlock pg;
+	PGIOAlignedBlock pg;
 	SMgrRelation reln;
 
 	PageInit((Page) pg.data, BLCKSZ, 0);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b52eca8f38..e8ac7390ae 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -153,7 +153,7 @@ btbuildempty(Relation index)
 	Page		metapage;
 
 	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
+	metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
 
 	/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 501e011ce1..5e3c461f6f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -619,7 +619,7 @@ _bt_blnewpage(uint32 level)
 	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc(BLCKSZ);
+	page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -660,7 +660,9 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	while (blkno > wstate->btws_pages_written)
 	{
 		if (!wstate->btws_zeropage)
-			wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
+			wstate->btws_zeropage = (Page) palloc_aligned(BLCKSZ,
+														  PG_IO_ALIGN_SIZE,
+														  MCXT_ALLOC_ZERO);
 		/* don't set checksum for all-zero page */
 		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
 				   wstate->btws_pages_written++,
@@ -1170,7 +1172,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
 	 */
-	metapage = (Page) palloc(BLCKSZ);
+	metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 	_bt_initmetapage(metapage, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index c6821b5952..6f14b41329 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -158,7 +158,7 @@ spgbuildempty(Relation index)
 	Page		page;
 
 	/* Construct metapage. */
-	page = (Page) palloc(BLCKSZ);
+	page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 	SpGistInitMetapage(page);
 
 	/*
diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 6db9a1fca1..458e270d55 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -58,14 +58,17 @@ typedef struct
 	char		delta[MAX_DELTA_SIZE];	/* delta between page images */
 } PageData;
 
-/* State of generic xlog record construction */
+/*
+ * State of generic xlog record construction.  Must be allocated at an I/O
+ * aligned address.
+ */
 struct GenericXLogState
 {
+	/* Page images (properly aligned, must be first) */
+	PGIOAlignedBlock images[MAX_GENERIC_XLOG_PAGES];
 	/* Info about each page, see above */
 	PageData	pages[MAX_GENERIC_XLOG_PAGES];
 	bool		isLogged;
-	/* Page images (properly aligned) */
-	PGAlignedBlock images[MAX_GENERIC_XLOG_PAGES];
 };
 
 static void writeFragment(PageData *pageData, OffsetNumber offset,
@@ -269,7 +272,9 @@ GenericXLogStart(Relation relation)
 	GenericXLogState *state;
 	int			i;
 
-	state = (GenericXLogState *) palloc(sizeof(GenericXLogState));
+	state = (GenericXLogState *) palloc_aligned(sizeof(GenericXLogState),
+												PG_IO_ALIGN_SIZE,
+												0);
 	state->isLogged = RelationNeedsWAL(relation);
 
 	for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 91473b00d9..172b4a2fcf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4502,7 +4502,7 @@ XLOGShmemSize(void)
 	/* xlblocks array */
 	size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
 	/* extra alignment padding for XLOG I/O buffers */
-	size = add_size(size, XLOG_BLCKSZ);
+	size = add_size(size, Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE));
 	/* and the buffers themselves */
 	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
 
@@ -4599,10 +4599,11 @@ XLOGShmemInit(void)
 
 	/*
 	 * Align the start of the page buffers to a full xlog block size boundary.
-	 * This simplifies some calculations in XLOG insertion. It is also
-	 * required for O_DIRECT.
+	 * This simplifies some calculations in XLOG insertion.  We also need I/O
+	 * alignment for O_DIRECT, but that's also a power of two and usually
+	 * smaller.  Take the larger of the two alignment requirements.
 	 */
-	allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
+	allocptr = (char *) TYPEALIGN(Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE), allocptr);
 	XLogCtl->pages = allocptr;
 	memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
 
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d708af19ed..0c5ac1f94b 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -451,7 +451,7 @@ void
 RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 					ForkNumber forkNum, char relpersistence)
 {
-	PGAlignedBlock buf;
+	PGIOAlignedBlock buf;
 	Page		page;
 	bool		use_wal;
 	bool		copying_initfork;
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 6b6264854e..76a30d44b7 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -78,9 +78,12 @@ InitBufferPool(void)
 						NBuffers * sizeof(BufferDescPadded),
 						&foundDescs);
 
+	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
-		ShmemInitStruct("Buffer Blocks",
-						NBuffers * (Size) BLCKSZ, &foundBufs);
+		TYPEALIGN(PG_IO_ALIGN_SIZE,
+				  ShmemInitStruct("Buffer Blocks",
+								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+								  &foundBufs));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
@@ -163,7 +166,8 @@ BufferShmemSize(void)
 	/* to allow aligning buffer descriptors */
 	size = add_size(size, PG_CACHE_LINE_SIZE);
 
-	/* size of data pages */
+	/* size of data pages, plus alignment padding */
+	size = add_size(size, PG_IO_ALIGN_SIZE);
 	size = add_size(size, mul_size(NBuffers, BLCKSZ));
 
 	/* size of stuff controlled by freelist.c */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 73d30bf619..aba07e94c9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3717,7 +3717,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	bool		use_wal;
 	BlockNumber nblocks;
 	BlockNumber blkno;
-	PGAlignedBlock buf;
+	PGIOAlignedBlock buf;
 	BufferAccessStrategy bstrategy_src;
 	BufferAccessStrategy bstrategy_dst;
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..735f7c6018 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -546,8 +546,11 @@ GetLocalBufferStorage(void)
 		/* And don't overflow MaxAllocSize, either */
 		num_bufs = Min(num_bufs, MaxAllocSize / BLCKSZ);
 
-		cur_block = (char *) MemoryContextAlloc(LocalBufferContext,
-												num_bufs * BLCKSZ);
+		/* Buffers should be I/O aligned. */
+		cur_block = (char *)
+			TYPEALIGN(PG_IO_ALIGN_SIZE,
+					  MemoryContextAlloc(LocalBufferContext,
+										 num_bufs * BLCKSZ + PG_IO_ALIGN_SIZE));
 		next_buf_in_block = 0;
 		num_bufs_in_block = num_bufs;
 	}
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index b0b4eeb3bd..2261c3ebe3 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -95,6 +95,12 @@ struct BufFile
 	off_t		curOffset;		/* offset part of current pos */
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
+
+	/*
+	 * XXX Should ideally use PGIOAlignedBlock, but might need a way to avoid
+	 * wasting per-file alignment padding when some users create many
+	 * files.
+	 */
 	PGAlignedBlock buffer;
 };
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index a6b0533103..7230d538fd 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -608,7 +608,7 @@ static void
 fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 {
 	BlockNumber fsm_nblocks_now;
-	PGAlignedBlock pg;
+	PGIOAlignedBlock pg;
 	SMgrRelation reln;
 
 	PageInit((Page) pg.data, BLCKSZ, 0);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 8b617c7e79..0728ce30c0 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1522,7 +1522,10 @@ PageSetChecksumCopy(Page page, BlockNumber blkno)
 	 * and second to avoid wasting space in processes that never call this.
 	 */
 	if (pageCopy == NULL)
-		pageCopy = MemoryContextAlloc(TopMemoryContext, BLCKSZ);
+		pageCopy = MemoryContextAllocAligned(TopMemoryContext,
+											 BLCKSZ,
+											 PG_IO_ALIGN_SIZE,
+											 0);
 
 	memcpy(pageCopy, (char *) page, BLCKSZ);
 	((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 14b6fa0fd9..3e034afdf1 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -453,6 +453,10 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	int			nbytes;
 	MdfdVec    *v;
 
+	/* If this build supports direct I/O, the buffer must be I/O aligned. */
+	if (PG_O_DIRECT != 0)
+		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert(blocknum >= mdnblocks(reln, forknum));
@@ -675,6 +679,10 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	int			nbytes;
 	MdfdVec    *v;
 
+	/* If this build supports direct I/O, the buffer must be I/O aligned. */
+	if (PG_O_DIRECT != 0)
+		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
 	TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
 										reln->smgr_rlocator.locator.spcOid,
 										reln->smgr_rlocator.locator.dbOid,
@@ -740,6 +748,10 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	int			nbytes;
 	MdfdVec    *v;
 
+	/* If this build supports direct I/O, the buffer must be I/O aligned. */
+	if (PG_O_DIRECT != 0)
+		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert(blocknum < mdnblocks(reln, forknum));
@@ -1294,7 +1306,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 			 */
 			if (nblocks < ((BlockNumber) RELSEG_SIZE))
 			{
-				char	   *zerobuf = palloc0(BLCKSZ);
+				char	   *zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE,
+													 MCXT_ALLOC_ZERO);
 
 				mdextend(reln, forknum,
 						 nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index c384f98e13..6ba5030a5f 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -252,7 +252,7 @@ ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
 	 */
 	while (blocknum > lts->nBlocksWritten)
 	{
-		PGAlignedBlock zerobuf;
+		PGIOAlignedBlock zerobuf;
 
 		MemSet(zerobuf.data, 0, sizeof(zerobuf));
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 7f3d5fc040..be6bae2b78 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -183,7 +183,7 @@ skipfile(const char *fn)
 static void
 scan_file(const char *fn, int segmentno)
 {
-	PGAlignedBlock buf;
+	PGIOAlignedBlock buf;
 	PageHeader	header = (PageHeader) buf.data;
 	int			f;
 	BlockNumber blockno;
diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c
index 2e50485c39..83b37a1e91 100644
--- a/src/bin/pg_rewind/local_source.c
+++ b/src/bin/pg_rewind/local_source.c
@@ -77,7 +77,7 @@ static void
 local_queue_fetch_file(rewind_source *source, const char *path, size_t len)
 {
 	const char *datadir = ((local_source *) source)->datadir;
-	PGAlignedBlock buf;
+	PGIOAlignedBlock buf;
 	char		srcpath[MAXPGPATH];
 	int			srcfd;
 	size_t		written_len;
@@ -129,7 +129,7 @@ local_queue_fetch_range(rewind_source *source, const char *path, off_t off,
 						size_t len)
 {
 	const char *datadir = ((local_source *) source)->datadir;
-	PGAlignedBlock buf;
+	PGIOAlignedBlock buf;
 	char		srcpath[MAXPGPATH];
 	int			srcfd;
 	off_t		begin = off;
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 079fbda838..b5809236f6 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -178,8 +178,8 @@ rewriteVisibilityMap(const char *fromfile, const char *tofile,
 {
 	int			src_fd;
 	int			dst_fd;
-	PGAlignedBlock buffer;
-	PGAlignedBlock new_vmbuf;
+	PGIOAlignedBlock buffer;
+	PGIOAlignedBlock new_vmbuf;
 	ssize_t		totalBytesRead = 0;
 	ssize_t		src_filesize;
 	int			rewriteVmBytesPerPage;
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index d8507d88a5..83ef0609a2 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -539,7 +539,7 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 ssize_t
 pg_pwrite_zeros(int fd, size_t size)
 {
-	PGAlignedBlock zbuffer;		/* worth BLCKSZ */
+	PGIOAlignedBlock zbuffer;		/* worth BLCKSZ */
 	size_t		zbuffer_sz;
 	struct iovec iov[PG_IOV_MAX];
 	int			blocks;
diff --git a/src/include/c.h b/src/include/c.h
index bd6d8e5bf5..d811181fdf 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -1071,14 +1071,11 @@ extern void ExceptionalCondition(const char *conditionName,
 
 /*
  * Use this, not "char buf[BLCKSZ]", to declare a field or local variable
- * holding a page buffer, if that page might be accessed as a page and not
- * just a string of bytes.  Otherwise the variable might be under-aligned,
- * causing problems on alignment-picky hardware.  (In some places, we use
- * this to declare buffers even though we only pass them to read() and
- * write(), because copying to/from aligned buffers is usually faster than
- * using unaligned buffers.)  We include both "double" and "int64" in the
- * union to ensure that the compiler knows the value must be MAXALIGN'ed
- * (cf. configure's computation of MAXIMUM_ALIGNOF).
+ * holding a page buffer, if that page might be accessed as a page.  Otherwise
+ * the variable might be under-aligned, causing problems on alignment-picky
+ * hardware.  We include both "double" and "int64" in the union to ensure that
+ * the compiler knows the value must be MAXALIGN'ed (cf.  configure's
+ * computation of MAXIMUM_ALIGNOF).
  */
 typedef union PGAlignedBlock
 {
@@ -1087,9 +1084,30 @@ typedef union PGAlignedBlock
 	int64		force_align_i64;
 } PGAlignedBlock;
 
+/*
+ * Use this to declare a field or local variable holding a page buffer, if that
+ * page might be accessed as a page or passed to an SMgr I/O function.  If
+ * allocating using the MemoryContext API, the aligned allocation functions
+ * should be used with PG_IO_ALIGN_SIZE.  This alignment may be more efficient
+ * for I/O in general, but may be strictly required on some platforms when
+ * using direct I/O.
+ */
+typedef union PGIOAlignedBlock
+{
+#ifdef pg_attribute_aligned
+	pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#endif
+	char		data[BLCKSZ];
+	double		force_align_d;
+	int64		force_align_i64;
+} PGIOAlignedBlock;
+
 /* Same, but for an XLOG_BLCKSZ-sized buffer */
 typedef union PGAlignedXLogBlock
 {
+#ifdef pg_attribute_aligned
+	pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#endif
 	char		data[XLOG_BLCKSZ];
 	double		force_align_d;
 	int64		force_align_i64;
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index f2a106f983..323a4cfb4f 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -227,6 +227,13 @@
  */
 #define PG_CACHE_LINE_SIZE		128
 
+/*
+ * Assumed memory alignment requirement for direct I/O.  On currently known
+ * systems this size applies, even for memory that is backed by larger virtual
+ * memory pages.
+ */
+#define PG_IO_ALIGN_SIZE		4096
+
 /*
  *------------------------------------------------------------------------
  * The following symbols are for enabling debugging code, not for
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7144fc9f60..85ef12c440 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -82,9 +82,10 @@ extern PGDLLIMPORT int max_safe_fds;
  * to the appropriate Windows flag in src/port/open.c.  We simulate it with
  * fcntl(F_NOCACHE) on macOS inside fd.c's open() wrapper.  We use the name
  * PG_O_DIRECT rather than defining O_DIRECT in that case (probably not a good
- * idea on a Unix).
+ * idea on a Unix).  We can only use it if the compiler will correctly align
+ * PGIOAlignedBlock for us, though.
  */
-#if defined(O_DIRECT)
+#if defined(O_DIRECT) && defined(pg_attribute_aligned)
 #define		PG_O_DIRECT O_DIRECT
 #elif defined(F_NOCACHE)
 #define		PG_O_DIRECT 0x80000000
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 60c71d05fe..9a77664154 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1687,6 +1687,7 @@ PGEventResultDestroy
 PGFInfoFunction
 PGFileType
 PGFunction
+PGIOAlignedBlock
 PGLZ_HistEntry
 PGLZ_Strategy
 PGMessageField
-- 
2.35.1
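
For readers unfamiliar with over-aligned stack objects, the following standalone sketch shows the general shape of a union like PGIOAlignedBlock. It is an illustration under assumptions, not the patch's definition: it uses C11 `alignas` where the patch uses the compiler-specific pg_attribute_aligned(), and the `Sketch`/`SKETCH_` names and sizes are hypothetical stand-ins for BLCKSZ and PG_IO_ALIGN_SIZE.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for BLCKSZ and PG_IO_ALIGN_SIZE. */
#define SKETCH_BLCKSZ 8192
#define SKETCH_IO_ALIGN_SIZE 4096

/*
 * A page-sized buffer whose data member requests I/O alignment.  The
 * double and int64_t members mirror the existing trick of forcing at
 * least maximum scalar alignment even without the alignment specifier.
 */
typedef union SketchIOAlignedBlock
{
	alignas(SKETCH_IO_ALIGN_SIZE) char data[SKETCH_BLCKSZ];
	double		force_align_d;
	int64_t		force_align_i64;
} SketchIOAlignedBlock;

/*
 * Declare a block on the stack, zero it, and report whether the compiler
 * placed it on an I/O-aligned address.
 */
static bool
sketch_stack_block_is_aligned(void)
{
	SketchIOAlignedBlock zerobuf;

	memset(zerobuf.data, 0, sizeof(zerobuf));
	return ((uintptr_t) zerobuf.data % SKETCH_IO_ALIGN_SIZE) == 0;
}
```

On a compiler that cannot honor the requested alignment for stack objects, this kind of check would fail, which is the situation the patch sidesteps by defining PG_O_DIRECT as 0.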

v3-0002-Add-io_direct-setting-developer-only.patch (text/x-patch; charset=US-ASCII)
From fc1ccbbfd4a0e4c29cee8695a091fea0353b442d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:54:18 +1300
Subject: [PATCH v3 2/3] Add io_direct setting (developer-only).

Provide a way to ask the kernel to use O_DIRECT (or local equivalent)
for data and WAL files.  This hurts performance currently and is not
intended for end-users yet.  Later proposed work would introduce our own
I/O clustering, read-ahead, etc to replace the kernel features that are
disabled with this option.

This replaces the previous logic that would use O_DIRECT for the WAL in
limited and obscure cases, now that there is an explicit setting.
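
As a rough illustration of how a comma-separated setting such as "data,wal" can be turned into a bitmask like io_direct_flags, here is a standalone sketch. It is an assumption-laden example rather than the patch's check hook: the `sketch_` function names, the `SKETCH_IO_DIRECT_*` constants and their numeric values are all hypothetical, and the real code may well use PostgreSQL's own list-splitting helpers instead of hand-rolled parsing.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits; the values used by the patch are not shown here. */
#define SKETCH_IO_DIRECT_DATA		0x01
#define SKETCH_IO_DIRECT_WAL		0x02
#define SKETCH_IO_DIRECT_WAL_INIT	0x04

/* Compare a token of length len against a NUL-terminated name. */
static bool
sketch_token_is(const char *tok, size_t len, const char *name)
{
	return strlen(name) == len && strncmp(tok, name, len) == 0;
}

/*
 * Parse a comma-separated list of "data", "wal" and "wal_init" into a
 * bitmask.  An empty string yields 0 (direct I/O disabled).  Returns false
 * for an unknown or empty list item and leaves *flags untouched.  A single
 * trailing comma is tolerated by this sketch.
 */
static bool
sketch_parse_io_direct(const char *value, int *flags)
{
	int			result = 0;
	const char *p = value;

	while (*p != '\0')
	{
		const char *start;
		size_t		len;

		/* Skip whitespace before the item. */
		while (isspace((unsigned char) *p))
			p++;
		start = p;
		while (*p != '\0' && *p != ',')
			p++;
		len = (size_t) (p - start);
		/* Trim whitespace after the item. */
		while (len > 0 && isspace((unsigned char) start[len - 1]))
			len--;

		if (sketch_token_is(start, len, "data"))
			result |= SKETCH_IO_DIRECT_DATA;
		else if (sketch_token_is(start, len, "wal"))
			result |= SKETCH_IO_DIRECT_WAL;
		else if (sketch_token_is(start, len, "wal_init"))
			result |= SKETCH_IO_DIRECT_WAL_INIT;
		else
			return false;

		if (*p == ',')
			p++;
	}

	*flags = result;
	return true;
}
```

Call sites can then test individual bits, in the same style as the `io_direct_flags & IO_DIRECT_WAL` checks in the patch.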

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      | 33 +++++++
 src/backend/access/transam/xlog.c             | 37 ++++----
 src/backend/access/transam/xlogprefetcher.c   |  2 +-
 src/backend/storage/buffer/bufmgr.c           | 16 ++--
 src/backend/storage/buffer/localbuf.c         |  7 +-
 src/backend/storage/file/fd.c                 | 87 +++++++++++++++++++
 src/backend/storage/smgr/md.c                 | 24 +++--
 src/backend/storage/smgr/smgr.c               |  1 +
 src/backend/utils/misc/guc_tables.c           | 12 +++
 src/include/storage/fd.h                      |  8 ++
 src/include/storage/smgr.h                    |  1 +
 src/include/utils/guc_hooks.h                 |  3 +
 src/test/modules/test_misc/meson.build        |  1 +
 src/test/modules/test_misc/t/004_io_direct.pl | 40 +++++++++
 14 files changed, 238 insertions(+), 34 deletions(-)
 create mode 100644 src/test/modules/test_misc/t/004_io_direct.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9eedab652d..70614d4fcc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11056,6 +11056,39 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-io-direct" xreflabel="io_direct">
+      <term><varname>io_direct</varname> (<type>string</type>)
+      <indexterm>
+        <primary><varname>io_direct</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Ask the kernel to minimize caching effects for relation data and WAL
+        files using <literal>O_DIRECT</literal> (most Unix-like systems),
+        <literal>F_NOCACHE</literal> (macOS) or
+        <literal>FILE_FLAG_NO_BUFFERING</literal> (Windows).
+       </para>
+       <para>
+        May be set to an empty string (the default) to disable use of direct
+        I/O, or a comma-separated list of types of files for which direct I/O
+        is enabled.  The valid types of file are <literal>data</literal> for
+        main data files, <literal>wal</literal> for WAL files, and
+        <literal>wal_init</literal> for WAL files when being initially
+        allocated.
+       </para>
+       <para>
+        Some operating systems and file systems do not support direct I/O, so
+        non-default settings may be rejected at startup, or produce I/O errors
+        at runtime.
+       </para>
+       <para>
+        Currently this feature reduces performance, and is intended for
+        developer testing only.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-post-auth-delay" xreflabel="post_auth_delay">
       <term><varname>post_auth_delay</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 172b4a2fcf..9a4f4ca711 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2925,6 +2925,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	XLogSegNo	max_segno;
 	int			fd;
 	int			save_errno;
+	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
 
 	Assert(logtli != 0);
 
@@ -2957,8 +2958,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 
 	unlink(tmppath);
 
+	if (io_direct_flags & IO_DIRECT_WAL_INIT)
+		open_flags |= PG_O_DIRECT;
+
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	fd = BasicOpenFile(tmppath, open_flags);
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -3350,7 +3354,7 @@ XLogFileClose(void)
 	 * use the cache to read the WAL segment.
 	 */
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-	if (!XLogIsNeeded())
+	if (!XLogIsNeeded() && (io_direct_flags & IO_DIRECT_WAL) == 0)
 		(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
 #endif
 
@@ -4441,7 +4445,6 @@ show_in_hot_standby(void)
 	return RecoveryInProgress() ? "on" : "off";
 }
 
-
 /*
  * Read the control file, set respective GUCs.
  *
@@ -8025,35 +8028,27 @@ xlog_redo(XLogReaderState *record)
 }
 
 /*
- * Return the (possible) sync flag used for opening a file, depending on the
- * value of the GUC wal_sync_method.
+ * Return the extra open flags used for opening a file, depending on the
+ * value of the GUCs wal_sync_method, fsync and io_direct.
  */
 static int
 get_sync_bit(int method)
 {
 	int			o_direct_flag = 0;
 
-	/* If fsync is disabled, never open in sync mode */
-	if (!enableFsync)
-		return 0;
-
 	/*
-	 * Optimize writes by bypassing kernel cache with O_DIRECT when using
-	 * O_SYNC and O_DSYNC.  But only if archiving and streaming are disabled,
-	 * otherwise the archive command or walsender process will read the WAL
-	 * soon after writing it, which is guaranteed to cause a physical read if
-	 * we bypassed the kernel cache. We also skip the
-	 * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
-	 * reason.
-	 *
-	 * Never use O_DIRECT in walreceiver process for similar reasons; the WAL
+	 * Use O_DIRECT if requested, except in walreceiver process.  The WAL
 	 * written by walreceiver is normally read by the startup process soon
-	 * after it's written. Also, walreceiver performs unaligned writes, which
+	 * after it's written.  Also, walreceiver performs unaligned writes, which
 	 * don't work with O_DIRECT, so it is required for correctness too.
 	 */
-	if (!XLogIsNeeded() && !AmWalReceiverProcess())
+	if ((io_direct_flags & IO_DIRECT_WAL) && !AmWalReceiverProcess())
 		o_direct_flag = PG_O_DIRECT;
 
+	/* If fsync is disabled, never open in sync mode */
+	if (!enableFsync)
+		return o_direct_flag;
+
 	switch (method)
 	{
 			/*
@@ -8065,7 +8060,7 @@ get_sync_bit(int method)
 		case SYNC_METHOD_FSYNC:
 		case SYNC_METHOD_FSYNC_WRITETHROUGH:
 		case SYNC_METHOD_FDATASYNC:
-			return 0;
+			return o_direct_flag;
 #ifdef O_SYNC
 		case SYNC_METHOD_OPEN:
 			return O_SYNC | o_direct_flag;
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 0cf03945ee..992256dd09 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -785,7 +785,7 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 				block->prefetch_buffer = InvalidBuffer;
 				return LRQ_NEXT_IO;
 			}
-			else
+			else if ((io_direct_flags & IO_DIRECT_DATA) == 0)
 			{
 				/*
 				 * This shouldn't be possible, because we already determined
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index aba07e94c9..11c8187a55 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -535,8 +535,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
 		 * Try to initiate an asynchronous read.  This returns false in
 		 * recovery if the relation file doesn't exist.
 		 */
-		if (smgrprefetch(smgr_reln, forkNum, blockNum))
+		if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+			smgrprefetch(smgr_reln, forkNum, blockNum))
+		{
 			result.initiated_io = true;
+		}
 #endif							/* USE_PREFETCH */
 	}
 	else
@@ -582,11 +585,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
  * the kernel and therefore didn't really initiate I/O, and no way to know when
  * the I/O completes other than using synchronous ReadBuffer().
  *
- * 3.  Otherwise, the buffer wasn't already cached by PostgreSQL, and either
+ * 3.  Otherwise, the buffer wasn't already cached by PostgreSQL, and
  * USE_PREFETCH is not defined (this build doesn't support prefetching due to
- * lack of a kernel facility), or the underlying relation file wasn't found and
- * we are in recovery.  (If the relation file wasn't found and we are not in
- * recovery, an error is raised).
+ * lack of a kernel facility), direct I/O is enabled, or the underlying
+ * relation file wasn't found and we are in recovery.  (If the relation file
+ * wasn't found and we are not in recovery, an error is raised).
  */
 PrefetchBufferResult
 PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
@@ -4908,6 +4911,9 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
 {
 	PendingWriteback *pending;
 
+	if (io_direct_flags & IO_DIRECT_DATA)
+		return;
+
 	/*
 	 * Add buffer to the pending writeback array, unless writeback control is
 	 * disabled.
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 735f7c6018..b01e319641 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -87,8 +87,11 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 	{
 #ifdef USE_PREFETCH
 		/* Not in buffers, so initiate prefetch */
-		smgrprefetch(smgr, forkNum, blockNum);
-		result.initiated_io = true;
+		if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+			smgrprefetch(smgr, forkNum, blockNum))
+		{
+			result.initiated_io = true;
+		}
 #endif							/* USE_PREFETCH */
 	}
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f6c9382023..6d1af80f9b 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -98,7 +98,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "utils/guc.h"
+#include "utils/guc_hooks.h"
 #include "utils/resowner_private.h"
+#include "utils/varlena.h"
 
 /* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
 #if defined(HAVE_SYNC_FILE_RANGE)
@@ -162,6 +164,9 @@ bool		data_sync_retry = false;
 /* How SyncDataDirectory() should do its job. */
 int			recovery_init_sync_method = RECOVERY_INIT_SYNC_METHOD_FSYNC;
 
+/* Which kinds of files should be opened with PG_O_DIRECT. */
+int			io_direct_flags;
+
 /* Debugging.... */
 
 #ifdef FDDEBUG
@@ -2021,6 +2026,9 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 	if (nbytes <= 0)
 		return;
 
+	if (VfdCache[file].fileFlags & PG_O_DIRECT)
+		return;
+
 	returnCode = FileAccess(file);
 	if (returnCode < 0)
 		return;
@@ -3737,3 +3745,82 @@ data_sync_elevel(int elevel)
 {
 	return data_sync_retry ? elevel : PANIC;
 }
+
+bool
+check_io_direct(char **newval, void **extra, GucSource source)
+{
+	int		   *flags = guc_malloc(ERROR, sizeof(*flags));
+
+#if PG_O_DIRECT == 0
+	if (strcmp(*newval, "") != 0)
+	{
+		GUC_check_errdetail("io_direct is not supported on this platform.");
+		return false;
+	}
+	*flags = 0;
+#else
+	List	   *list;
+	ListCell   *l;
+
+	if (!SplitGUCList(*newval, ',', &list))
+	{
+		GUC_check_errdetail("invalid list syntax in parameter \"%s\"",
+							"io_direct");
+		return false;
+	}
+
+	*flags = 0;
+	foreach (l, list)
+	{
+		char	   *item = (char *) lfirst(l);
+
+		if (pg_strcasecmp(item, "data") == 0)
+			*flags |= IO_DIRECT_DATA;
+		else if (pg_strcasecmp(item, "wal") == 0)
+			*flags |= IO_DIRECT_WAL;
+		else if (pg_strcasecmp(item, "wal_init") == 0)
+			*flags |= IO_DIRECT_WAL_INIT;
+		else
+		{
+			GUC_check_errdetail("invalid option \"%s\"", item);
+			return false;
+		}
+	}
+#endif
+
+	*extra = flags;
+
+	return true;
+}
+
+extern void
+assign_io_direct(const char *newval, void *extra)
+{
+	int	   *flags = (int *) extra;
+
+	io_direct_flags = *flags;
+}
+
+extern const char *
+show_io_direct(void)
+{
+	static char result[80];
+
+	result[0] = 0;
+	if (io_direct_flags & IO_DIRECT_DATA)
+		strcat(result, "data");
+	if (io_direct_flags & IO_DIRECT_WAL)
+	{
+		if (result[0])
+			strcat(result, ", ");
+		strcat(result, "wal");
+	}
+	if (io_direct_flags & IO_DIRECT_WAL_INIT)
+	{
+		if (result[0])
+			strcat(result, ", ");
+		strcat(result, "wal_init");
+	}
+
+	return result;
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 3e034afdf1..38263f3d0f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -142,6 +142,16 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
 static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
 							  MdfdVec *seg);
 
+static inline int
+_mdfd_open_flags(ForkNumber forkNum)
+{
+	int		flags = O_RDWR | PG_BINARY;
+
+	if (io_direct_flags & IO_DIRECT_DATA)
+		flags |= PG_O_DIRECT;
+
+	return flags;
+}
 
 /*
  *	mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -205,14 +215,14 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 
 	path = relpath(reln->smgr_rlocator, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	fd = PathNameOpenFile(path, _mdfd_open_flags(forknum) | O_CREAT | O_EXCL);
 
 	if (fd < 0)
 	{
 		int			save_errno = errno;
 
 		if (isRedo)
-			fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+			fd = PathNameOpenFile(path, _mdfd_open_flags(forknum));
 		if (fd < 0)
 		{
 			/* be sure to report the error reported by create, not open */
@@ -527,7 +537,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 
 	path = relpath(reln->smgr_rlocator, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+	fd = PathNameOpenFile(path, _mdfd_open_flags(forknum));
 
 	if (fd < 0)
 	{
@@ -598,6 +608,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 	off_t		seekpos;
 	MdfdVec    *v;
 
+	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
+
 	v = _mdfd_getseg(reln, forknum, blocknum, false,
 					 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
 	if (v == NULL)
@@ -623,6 +635,8 @@ void
 mdwriteback(SMgrRelation reln, ForkNumber forknum,
 			BlockNumber blocknum, BlockNumber nblocks)
 {
+	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
+
 	/*
 	 * Issue flush requests in as few requests as possible; have to split at
 	 * segment boundaries though, since those are actually separate files.
@@ -1200,7 +1214,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 	fullpath = _mdfd_segpath(reln, forknum, segno);
 
 	/* open the file */
-	fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | oflags);
+	fd = PathNameOpenFile(fullpath, _mdfd_open_flags(forknum) | oflags);
 
 	pfree(fullpath);
 
@@ -1410,7 +1424,7 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		strlcpy(path, p, MAXPGPATH);
 		pfree(p);
 
-		file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+		file = PathNameOpenFile(path, _mdfd_open_flags(ftag->forknum));
 		if (file < 0)
 			return -1;
 		need_to_close = true;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index c1a5febcbf..4892920812 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -20,6 +20,7 @@
 #include "access/xlogutils.h"
 #include "lib/ilist.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/md.h"
 #include "storage/smgr.h"
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 436afe1d21..e12bb15669 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -543,6 +543,7 @@ static char *locale_ctype;
 static char *server_encoding_string;
 static char *server_version_string;
 static int	server_version_num;
+static char *io_direct_string;
 
 #ifdef HAVE_SYSLOG
 #define	DEFAULT_SYSLOG_FACILITY LOG_LOCAL0
@@ -4483,6 +4484,17 @@ struct config_string ConfigureNamesString[] =
 		check_backtrace_functions, assign_backtrace_functions, NULL
 	},
 
+	{
+		{"io_direct", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+			gettext_noop("Use direct I/O for file access."),
+			NULL,
+			GUC_NOT_IN_SAMPLE
+		},
+		&io_direct_string,
+		"",
+		check_io_direct, assign_io_direct, show_io_direct
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 85ef12c440..0d65cb3c80 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -44,6 +44,8 @@
 #define FD_H
 
 #include <dirent.h>
+#include <fcntl.h>
+
 
 typedef enum RecoveryInitSyncMethod
 {
@@ -54,10 +56,16 @@ typedef enum RecoveryInitSyncMethod
 typedef int File;
 
 
+#define IO_DIRECT_DATA			0x01
+#define IO_DIRECT_WAL			0x02
+#define IO_DIRECT_WAL_INIT		0x04
+
+
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
 extern PGDLLIMPORT int recovery_init_sync_method;
+extern PGDLLIMPORT int io_direct_flags;
 
 /*
  * This is private to fd.c, but exported for save/restore_backend_variables()
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a07715356b..ea7b3ff8dd 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -17,6 +17,7 @@
 #include "lib/ilist.h"
 #include "storage/block.h"
 #include "storage/relfilelocator.h"
+#include "utils/guc.h"
 
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index f1a9a183b4..61a7fd77b8 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -154,5 +154,8 @@ extern bool check_wal_consistency_checking(char **newval, void **extra,
 										   GucSource source);
 extern void assign_wal_consistency_checking(const char *newval, void *extra);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+extern bool check_io_direct(char **newval, void **extra, GucSource source);
+extern void assign_io_direct(const char *newval, void *extra);
+extern const char *show_io_direct(void);
 
 #endif							/* GUC_HOOKS_H */
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index b7478c3125..bbed7093d0 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -9,6 +9,7 @@ tests += {
       't/001_constraint_validation.pl',
       't/002_tablespace.pl',
       't/003_check_guc.pl',
+      't/004_io_direct.pl',
     ],
   },
 }
diff --git a/src/test/modules/test_misc/t/004_io_direct.pl b/src/test/modules/test_misc/t/004_io_direct.pl
new file mode 100644
index 0000000000..803fb334e7
--- /dev/null
+++ b/src/test/modules/test_misc/t/004_io_direct.pl
@@ -0,0 +1,40 @@
+# Very simple exercise of direct I/O GUC.
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Systems that we know to have direct I/O support, and whose typical local
+# filesystems support it or at least won't fail with an error.  (illumos should
+# probably be in this list, but perl reports it as solaris.  Solaris should not
+# be in the list because we don't support its way of turning on direct I/O, and
+# even if we did, its version of ZFS rejects it, and OpenBSD just doesn't have
+# it.)
+if (!grep { $^O eq $_ } qw(aix darwin dragonfly freebsd linux MSWin32 netbsd))
+{
+	plan skip_all => "no direct I/O support";
+}
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->append_conf('io_direct', 'data,wal,wal_init');
+$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O
+$node->start;
+
+# Do some work that is bound to generate shared and local writes and reads as a
+# simple exercise.
+$node->safe_psql('postgres', 'create table t1 as select 1 as i from generate_series(1, 10000)');
+$node->safe_psql('postgres', 'create table t2count (i int)');
+$node->safe_psql('postgres', 'begin; create temporary table t2 as select 1 as i from generate_series(1, 10000); update t2 set i = i; insert into t2count select count(*) from t2; commit;');
+$node->safe_psql('postgres', 'update t1 set i = i');
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'), "read back from shared");
+is('10000', $node->safe_psql('postgres', 'select * from t2count'), "read back from local");
+$node->stop('immediate');
+
+$node->start;
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'), "read back from shared after crash recovery");
+$node->stop;
+
+done_testing();
-- 
2.35.1

v3-0003-XXX-turn-on-direct-I-O-by-default-just-for-CI.patch (text/x-patch; charset=US-ASCII)
From c2f343aea63f2837fa22047ef3298bddf443646a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:55:09 +1300
Subject: [PATCH v3 3/3] XXX turn on direct I/O by default, just for CI

---
 src/backend/utils/misc/guc_tables.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e12bb15669..762e9f8590 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -4491,7 +4491,7 @@ struct config_string ConfigureNamesString[] =
 			GUC_NOT_IN_SAMPLE
 		},
 		&io_direct_string,
-		"",
+		"data,wal,wal_init",
 		check_io_direct, assign_io_direct, show_io_direct
 	},
 
-- 
2.35.1

#12Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Thomas Munro (#11)
1 attachment(s)
Re: Direct I/O

On Thu, Dec 22, 2022 at 7:34 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Dec 14, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > 0001 -- David's palloc_aligned() patch https://commitfest.postgresql.org/41/3999/
> > 0002 -- I/O-align almost all buffers used for I/O
> > 0003 -- Add the GUCs
> > 0004 -- Throwaway hack to make cfbot turn the GUCs on
>
> David pushed the first as commit 439f6175, so here is a rebase of the
> rest. I also fixed a couple of thinkos in the handling of systems
> where we don't know how to do direct I/O. In one place I had #ifdef
> PG_O_DIRECT, but that's always defined, it's just that it's 0 on
> Solaris and OpenBSD, and the check to reject the GUC wasn't quite
> right on such systems.

Thanks. I have some comments on
v3-0002-Add-io_direct-setting-developer-only.patch:

1. I think check_io_direct doesn't need to overwrite io_direct_string;
if it leaves the string alone, show_io_direct can be avoided.
2. check_io_direct can leak the flags memory when io_direct is not
supported, when the list syntax is invalid, or when an invalid option
is specified.

I have addressed my review comments as a delta patch on top of v3-0002
and added it here as v1-0001-Review-comments-io_direct-GUC.txt.
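
To illustrate point 2 outside the GUC machinery, here is a minimal
standalone sketch of the "validate into a local, allocate last" shape.
It is not the patch itself: parse_io_direct() is a hypothetical helper,
and plain strdup()/strtok_r()/malloc() stand in for pstrdup(),
SplitGUCList() and guc_malloc(). Only the flag names and values come
from the patch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

#define IO_DIRECT_DATA		0x01
#define IO_DIRECT_WAL		0x02
#define IO_DIRECT_WAL_INIT	0x04

/*
 * Hypothetical stand-in for check_io_direct(): accumulate the flags in a
 * local int and allocate the result only after validation has succeeded.
 * Returns NULL, with nothing left allocated, on invalid input.
 */
static int *
parse_io_direct(const char *value)
{
	char	   *copy;
	char	   *save = NULL;
	int			flags = 0;
	int		   *result;

	assert(value != NULL);

	/* strtok_r() modifies its input, so work on a scratch copy */
	copy = strdup(value);
	if (copy == NULL)
		return NULL;

	for (char *item = strtok_r(copy, ", ", &save);
		 item != NULL;
		 item = strtok_r(NULL, ", ", &save))
	{
		if (strcasecmp(item, "data") == 0)
			flags |= IO_DIRECT_DATA;
		else if (strcasecmp(item, "wal") == 0)
			flags |= IO_DIRECT_WAL;
		else if (strcasecmp(item, "wal_init") == 0)
			flags |= IO_DIRECT_WAL_INIT;
		else
		{
			/* only the scratch copy exists at this point */
			free(copy);
			return NULL;
		}
	}
	free(copy);

	/* validation passed: now it is safe to allocate the result */
	result = malloc(sizeof(*result));
	if (result != NULL)
		*result = flags;
	return result;
}
```

Because the result is allocated only after the last early return, the
error paths have nothing to clean up except the scratch copy.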

Some comments on the tests added:

1. Is there a way to check programmatically whether direct I/O for WAL
and data has actually been picked up? IOW, can we know if the OS page
cache is bypassed? I know of an external extension, pgfincore, which
can help here, but AFAICS nothing exists in core.
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'),
"read back from shared");
+is('10000', $node->safe_psql('postgres', 'select * from t2count'),
"read back from local");
+$node->stop('immediate');
2. Can we combine these two append_conf to a single statement?
+$node->append_conf('io_direct', 'data,wal,wal_init');
+$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O

3. A nitpick: Can we split these queries across multiple lines instead of keeping them on a single line?
+$node->safe_psql('postgres', 'begin; create temporary table t2 as
select 1 as i from generate_series(1, 10000); update t2 set i = i;
insert into t2count select count(*) from t2; commit;');

4. I don't think we need to stop the node before the test ends, no?
+$node->stop;
+
+done_testing();

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-Review-comments-io_direct-GUC.txt (text/plain; charset=US-ASCII)
From b3eed3d6fc849b9e16fbace1f37d401424f81ab0 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 25 Jan 2023 07:18:11 +0000
Subject: [PATCH v1] Review comments io_direct GUC

---
 src/backend/storage/file/fd.c       | 57 ++++++++++++-----------------
 src/backend/utils/misc/guc_tables.c |  4 +-
 src/include/utils/guc_hooks.h       |  1 -
 3 files changed, 25 insertions(+), 37 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index eb83de4fb9..329acc2ffd 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3749,7 +3749,7 @@ data_sync_elevel(int elevel)
 bool
 check_io_direct(char **newval, void **extra, GucSource source)
 {
-	int		   *flags = guc_malloc(ERROR, sizeof(*flags));
+	int		flags;
 
 #if PG_O_DIRECT == 0
 	if (strcmp(*newval, "") != 0)
@@ -3757,38 +3757,51 @@ check_io_direct(char **newval, void **extra, GucSource source)
 		GUC_check_errdetail("io_direct is not supported on this platform.");
 		return false;
 	}
-	*flags = 0;
+	flags = 0;
 #else
-	List	   *list;
+	List	   *elemlist;
 	ListCell   *l;
+	char	   *rawstring;
 
-	if (!SplitGUCList(*newval, ',', &list))
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	if (!SplitGUCList(rawstring, ',', &elemlist))
 	{
 		GUC_check_errdetail("invalid list syntax in parameter \"%s\"",
 							"io_direct");
+		pfree(rawstring);
+		list_free(elemlist);
 		return false;
 	}
 
-	*flags = 0;
-	foreach (l, list)
+	flags = 0;
+	foreach (l, elemlist)
 	{
 		char	   *item = (char *) lfirst(l);
 
 		if (pg_strcasecmp(item, "data") == 0)
-			*flags |= IO_DIRECT_DATA;
+			flags |= IO_DIRECT_DATA;
 		else if (pg_strcasecmp(item, "wal") == 0)
-			*flags |= IO_DIRECT_WAL;
+			flags |= IO_DIRECT_WAL;
 		else if (pg_strcasecmp(item, "wal_init") == 0)
-			*flags |= IO_DIRECT_WAL_INIT;
+			flags |= IO_DIRECT_WAL_INIT;
 		else
 		{
 			GUC_check_errdetail("invalid option \"%s\"", item);
+			pfree(rawstring);
+			list_free(elemlist);
 			return false;
 		}
 	}
+
+	pfree(rawstring);
+	list_free(elemlist);
 #endif
 
-	*extra = flags;
+	/* Save the flags in *extra, for use by assign_io_direct */
+	*extra = guc_malloc(ERROR, sizeof(int));
+	*((int *) *extra) = flags;
 
 	return true;
 }
@@ -3800,27 +3813,3 @@ assign_io_direct(const char *newval, void *extra)
 
 	io_direct_flags = *flags;
 }
-
-extern const char *
-show_io_direct(void)
-{
-	static char result[80];
-
-	result[0] = 0;
-	if (io_direct_flags & IO_DIRECT_DATA)
-		strcat(result, "data");
-	if (io_direct_flags & IO_DIRECT_WAL)
-	{
-		if (result[0])
-			strcat(result, ", ");
-		strcat(result, "wal");
-	}
-	if (io_direct_flags & IO_DIRECT_WAL_INIT)
-	{
-		if (result[0])
-			strcat(result, ", ");
-		strcat(result, "wal_init");
-	}
-
-	return result;
-}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 9410493ae7..25b7e87abb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -4529,11 +4529,11 @@ struct config_string ConfigureNamesString[] =
 		{"io_direct", PGC_POSTMASTER, DEVELOPER_OPTIONS,
 			gettext_noop("Use direct I/O for file access."),
 			NULL,
-			GUC_NOT_IN_SAMPLE
+			GUC_LIST_INPUT | GUC_NOT_IN_SAMPLE
 		},
 		&io_direct_string,
 		"",
-		check_io_direct, assign_io_direct, show_io_direct
+		check_io_direct, assign_io_direct, NULL
 	},
 
 	/* End-of-list marker */
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index e656a16a40..b3b5148185 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -156,6 +156,5 @@ extern void assign_wal_consistency_checking(const char *newval, void *extra);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 extern bool check_io_direct(char **newval, void **extra, GucSource source);
 extern void assign_io_direct(const char *newval, void *extra);
-extern const char *show_io_direct(void);
 
 #endif							/* GUC_HOOKS_H */
-- 
2.34.1

#13Thomas Munro
thomas.munro@gmail.com
In reply to: Bharath Rupireddy (#12)
3 attachment(s)
Re: Direct I/O

On Wed, Jan 25, 2023 at 8:57 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

> Thanks. I have some comments on
> v3-0002-Add-io_direct-setting-developer-only.patch:
>
> 1. I think we don't need to overwrite the io_direct_string in
> check_io_direct so that show_io_direct can be avoided.

Thanks for looking at this, and sorry for the late response. Yeah, agreed.

> 2. check_io_direct can leak the flags memory - when io_direct is not
> supported or for an invalid list syntax or an invalid option is
> specified.
>
> I have addressed my review comments as a delta patch on top of v3-0002
> and added it here as v1-0001-Review-comments-io_direct-GUC.txt.

Thanks. Your way is nicer. I merged your patch and added you as a co-author.

> Some comments on the tests added:
>
> 1. Is there a way to know if Direct IO for WAL and data has been
> picked up programmatically? IOW, can we know if the OS page cache is
> bypassed? I know an external extension pgfincore which can help here,
> but nothing in the core exists AFAICS.

Right, that extension can tell you how many pages are in the kernel
page cache, which is quite interesting for this. I also once hacked up
something primitive to see *which* pages are in kernel cache, so I
could join that against pg_buffercache to measure double buffering,
when I was studying the "smile" shape where pgbench TPS goes down and
then back up again as you increase shared_buffers if the working set
is nearly as big as physical memory (code available in a link from
[1]).

Yeah, I agree it might be nice for human investigators to put
something like that in contrib/pg_buffercache, but I'm not sure you
could rely on it enough for an automated test, 'cause it
probably won't work on some file systems and the tests would probably
fail for random transient reasons (for example: some systems won't
kick pages out of kernel cache if they were already there, just
because we decided to open the file with O_DIRECT). (I got curious
about why mincore() wasn't standardised along with mmap() and all that
jazz; it seems the BSD and later Sun people who invented all those
interfaces didn't think that one was quite good enough[2], but every
(?) Unixoid OS copied it anyway, with variations... Apparently the
Windows thing is called VirtualQuery()).
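
For anyone who wants to poke at this by hand, the following is a rough
sketch of the mmap() + mincore() approach for counting how many pages
of one file are resident in the kernel page cache. It is neither the
pgfincore code nor the hack mentioned above: count_cached_pages() is a
made-up name, and the vector type follows the Linux prototype (some
other systems declare it as char *). As noted, the answer is only a
snapshot and may not be meaningful on every file system.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1		/* exposes mincore() in glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the number of pages of "path" that are
 * currently resident in the kernel page cache, or -1 on error.
 */
static long
count_cached_pages(const char *path)
{
	long		pagesize = sysconf(_SC_PAGESIZE);
	long		resident = 0;
	struct stat st;
	unsigned char *vec;
	void	   *map;
	size_t		length;
	size_t		npages;
	int			fd;

	assert(pagesize > 0);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0)
	{
		close(fd);
		return -1;
	}
	if (st.st_size == 0)
	{
		/* nothing to map, so nothing can be cached */
		close(fd);
		return 0;
	}
	length = (size_t) st.st_size;

	/* Mapping the file does not by itself read it into the cache. */
	map = mmap(NULL, length, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);					/* the mapping remains valid */
	if (map == MAP_FAILED)
		return -1;

	/* mincore() fills in one byte per page; bit 0 means "resident". */
	npages = (length + (size_t) pagesize - 1) / (size_t) pagesize;
	vec = malloc(npages);
	if (vec != NULL && mincore(map, length, vec) == 0)
	{
		for (size_t i = 0; i < npages; i++)
			resident += vec[i] & 1;
	}
	else
		resident = -1;

	free(vec);
	munmap(map, length);
	return resident;
}
```

Calling this on a relation segment file before and after a query would
give a crude view of how much of it the kernel is caching.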

> 2. Can we combine these two append_conf to a single statement?
> +$node->append_conf('io_direct', 'data,wal,wal_init');
> +$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O

OK, sure, done. And also oops, that was completely wrong and not
working the way I had it in that version...

> 3. A nitpick: Can we split these queries multi-line instead of in a single line?
> +$node->safe_psql('postgres', 'begin; create temporary table t2 as
> select 1 as i from generate_series(1, 10000); update t2 set i = i;
> insert into t2count select count(*) from t2; commit;');

OK.

> 4. I don't think we need to stop the node before the test ends, no?
> +$node->stop;
> +
> +done_testing();

Sure, but why not?

Otherwise, I rebased, and made a couple more changes:

I found a line of the manual about wal_sync_method that needed to be removed:

- The <literal>open_</literal>* options also use
<literal>O_DIRECT</literal> if available.

In fact that sentence didn't correctly document the behaviour in
released branches (wal_level=minimal is also required for that, so
probably very few people ever used it). I think we should adjust that
misleading sentence in back-branches, separately from this patch set.

I also updated the commit message to highlight the only expected
user-visible change for this, namely the loss of the above incorrectly
documented obscure special case, which is replaced by the less obscure
new setting io_direct=wal, if someone still wants that behaviour.

Also a few minor comment changes.

[1]: https://twitter.com/MengTangmu/status/994770040745615361
[2]: http://kos.enix.org/pub/gingell8.pdf

Attachments:

v4-0001-Introduce-PG_IO_ALIGN_SIZE-and-align-all-I-O-buff.patch (text/x-patch; charset=US-ASCII)
From c6e01d506762fb7c11a3fb31d56902fa53ea822b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:25:59 +1300
Subject: [PATCH v4 1/3] Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.

In order to be able to use O_DIRECT/FILE_FLAG_NO_BUFFERING on common
systems in a later commit, we need the address and length of user space
buffers to align with the sector size of the storage.  O_DIRECT would
either fail to work or fail to work efficiently without that on various
platforms.  Even without O_DIRECT, aligning on memory pages is known to
improve traditional buffered I/O performance.

The alignment size is set to 4096, which is enough for currently known
systems: it covers traditional 512 byte sectors and modern 4096 byte
sectors, as well as common 4096 byte memory pages.  There is no standard
governing the requirements for O_DIRECT so it's possible we might have
to reconsider this approach or fail to work on some exotic system, but
for now this simplistic approach works and it can be changed at compile
time.

Three classes of I/O buffers for regular data pages are adjusted:
(1) Heap buffers are allocated with the new palloc_aligned() or
MemoryContextAllocAligned() functions introduced by commit 439f6175.
(2) Stack buffers now use a new struct PGIOAlignedBlock to respect
PG_IO_ALIGN_SIZE, if possible with this compiler.  (3) The main buffer
pool is also aligned in shared memory.

If arbitrary alignment of stack objects is not possible with this
compiler, then completely disable the use of O_DIRECT by setting
PG_O_DIRECT to 0.  (This avoids the need to consider systems that have
O_DIRECT but don't have a compiler with an extension that can align
stack objects the way we want; that could be done but we don't currently
know of any such system, so it's easier to pretend there is no O_DIRECT
support instead: that's an existing and tested class of system.)

Add assertions that all buffers passed into smgrread(), smgrwrite(),
smgrextend() are correctly aligned, if PG_O_DIRECT isn't 0.
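
As a rough standalone illustration of the stack and heap cases described
above (not the code in this patch), the sketch below uses C11 alignas
and posix_memalign() in place of the compiler attribute and
palloc_aligned().  IOAlignedBlock, is_io_aligned() and
alloc_io_aligned() are hypothetical names, and alignments this large
are an extended alignment that the compiler has to support.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for posix_memalign() */
#endif
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>

#define IO_ALIGN_SIZE	4096	/* mirrors the 4096 chosen above */
#define BLOCK_SIZE		8192	/* stand-in for BLCKSZ */

/* Block-sized buffer whose storage starts on an I/O-aligned address. */
typedef union IOAlignedBlock
{
	alignas(IO_ALIGN_SIZE) char data[BLOCK_SIZE];
} IOAlignedBlock;

/* True if p could be handed to a direct I/O read or write. */
static int
is_io_aligned(const void *p)
{
	return ((uintptr_t) p % IO_ALIGN_SIZE) == 0;
}

/* Heap counterpart: returns NULL on failure; release with free(). */
static void *
alloc_io_aligned(size_t size)
{
	void	   *p = NULL;

	if (posix_memalign(&p, IO_ALIGN_SIZE, size) != 0)
		return NULL;

	/* same kind of check as the assertions added to the smgr calls */
	assert(is_io_aligned(p));
	return p;
}
```

A caller would declare "IOAlignedBlock buf;" on the stack and pass
buf.data to the I/O routine, or use alloc_io_aligned(BLOCK_SIZE) for a
longer-lived buffer.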

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
---
 contrib/bloom/blinsert.c                  |  2 +-
 contrib/pg_prewarm/pg_prewarm.c           |  2 +-
 src/backend/access/gist/gistbuild.c       |  9 +++---
 src/backend/access/hash/hashpage.c        |  2 +-
 src/backend/access/heap/rewriteheap.c     |  2 +-
 src/backend/access/nbtree/nbtree.c        |  2 +-
 src/backend/access/nbtree/nbtsort.c       |  8 ++++--
 src/backend/access/spgist/spginsert.c     |  2 +-
 src/backend/access/transam/generic_xlog.c | 13 ++++++---
 src/backend/access/transam/xlog.c         |  9 +++---
 src/backend/catalog/storage.c             |  2 +-
 src/backend/storage/buffer/buf_init.c     | 10 +++++--
 src/backend/storage/buffer/bufmgr.c       |  2 +-
 src/backend/storage/buffer/localbuf.c     |  7 +++--
 src/backend/storage/file/buffile.c        |  6 ++++
 src/backend/storage/page/bufpage.c        |  5 +++-
 src/backend/storage/smgr/md.c             | 15 +++++++++-
 src/backend/utils/sort/logtape.c          |  2 +-
 src/bin/pg_checksums/pg_checksums.c       |  2 +-
 src/bin/pg_rewind/local_source.c          |  4 +--
 src/bin/pg_upgrade/file.c                 |  4 +--
 src/common/file_utils.c                   |  4 +--
 src/include/c.h                           | 34 +++++++++++++++++------
 src/include/pg_config_manual.h            |  6 ++++
 src/include/storage/fd.h                  |  5 ++--
 src/tools/pgindent/typedefs.list          |  1 +
 26 files changed, 112 insertions(+), 48 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index dcd8120895..b42b9e6c41 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -166,7 +166,7 @@ blbuildempty(Relation index)
 	Page		metapage;
 
 	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
+	metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 	BloomFillMetapage(index, metapage);
 
 	/*
diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 54209924ae..e464d0d4d2 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -36,7 +36,7 @@ typedef enum
 	PREWARM_BUFFER
 } PrewarmType;
 
-static PGAlignedBlock blockbuffer;
+static PGIOAlignedBlock blockbuffer;
 
 /*
  * pg_prewarm(regclass, mode text, fork text,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index d2f8da5b02..5e0c1447f9 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -415,7 +415,7 @@ gist_indexsortbuild(GISTBuildState *state)
 	 * Write an empty page as a placeholder for the root page. It will be
 	 * replaced with the real root page at the end.
 	 */
-	page = palloc0(BLCKSZ);
+	page = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
 	smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
 			   page, true);
 	state->pages_allocated++;
@@ -509,7 +509,8 @@ gist_indexsortbuild_levelstate_add(GISTBuildState *state,
 			levelstate->current_page++;
 
 		if (levelstate->pages[levelstate->current_page] == NULL)
-			levelstate->pages[levelstate->current_page] = palloc(BLCKSZ);
+			levelstate->pages[levelstate->current_page] =
+				palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 
 		newPage = levelstate->pages[levelstate->current_page];
 		gistinitpage(newPage, old_page_flags);
@@ -579,7 +580,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 
 		/* Create page and copy data */
 		data = (char *) (dist->list);
-		target = palloc0(BLCKSZ);
+		target = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
 		gistinitpage(target, isleaf ? F_LEAF : 0);
 		for (int i = 0; i < dist->block.num; i++)
 		{
@@ -630,7 +631,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		if (parent == NULL)
 		{
 			parent = palloc0(sizeof(GistSortedBuildLevelState));
-			parent->pages[0] = (Page) palloc(BLCKSZ);
+			parent->pages[0] = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 			parent->parent = NULL;
 			gistinitpage(parent->pages[0], 0);
 
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 6d8af42260..af3a154266 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -992,7 +992,7 @@ static bool
 _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 {
 	BlockNumber lastblock;
-	PGAlignedBlock zerobuf;
+	PGIOAlignedBlock zerobuf;
 	Page		page;
 	HashPageOpaque ovflopaque;
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index ae0282a70e..424958912c 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -255,7 +255,7 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 
 	state->rs_old_rel = old_heap;
 	state->rs_new_rel = new_heap;
-	state->rs_buffer = (Page) palloc(BLCKSZ);
+	state->rs_buffer = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 	/* new_heap needn't be empty, just locked */
 	state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
 	state->rs_buffer_valid = false;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 992f84834f..2df8849858 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -154,7 +154,7 @@ btbuildempty(Relation index)
 	Page		metapage;
 
 	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
+	metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
 
 	/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 1207a49689..6ad3f3c54d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -619,7 +619,7 @@ _bt_blnewpage(uint32 level)
 	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc(BLCKSZ);
+	page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -660,7 +660,9 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	while (blkno > wstate->btws_pages_written)
 	{
 		if (!wstate->btws_zeropage)
-			wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
+			wstate->btws_zeropage = (Page) palloc_aligned(BLCKSZ,
+														  PG_IO_ALIGN_SIZE,
+														  MCXT_ALLOC_ZERO);
 		/* don't set checksum for all-zero page */
 		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
 				   wstate->btws_pages_written++,
@@ -1170,7 +1172,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
 	 */
-	metapage = (Page) palloc(BLCKSZ);
+	metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 	_bt_initmetapage(metapage, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 718a88335d..72d2e1551c 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -158,7 +158,7 @@ spgbuildempty(Relation index)
 	Page		page;
 
 	/* Construct metapage. */
-	page = (Page) palloc(BLCKSZ);
+	page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
 	SpGistInitMetapage(page);
 
 	/*
diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 9f67d1c1cd..6c68191ca6 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -58,14 +58,17 @@ typedef struct
 	char		delta[MAX_DELTA_SIZE];	/* delta between page images */
 } PageData;
 
-/* State of generic xlog record construction */
+/*
+ * State of generic xlog record construction.  Must be allocated at an I/O
+ * aligned address.
+ */
 struct GenericXLogState
 {
+	/* Page images (properly aligned, must be first) */
+	PGIOAlignedBlock images[MAX_GENERIC_XLOG_PAGES];
 	/* Info about each page, see above */
 	PageData	pages[MAX_GENERIC_XLOG_PAGES];
 	bool		isLogged;
-	/* Page images (properly aligned) */
-	PGAlignedBlock images[MAX_GENERIC_XLOG_PAGES];
 };
 
 static void writeFragment(PageData *pageData, OffsetNumber offset,
@@ -269,7 +272,9 @@ GenericXLogStart(Relation relation)
 	GenericXLogState *state;
 	int			i;
 
-	state = (GenericXLogState *) palloc(sizeof(GenericXLogState));
+	state = (GenericXLogState *) palloc_aligned(sizeof(GenericXLogState),
+												PG_IO_ALIGN_SIZE,
+												0);
 	state->isLogged = RelationNeedsWAL(relation);
 
 	for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 46821ad605..3fea8c4082 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4506,7 +4506,7 @@ XLOGShmemSize(void)
 	/* xlblocks array */
 	size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
 	/* extra alignment padding for XLOG I/O buffers */
-	size = add_size(size, XLOG_BLCKSZ);
+	size = add_size(size, Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE));
 	/* and the buffers themselves */
 	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
 
@@ -4603,10 +4603,11 @@ XLOGShmemInit(void)
 
 	/*
 	 * Align the start of the page buffers to a full xlog block size boundary.
-	 * This simplifies some calculations in XLOG insertion. It is also
-	 * required for O_DIRECT.
+	 * This simplifies some calculations in XLOG insertion.  We also need I/O
+	 * alignment for O_DIRECT, but that's also a power of two and usually
+	 * smaller.  Take the larger of the two alignment requirements.
 	 */
-	allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
+	allocptr = (char *) TYPEALIGN(Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE), allocptr);
 	XLogCtl->pages = allocptr;
 	memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
 
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index af1491aa1d..2add053489 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -451,7 +451,7 @@ void
 RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 					ForkNumber forkNum, char relpersistence)
 {
-	PGAlignedBlock buf;
+	PGIOAlignedBlock buf;
 	Page		page;
 	bool		use_wal;
 	bool		copying_initfork;
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 20946c47cb..0057443f0c 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -78,9 +78,12 @@ InitBufferPool(void)
 						NBuffers * sizeof(BufferDescPadded),
 						&foundDescs);
 
+	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
-		ShmemInitStruct("Buffer Blocks",
-						NBuffers * (Size) BLCKSZ, &foundBufs);
+		TYPEALIGN(PG_IO_ALIGN_SIZE,
+				  ShmemInitStruct("Buffer Blocks",
+								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+								  &foundBufs));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
@@ -163,7 +166,8 @@ BufferShmemSize(void)
 	/* to allow aligning buffer descriptors */
 	size = add_size(size, PG_CACHE_LINE_SIZE);
 
-	/* size of data pages */
+	/* size of data pages, plus alignment padding */
+	size = add_size(size, PG_IO_ALIGN_SIZE);
 	size = add_size(size, mul_size(NBuffers, BLCKSZ));
 
 	/* size of stuff controlled by freelist.c */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 908a8934bd..033f230b1d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4261,7 +4261,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	bool		use_wal;
 	BlockNumber nblocks;
 	BlockNumber blkno;
-	PGAlignedBlock buf;
+	PGIOAlignedBlock buf;
 	BufferAccessStrategy bstrategy_src;
 	BufferAccessStrategy bstrategy_dst;
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3846d3eaca..aae02949ce 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -735,8 +735,11 @@ GetLocalBufferStorage(void)
 		/* And don't overflow MaxAllocSize, either */
 		num_bufs = Min(num_bufs, MaxAllocSize / BLCKSZ);
 
-		cur_block = (char *) MemoryContextAlloc(LocalBufferContext,
-												num_bufs * BLCKSZ);
+		/* Buffers should be I/O aligned. */
+		cur_block = (char *)
+			TYPEALIGN(PG_IO_ALIGN_SIZE,
+					  MemoryContextAlloc(LocalBufferContext,
+										 num_bufs * BLCKSZ + PG_IO_ALIGN_SIZE));
 		next_buf_in_block = 0;
 		num_bufs_in_block = num_bufs;
 	}
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 37ea8ac6b7..84ead85942 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -95,6 +95,12 @@ struct BufFile
 	off_t		curOffset;		/* offset part of current pos */
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
+
+	/*
+	 * XXX Should ideally use PGIOAlignedBlock, but might need a way to avoid
+	 * wasting per-file alignment padding when some users create many
+	 * files.
+	 */
 	PGAlignedBlock buffer;
 };
 
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 92994f8f39..9a302ddc30 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1522,7 +1522,10 @@ PageSetChecksumCopy(Page page, BlockNumber blkno)
 	 * and second to avoid wasting space in processes that never call this.
 	 */
 	if (pageCopy == NULL)
-		pageCopy = MemoryContextAlloc(TopMemoryContext, BLCKSZ);
+		pageCopy = MemoryContextAllocAligned(TopMemoryContext,
+											 BLCKSZ,
+											 PG_IO_ALIGN_SIZE,
+											 0);
 
 	memcpy(pageCopy, (char *) page, BLCKSZ);
 	((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1c2d1405f8..efa9773a4d 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -453,6 +453,10 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	int			nbytes;
 	MdfdVec    *v;
 
+	/* If this build supports direct I/O, the buffer must be I/O aligned. */
+	if (PG_O_DIRECT != 0)
+		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert(blocknum >= mdnblocks(reln, forknum));
@@ -783,6 +787,10 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	int			nbytes;
 	MdfdVec    *v;
 
+	/* If this build supports direct I/O, the buffer must be I/O aligned. */
+	if (PG_O_DIRECT != 0)
+		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
 	TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
 										reln->smgr_rlocator.locator.spcOid,
 										reln->smgr_rlocator.locator.dbOid,
@@ -848,6 +856,10 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	int			nbytes;
 	MdfdVec    *v;
 
+	/* If this build supports direct I/O, the buffer must be I/O aligned. */
+	if (PG_O_DIRECT != 0)
+		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert(blocknum < mdnblocks(reln, forknum));
@@ -1424,7 +1436,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 			 */
 			if (nblocks < ((BlockNumber) RELSEG_SIZE))
 			{
-				char	   *zerobuf = palloc0(BLCKSZ);
+				char	   *zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE,
+													 MCXT_ALLOC_ZERO);
 
 				mdextend(reln, forknum,
 						 nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 64ea237438..52b8898d5e 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -252,7 +252,7 @@ ltsWriteBlock(LogicalTapeSet *lts, long blocknum, const void *buffer)
 	 */
 	while (blocknum > lts->nBlocksWritten)
 	{
-		PGAlignedBlock zerobuf;
+		PGIOAlignedBlock zerobuf;
 
 		MemSet(zerobuf.data, 0, sizeof(zerobuf));
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index aa21007497..19eb67e485 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -183,7 +183,7 @@ skipfile(const char *fn)
 static void
 scan_file(const char *fn, int segmentno)
 {
-	PGAlignedBlock buf;
+	PGIOAlignedBlock buf;
 	PageHeader	header = (PageHeader) buf.data;
 	int			f;
 	BlockNumber blockno;
diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c
index da9d75dccb..4e2a1376c6 100644
--- a/src/bin/pg_rewind/local_source.c
+++ b/src/bin/pg_rewind/local_source.c
@@ -77,7 +77,7 @@ static void
 local_queue_fetch_file(rewind_source *source, const char *path, size_t len)
 {
 	const char *datadir = ((local_source *) source)->datadir;
-	PGAlignedBlock buf;
+	PGIOAlignedBlock buf;
 	char		srcpath[MAXPGPATH];
 	int			srcfd;
 	size_t		written_len;
@@ -129,7 +129,7 @@ local_queue_fetch_range(rewind_source *source, const char *path, off_t off,
 						size_t len)
 {
 	const char *datadir = ((local_source *) source)->datadir;
-	PGAlignedBlock buf;
+	PGIOAlignedBlock buf;
 	char		srcpath[MAXPGPATH];
 	int			srcfd;
 	off_t		begin = off;
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index ed874507ff..d173602882 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -178,8 +178,8 @@ rewriteVisibilityMap(const char *fromfile, const char *tofile,
 {
 	int			src_fd;
 	int			dst_fd;
-	PGAlignedBlock buffer;
-	PGAlignedBlock new_vmbuf;
+	PGIOAlignedBlock buffer;
+	PGIOAlignedBlock new_vmbuf;
 	ssize_t		totalBytesRead = 0;
 	ssize_t		src_filesize;
 	int			rewriteVmBytesPerPage;
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index d568d83b9f..74833c4acb 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -540,8 +540,8 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 ssize_t
 pg_pwrite_zeros(int fd, size_t size, off_t offset)
 {
-	static const PGAlignedBlock zbuffer = {{0}};	/* worth BLCKSZ */
-	void	   *zerobuf_addr = unconstify(PGAlignedBlock *, &zbuffer)->data;
+	static const PGIOAlignedBlock zbuffer = {{0}};	/* worth BLCKSZ */
+	void	   *zerobuf_addr = unconstify(PGIOAlignedBlock *, &zbuffer)->data;
 	struct iovec iov[PG_IOV_MAX];
 	size_t		remaining_size = size;
 	ssize_t		total_written = 0;
diff --git a/src/include/c.h b/src/include/c.h
index 5fe7a97ff0..f69d739be5 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -1119,14 +1119,11 @@ extern void ExceptionalCondition(const char *conditionName,
 
 /*
  * Use this, not "char buf[BLCKSZ]", to declare a field or local variable
- * holding a page buffer, if that page might be accessed as a page and not
- * just a string of bytes.  Otherwise the variable might be under-aligned,
- * causing problems on alignment-picky hardware.  (In some places, we use
- * this to declare buffers even though we only pass them to read() and
- * write(), because copying to/from aligned buffers is usually faster than
- * using unaligned buffers.)  We include both "double" and "int64" in the
- * union to ensure that the compiler knows the value must be MAXALIGN'ed
- * (cf. configure's computation of MAXIMUM_ALIGNOF).
+ * holding a page buffer, if that page might be accessed as a page.  Otherwise
+ * the variable might be under-aligned, causing problems on alignment-picky
+ * hardware.  We include both "double" and "int64" in the union to ensure that
+ * the compiler knows the value must be MAXALIGN'ed (cf. configure's
+ * computation of MAXIMUM_ALIGNOF).
  */
 typedef union PGAlignedBlock
 {
@@ -1135,9 +1132,30 @@ typedef union PGAlignedBlock
 	int64		force_align_i64;
 } PGAlignedBlock;
 
+/*
+ * Use this to declare a field or local variable holding a page buffer, if that
+ * page might be accessed as a page or passed to an SMgr I/O function.  If
+ * allocating using the MemoryContext API, the aligned allocation functions
+ * should be used with PG_IO_ALIGN_SIZE.  This alignment may be more efficient
+ * for I/O in general, but may be strictly required on some platforms when
+ * using direct I/O.
+ */
+typedef union PGIOAlignedBlock
+{
+#ifdef pg_attribute_aligned
+	pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#endif
+	char		data[BLCKSZ];
+	double		force_align_d;
+	int64		force_align_i64;
+} PGIOAlignedBlock;
+
 /* Same, but for an XLOG_BLCKSZ-sized buffer */
 typedef union PGAlignedXLogBlock
 {
+#ifdef pg_attribute_aligned
+	pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#endif
 	char		data[XLOG_BLCKSZ];
 	double		force_align_d;
 	int64		force_align_i64;
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index b586ee269a..c799bc2013 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -227,6 +227,12 @@
  */
 #define PG_CACHE_LINE_SIZE		128
 
+/*
+ * Assumed alignment requirement for direct I/O.  4K corresponds to sector size
+ * on modern storage, and works also for older 512 byte sectors.
+ */
+#define PG_IO_ALIGN_SIZE		4096
+
 /*
  *------------------------------------------------------------------------
  * The following symbols are for enabling debugging code, not for
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index daceafd473..faac4914fe 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -82,9 +82,10 @@ extern PGDLLIMPORT int max_safe_fds;
  * to the appropriate Windows flag in src/port/open.c.  We simulate it with
  * fcntl(F_NOCACHE) on macOS inside fd.c's open() wrapper.  We use the name
  * PG_O_DIRECT rather than defining O_DIRECT in that case (probably not a good
- * idea on a Unix).
+ * idea on a Unix).  We can only use it if the compiler will correctly align
+ * PGIOAlignedBlock for us, though.
  */
-#if defined(O_DIRECT)
+#if defined(O_DIRECT) && defined(pg_attribute_aligned)
 #define		PG_O_DIRECT O_DIRECT
 #elif defined(F_NOCACHE)
 #define		PG_O_DIRECT 0x80000000
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3219ea5f05..0313b2c93a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1703,6 +1703,7 @@ PGEventResultDestroy
 PGFInfoFunction
 PGFileType
 PGFunction
+PGIOAlignedBlock
 PGLZ_HistEntry
 PGLZ_Strategy
 PGLoadBalanceType
-- 
2.39.2

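The buffer-pool and local-buffer hunks in the patch above both get I/O alignment the same way: ask for PG_IO_ALIGN_SIZE extra bytes, then round the returned pointer up with TYPEALIGN. Here is a minimal standalone sketch of that idea outside PostgreSQL, assuming plain malloc() instead of shared memory or memory contexts. IO_ALIGN_SIZE, BLOCK_SIZE, BlockPool, align_up(), is_io_aligned(), pool_init() and demo_aligned_pool() are illustrative names, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define IO_ALIGN_SIZE 4096		/* stand-in for PG_IO_ALIGN_SIZE (assumption) */
#define BLOCK_SIZE    8192		/* stand-in for the default BLCKSZ (assumption) */

/* Hypothetical pool: keeps the raw pointer for free() and the aligned start. */
typedef struct BlockPool
{
	void	   *raw;			/* what malloc() returned */
	char	   *blocks;			/* first I/O-aligned block */
} BlockPool;

/* Round ptr up to the next multiple of align, which must be a power of two. */
static char *
align_up(void *ptr, uintptr_t align)
{
	uintptr_t	p = (uintptr_t) ptr;

	return (char *) ((p + align - 1) & ~(align - 1));
}

/* Same kind of check as the new Assert() calls in md.c, as a plain predicate. */
static int
is_io_aligned(const void *ptr, uintptr_t align)
{
	return ((uintptr_t) ptr & (align - 1)) == 0;
}

/* Over-allocate by one alignment unit so that rounding up stays in bounds. */
static int
pool_init(BlockPool *pool, size_t nblocks)
{
	pool->raw = malloc(nblocks * BLOCK_SIZE + IO_ALIGN_SIZE);
	if (pool->raw == NULL)
		return -1;

	pool->blocks = align_up(pool->raw, IO_ALIGN_SIZE);
	assert(is_io_aligned(pool->blocks, IO_ALIGN_SIZE));
	memset(pool->blocks, 0, nblocks * BLOCK_SIZE);
	return 0;
}

/* Returns 0 if every block in a freshly built pool is I/O aligned. */
static int
demo_aligned_pool(size_t nblocks)
{
	BlockPool	pool;
	size_t		i;
	int			ok = 1;

	if (pool_init(&pool, nblocks) != 0)
		return -1;

	/* BLOCK_SIZE is a multiple of IO_ALIGN_SIZE, so each block stays aligned. */
	for (i = 0; i < nblocks; i++)
		ok = ok && is_io_aligned(pool.blocks + i * BLOCK_SIZE, IO_ALIGN_SIZE);

	free(pool.raw);
	return ok ? 0 : 1;
}
```

The cost is at most IO_ALIGN_SIZE bytes of padding per allocation, which is negligible for one large pool but adds up for many small buffers. That is presumably why the patch leaves BufFile on PGAlignedBlock for now.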
v4-0002-Add-io_direct-setting-developer-only.patch (text/x-patch; charset=US-ASCII)
From 3db83b7289b85d1c84c5490e1d43e378b5ed3053 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:54:18 +1300
Subject: [PATCH v4 2/3] Add io_direct setting (developer-only).

Provide a way to ask the kernel to use O_DIRECT (or local equivalent)
for data and WAL files.  This hurts performance currently and is not
intended for end-users yet.  Later proposed work would introduce our own
I/O clustering, read-ahead, etc to replace the kernel features that are
disabled with this option.

The only user-visible change, if the developer-only GUC is not used, is
that this commit also removes the obscure logic that would activate
O_DIRECT for the WAL when wal_sync_method=open_[data]sync and
wal_level=minimal (which also requires max_wal_senders=0).  Those are
non-default and unlikely settings, and this behavior wasn't (correctly)
documented.  In the unlikely event that a user wants that functionality
back, io_direct=wal is a more direct way to say so.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      | 34 ++++++++-
 src/backend/access/transam/xlog.c             | 37 ++++-----
 src/backend/access/transam/xlogprefetcher.c   |  2 +-
 src/backend/storage/buffer/bufmgr.c           | 16 ++--
 src/backend/storage/buffer/localbuf.c         |  7 +-
 src/backend/storage/file/fd.c                 | 76 +++++++++++++++++++
 src/backend/storage/smgr/md.c                 | 24 ++++--
 src/backend/storage/smgr/smgr.c               |  1 +
 src/backend/utils/misc/guc_tables.c           | 12 +++
 src/include/storage/fd.h                      |  7 ++
 src/include/storage/smgr.h                    |  1 +
 src/include/utils/guc_hooks.h                 |  2 +
 src/test/modules/test_misc/meson.build        |  1 +
 src/test/modules/test_misc/t/004_io_direct.pl | 48 ++++++++++++
 14 files changed, 233 insertions(+), 35 deletions(-)
 create mode 100644 src/test/modules/test_misc/t/004_io_direct.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 25111d5caf..fc885c43a8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3155,7 +3155,6 @@ include_dir 'conf.d'
         </listitem>
        </itemizedlist>
        <para>
-        The <literal>open_</literal>* options also use <literal>O_DIRECT</literal> if available.
         Not all of these choices are available on all platforms.
         The default is the first method in the above list that is supported
         by the platform, except that <literal>fdatasync</literal> is the default on
@@ -11241,6 +11240,39 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-io-direct" xreflabel="io_direct">
+      <term><varname>io_direct</varname> (<type>string</type>)
+      <indexterm>
+        <primary><varname>io_direct</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Ask the kernel to minimize caching effects for relation data and WAL
+        files using <literal>O_DIRECT</literal> (most Unix-like systems),
+        <literal>F_NOCACHE</literal> (macOS) or
+        <literal>FILE_FLAG_NO_BUFFERING</literal> (Windows).
+       </para>
+       <para>
+        May be set to an empty string (the default) to disable use of direct
+        I/O, or a comma-separated list of types of files for which direct I/O
+        is enabled.  The valid types of file are <literal>data</literal> for
+        main data files, <literal>wal</literal> for WAL files, and
+        <literal>wal_init</literal> for WAL files when being initially
+        allocated.
+       </para>
+       <para>
+        Some operating systems and file systems do not support direct I/O, so
+        non-default settings may be rejected at startup, or produce I/O errors
+        at runtime.
+       </para>
+       <para>
+        Currently this feature reduces performance, and is intended for
+        developer testing only.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-post-auth-delay" xreflabel="post_auth_delay">
       <term><varname>post_auth_delay</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3fea8c4082..7a555d8701 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2926,6 +2926,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	XLogSegNo	max_segno;
 	int			fd;
 	int			save_errno;
+	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
 
 	Assert(logtli != 0);
 
@@ -2959,8 +2960,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 
 	unlink(tmppath);
 
+	if (io_direct_flags & IO_DIRECT_WAL_INIT)
+		open_flags |= PG_O_DIRECT;
+
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	fd = BasicOpenFile(tmppath, open_flags);
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -3354,7 +3358,7 @@ XLogFileClose(void)
 	 * use the cache to read the WAL segment.
 	 */
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-	if (!XLogIsNeeded())
+	if (!XLogIsNeeded() && (io_direct_flags & IO_DIRECT_WAL) == 0)
 		(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
 #endif
 
@@ -4445,7 +4449,6 @@ show_in_hot_standby(void)
 	return RecoveryInProgress() ? "on" : "off";
 }
 
-
 /*
  * Read the control file, set respective GUCs.
  *
@@ -8030,35 +8033,27 @@ xlog_redo(XLogReaderState *record)
 }
 
 /*
- * Return the (possible) sync flag used for opening a file, depending on the
- * value of the GUC wal_sync_method.
+ * Return the extra open flags used for opening a file, depending on the
+ * value of the GUCs wal_sync_method, fsync and io_direct.
  */
 static int
 get_sync_bit(int method)
 {
 	int			o_direct_flag = 0;
 
-	/* If fsync is disabled, never open in sync mode */
-	if (!enableFsync)
-		return 0;
-
 	/*
-	 * Optimize writes by bypassing kernel cache with O_DIRECT when using
-	 * O_SYNC and O_DSYNC.  But only if archiving and streaming are disabled,
-	 * otherwise the archive command or walsender process will read the WAL
-	 * soon after writing it, which is guaranteed to cause a physical read if
-	 * we bypassed the kernel cache. We also skip the
-	 * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
-	 * reason.
-	 *
-	 * Never use O_DIRECT in walreceiver process for similar reasons; the WAL
+	 * Use O_DIRECT if requested, except in walreceiver process.  The WAL
 	 * written by walreceiver is normally read by the startup process soon
-	 * after it's written. Also, walreceiver performs unaligned writes, which
+	 * after it's written.  Also, walreceiver performs unaligned writes, which
 	 * don't work with O_DIRECT, so it is required for correctness too.
 	 */
-	if (!XLogIsNeeded() && !AmWalReceiverProcess())
+	if ((io_direct_flags & IO_DIRECT_WAL) && !AmWalReceiverProcess())
 		o_direct_flag = PG_O_DIRECT;
 
+	/* If fsync is disabled, never open in sync mode */
+	if (!enableFsync)
+		return o_direct_flag;
+
 	switch (method)
 	{
 			/*
@@ -8070,7 +8065,7 @@ get_sync_bit(int method)
 		case SYNC_METHOD_FSYNC:
 		case SYNC_METHOD_FSYNC_WRITETHROUGH:
 		case SYNC_METHOD_FDATASYNC:
-			return 0;
+			return o_direct_flag;
 #ifdef O_SYNC
 		case SYNC_METHOD_OPEN:
 			return O_SYNC | o_direct_flag;
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 046e40d143..7ba18f2a76 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -785,7 +785,7 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 				block->prefetch_buffer = InvalidBuffer;
 				return LRQ_NEXT_IO;
 			}
-			else
+			else if ((io_direct_flags & IO_DIRECT_DATA) == 0)
 			{
 				/*
 				 * This shouldn't be possible, because we already determined
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 033f230b1d..5cb026d1ca 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -541,8 +541,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
 		 * Try to initiate an asynchronous read.  This returns false in
 		 * recovery if the relation file doesn't exist.
 		 */
-		if (smgrprefetch(smgr_reln, forkNum, blockNum))
+		if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+			smgrprefetch(smgr_reln, forkNum, blockNum))
+		{
 			result.initiated_io = true;
+		}
 #endif							/* USE_PREFETCH */
 	}
 	else
@@ -588,11 +591,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
  * the kernel and therefore didn't really initiate I/O, and no way to know when
  * the I/O completes other than using synchronous ReadBuffer().
  *
- * 3.  Otherwise, the buffer wasn't already cached by PostgreSQL, and either
+ * 3.  Otherwise, the buffer wasn't already cached by PostgreSQL, and
  * USE_PREFETCH is not defined (this build doesn't support prefetching due to
- * lack of a kernel facility), or the underlying relation file wasn't found and
- * we are in recovery.  (If the relation file wasn't found and we are not in
- * recovery, an error is raised).
+ * lack of a kernel facility), direct I/O is enabled, or the underlying
+ * relation file wasn't found and we are in recovery.  (If the relation file
+ * wasn't found and we are not in recovery, an error is raised).
  */
 PrefetchBufferResult
 PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
@@ -5451,6 +5454,9 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
 {
 	PendingWriteback *pending;
 
+	if (io_direct_flags & IO_DIRECT_DATA)
+		return;
+
 	/*
 	 * Add buffer to the pending writeback array, unless writeback control is
 	 * disabled.
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index aae02949ce..c6384c9fde 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -92,8 +92,11 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 	{
 #ifdef USE_PREFETCH
 		/* Not in buffers, so initiate prefetch */
-		smgrprefetch(smgr, forkNum, blockNum);
-		result.initiated_io = true;
+		if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+			smgrprefetch(smgr, forkNum, blockNum))
+		{
+			result.initiated_io = true;
+		}
 #endif							/* USE_PREFETCH */
 	}
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index a280a1e7be..ccc789dc03 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -98,7 +98,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "utils/guc.h"
+#include "utils/guc_hooks.h"
 #include "utils/resowner_private.h"
+#include "utils/varlena.h"
 
 /* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
 #if defined(HAVE_SYNC_FILE_RANGE)
@@ -162,6 +164,9 @@ bool		data_sync_retry = false;
 /* How SyncDataDirectory() should do its job. */
 int			recovery_init_sync_method = RECOVERY_INIT_SYNC_METHOD_FSYNC;
 
+/* Which kinds of files should be opened with PG_O_DIRECT. */
+int			io_direct_flags;
+
 /* Debugging.... */
 
 #ifdef FDDEBUG
@@ -2022,6 +2027,9 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 	if (nbytes <= 0)
 		return;
 
+	if (VfdCache[file].fileFlags & PG_O_DIRECT)
+		return;
+
 	returnCode = FileAccess(file);
 	if (returnCode < 0)
 		return;
@@ -3826,3 +3834,71 @@ data_sync_elevel(int elevel)
 {
 	return data_sync_retry ? elevel : PANIC;
 }
+
+bool
+check_io_direct(char **newval, void **extra, GucSource source)
+{
+	int		flags;
+
+#if PG_O_DIRECT == 0
+	if (strcmp(*newval, "") != 0)
+	{
+		GUC_check_errdetail("io_direct is not supported on this platform.");
+		return false;
+	}
+	flags = 0;
+#else
+	List	   *elemlist;
+	ListCell   *l;
+	char	   *rawstring;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	if (!SplitGUCList(rawstring, ',', &elemlist))
+	{
+		GUC_check_errdetail("invalid list syntax in parameter \"%s\"",
+							"io_direct");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	flags = 0;
+	foreach (l, elemlist)
+	{
+		char	   *item = (char *) lfirst(l);
+
+		if (pg_strcasecmp(item, "data") == 0)
+			flags |= IO_DIRECT_DATA;
+		else if (pg_strcasecmp(item, "wal") == 0)
+			flags |= IO_DIRECT_WAL;
+		else if (pg_strcasecmp(item, "wal_init") == 0)
+			flags |= IO_DIRECT_WAL_INIT;
+		else
+		{
+			GUC_check_errdetail("invalid option \"%s\"", item);
+			pfree(rawstring);
+			list_free(elemlist);
+			return false;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+#endif
+
+	/* Save the flags in *extra, for use by assign_io_direct */
+	*extra = guc_malloc(ERROR, sizeof(int));
+	*((int *) *extra) = flags;
+
+	return true;
+}
+
+extern void
+assign_io_direct(const char *newval, void *extra)
+{
+	int	   *flags = (int *) extra;
+
+	io_direct_flags = *flags;
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index efa9773a4d..5647abeffd 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -142,6 +142,16 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
 static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
 							  MdfdVec *seg);
 
+static inline int
+_mdfd_open_flags(ForkNumber forkNum)
+{
+	int		flags = O_RDWR | PG_BINARY;
+
+	if (io_direct_flags & IO_DIRECT_DATA)
+		flags |= PG_O_DIRECT;
+
+	return flags;
+}
 
 /*
  *	mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -205,14 +215,14 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 
 	path = relpath(reln->smgr_rlocator, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	fd = PathNameOpenFile(path, _mdfd_open_flags(forknum) | O_CREAT | O_EXCL);
 
 	if (fd < 0)
 	{
 		int			save_errno = errno;
 
 		if (isRedo)
-			fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+			fd = PathNameOpenFile(path, _mdfd_open_flags(forknum));
 		if (fd < 0)
 		{
 			/* be sure to report the error reported by create, not open */
@@ -635,7 +645,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 
 	path = relpath(reln->smgr_rlocator, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+	fd = PathNameOpenFile(path, _mdfd_open_flags(forknum));
 
 	if (fd < 0)
 	{
@@ -706,6 +716,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 	off_t		seekpos;
 	MdfdVec    *v;
 
+	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
+
 	v = _mdfd_getseg(reln, forknum, blocknum, false,
 					 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
 	if (v == NULL)
@@ -731,6 +743,8 @@ void
 mdwriteback(SMgrRelation reln, ForkNumber forknum,
 			BlockNumber blocknum, BlockNumber nblocks)
 {
+	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
+
 	/*
 	 * Issue flush requests in as few requests as possible; have to split at
 	 * segment boundaries though, since those are actually separate files.
@@ -1330,7 +1344,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 	fullpath = _mdfd_segpath(reln, forknum, segno);
 
 	/* open the file */
-	fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | oflags);
+	fd = PathNameOpenFile(fullpath, _mdfd_open_flags(forknum) | oflags);
 
 	pfree(fullpath);
 
@@ -1540,7 +1554,7 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		strlcpy(path, p, MAXPGPATH);
 		pfree(p);
 
-		file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+		file = PathNameOpenFile(path, _mdfd_open_flags(ftag->forknum));
 		if (file < 0)
 			return -1;
 		need_to_close = true;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index c37c246b77..70d0d570b1 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -20,6 +20,7 @@
 #include "access/xlogutils.h"
 #include "lib/ilist.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/md.h"
 #include "storage/smgr.h"
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e8e8245e91..d3ed527e3b 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -568,6 +568,7 @@ static char *locale_ctype;
 static char *server_encoding_string;
 static char *server_version_string;
 static int	server_version_num;
+static char *io_direct_string;
 
 #ifdef HAVE_SYSLOG
 #define	DEFAULT_SYSLOG_FACILITY LOG_LOCAL0
@@ -4565,6 +4566,17 @@ struct config_string ConfigureNamesString[] =
 		check_backtrace_functions, assign_backtrace_functions, NULL
 	},
 
+	{
+		{"io_direct", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+			gettext_noop("Use direct I/O for file access."),
+			NULL,
+			GUC_LIST_INPUT | GUC_NOT_IN_SAMPLE
+		},
+		&io_direct_string,
+		"",
+		check_io_direct, assign_io_direct, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index faac4914fe..6791a406fc 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -44,6 +44,7 @@
 #define FD_H
 
 #include <dirent.h>
+#include <fcntl.h>
 
 typedef enum RecoveryInitSyncMethod
 {
@@ -54,10 +55,16 @@ typedef enum RecoveryInitSyncMethod
 typedef int File;
 
 
+#define IO_DIRECT_DATA			0x01
+#define IO_DIRECT_WAL			0x02
+#define IO_DIRECT_WAL_INIT		0x04
+
+
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
 extern PGDLLIMPORT int recovery_init_sync_method;
+extern PGDLLIMPORT int io_direct_flags;
 
 /*
  * This is private to fd.c, but exported for save/restore_backend_variables()
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..17fba6f91a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -17,6 +17,7 @@
 #include "lib/ilist.h"
 #include "storage/block.h"
 #include "storage/relfilelocator.h"
+#include "utils/guc.h"
 
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index f722fb250a..a82a85c940 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -156,5 +156,7 @@ extern bool check_wal_consistency_checking(char **newval, void **extra,
 										   GucSource source);
 extern void assign_wal_consistency_checking(const char *newval, void *extra);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+extern bool check_io_direct(char **newval, void **extra, GucSource source);
+extern void assign_io_direct(const char *newval, void *extra);
 
 #endif							/* GUC_HOOKS_H */
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 21bde427b4..911084ac0f 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -9,6 +9,7 @@ tests += {
       't/001_constraint_validation.pl',
       't/002_tablespace.pl',
       't/003_check_guc.pl',
+      't/004_io_direct.pl',
     ],
   },
 }
diff --git a/src/test/modules/test_misc/t/004_io_direct.pl b/src/test/modules/test_misc/t/004_io_direct.pl
new file mode 100644
index 0000000000..78646e945e
--- /dev/null
+++ b/src/test/modules/test_misc/t/004_io_direct.pl
@@ -0,0 +1,48 @@
+# Very simple exercise of direct I/O GUC.
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Systems that we know to have direct I/O support, and whose typical local
+# filesystems support it or at least won't fail with an error.  (illumos should
+# probably be in this list, but perl reports it as solaris.  Solaris should not
+# be in the list because we don't support its way of turning on direct I/O, and
+# even if we did, its version of ZFS rejects it, and OpenBSD just doesn't have
+# it.)
+if (!grep { $^O eq $_ } qw(aix darwin dragonfly freebsd linux MSWin32 netbsd))
+{
+	plan skip_all => "no direct I/O support";
+}
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->append_conf('postgresql.conf', qq{
+io_direct = 'data,wal,wal_init'
+shared_buffers = '256kB' # tiny to force I/O
+});
+$node->start;
+
+# Do some work that is bound to generate shared and local writes and reads as a
+# simple exercise.
+$node->safe_psql('postgres', 'create table t1 as select 1 as i from generate_series(1, 10000)');
+$node->safe_psql('postgres', 'create table t2count (i int)');
+$node->safe_psql('postgres', qq{
+begin;
+create temporary table t2 as select 1 as i from generate_series(1, 10000);
+update t2 set i = i;
+insert into t2count select count(*) from t2;
+commit;
+});
+$node->safe_psql('postgres', 'update t1 set i = i');
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'), "read back from shared");
+is('10000', $node->safe_psql('postgres', 'select * from t2count'), "read back from local");
+$node->stop('immediate');
+
+$node->start;
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'), "read back from shared after crash recovery");
+$node->stop;
+
+done_testing();
-- 
2.39.2
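
The patch above declares check_io_direct() and assign_io_direct() and defines the IO_DIRECT_* bits, but the hook bodies are not part of this excerpt. As a rough illustration only, here is a minimal sketch of how a list value such as 'data,wal,wal_init' could be turned into that bitmask. The function name parse_io_direct_list() and its error convention are hypothetical; this is not the code from the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Same bit values as the fd.h hunk in the patch. */
#define IO_DIRECT_DATA			0x01
#define IO_DIRECT_WAL			0x02
#define IO_DIRECT_WAL_INIT		0x04

/*
 * Hypothetical helper (not from the patch): parse a comma-separated list
 * such as "data,wal" into a bitmask.  Returns -1 if any element is not
 * recognised.  An empty string yields 0, i.e. direct I/O is off.
 */
static int
parse_io_direct_list(const char *value)
{
	int			flags = 0;
	const char *p = value;

	assert(value != NULL);

	while (*p != '\0')
	{
		const char *start;
		size_t		len;

		/* skip separators and surrounding whitespace */
		while (*p == ',' || isspace((unsigned char) *p))
			p++;
		if (*p == '\0')
			break;

		/* find the end of this element */
		start = p;
		while (*p != '\0' && *p != ',' && !isspace((unsigned char) *p))
			p++;
		len = (size_t) (p - start);

		if (len == 4 && strncmp(start, "data", 4) == 0)
			flags |= IO_DIRECT_DATA;
		else if (len == 3 && strncmp(start, "wal", 3) == 0)
			flags |= IO_DIRECT_WAL;
		else if (len == 8 && strncmp(start, "wal_init", 8) == 0)
			flags |= IO_DIRECT_WAL_INIT;
		else
			return -1;			/* unknown element */
	}

	return flags;
}
```

With a mask like this, callers such as _mdfd_open_flags() only need a single bit test, which is what the md.c hunks do with io_direct_flags & IO_DIRECT_DATA.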

Attachment: v4-0003-XXX-turn-on-direct-I-O-by-default-just-for-CI.patch (text/x-patch; charset=US-ASCII)
From f5318b888ad14f4f88ccd71511c64b3b990d939b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:55:09 +1300
Subject: [PATCH v4 3/3] XXX turn on direct I/O by default, just for CI

---
 src/backend/utils/misc/guc_tables.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index d3ed527e3b..2f95b86e19 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -4573,7 +4573,7 @@ struct config_string ConfigureNamesString[] =
 			GUC_LIST_INPUT | GUC_NOT_IN_SAMPLE
 		},
 		&io_direct_string,
-		"",
+		"data,wal,wal_init",
 		check_io_direct, assign_io_direct, NULL
 	},
 
-- 
2.39.2

#14Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#13)
Re: Direct I/O

I did some testing with non-default block sizes, and found a few minor
things that needed adjustment. The short version is that I blocked
some configurations that won't work or would break an assertion.
After a bit more copy-editing on docs and comments and a round of
automated indenting, I have now pushed this. I will now watch the
build farm. I tested on quite a few OSes that I have access to, but
this is obviously a very OS-sensitive kind of a thing.

The adjustments were:

1. If you set your BLCKSZ or XLOG_BLCKSZ smaller than
PG_IO_ALIGN_SIZE, you shouldn't be allowed to turn on direct I/O for
the relevant operations, because such undersized direct I/Os will fail
on common systems.

FATAL: invalid value for parameter "io_direct": "wal"
DETAIL: io_direct is not supported for WAL because XLOG_BLCKSZ is too small

FATAL: invalid value for parameter "io_direct": "data"
DETAIL: io_direct is not supported for data because BLCKSZ is too small

In fact some systems would be OK with it if the true requirement is
512 not 4096, but (1) tiny blocks are a niche build option that
doesn't even pass regression tests and (2) it's hard and totally
unportable to find out the true requirement at runtime, and (3) the
conservative choice of 4096 has additional benefits by matching memory
pages. So I think a conservative compile-time number is a good
starting position.

2. Previously I had changed the WAL buffer alignment to be the larger
of PG_IO_ALIGN_SIZE and XLOG_BLCKSZ, but in light of the above
thinking, I reverted that part (no point in aligning the address of
the buffer when the size is too small for direct I/O, but now that
combination is blocked off at GUC level so we don't need any change
here).

3. I updated the md.c alignment assertions to allow for tiny blocks.
The point of these assertions is to fail if any new code does I/O from
badly aligned buffers even with io_direct turned off (ie how most
people hack), 'cause that will fail with io_direct turned on. The
change is that I don't make the assertion if you're using BLCKSZ <
PG_IO_ALIGN_SIZE. Such buffers wouldn't work if used for direct I/O
but that's OK, the GUC won't allow it.

4. I made the language to explain where PG_IO_ALIGN_SIZE really comes
from a little vaguer because it's complex.
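
To make adjustments 1 and 3 concrete, here is a small sketch of the two checks they describe: rejecting direct I/O when the block size is below the alignment size, and only asserting buffer alignment when blocks are large enough for it to matter. The names io_direct_blocksize_problem(), buffer_ok_for_direct_io() and SKETCH_IO_ALIGN_SIZE are made up for this illustration, the value 4096 is taken from the "conservative choice of 4096" mentioned above, and treating wal_init the same as wal is an assumption. It is not the committed code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IO_DIRECT_DATA			0x01
#define IO_DIRECT_WAL			0x02
#define IO_DIRECT_WAL_INIT		0x04

/* Assumed stand-in for PG_IO_ALIGN_SIZE, using the conservative 4096. */
#define SKETCH_IO_ALIGN_SIZE	4096

/*
 * Return NULL if the requested flags are acceptable for the given block
 * sizes, otherwise a detail string modelled on the messages quoted above.
 */
static const char *
io_direct_blocksize_problem(int flags, size_t blcksz, size_t xlog_blcksz)
{
	if ((flags & IO_DIRECT_DATA) && blcksz < SKETCH_IO_ALIGN_SIZE)
		return "io_direct is not supported for data because BLCKSZ is too small";

	/* Assumption: wal_init is restricted in the same way as wal. */
	if ((flags & (IO_DIRECT_WAL | IO_DIRECT_WAL_INIT)) &&
		xlog_blcksz < SKETCH_IO_ALIGN_SIZE)
		return "io_direct is not supported for WAL because XLOG_BLCKSZ is too small";

	return NULL;
}

/*
 * Alignment predicate in the spirit of adjustment 3: with tiny blocks the
 * check is skipped, because direct I/O is already refused for them.
 */
static int
buffer_ok_for_direct_io(const void *buffer, size_t blcksz)
{
	if (blcksz < SKETCH_IO_ALIGN_SIZE)
		return 1;

	return ((uintptr_t) buffer % SKETCH_IO_ALIGN_SIZE) == 0;
}
```

The second function is the kind of condition one would put inside an Assert() so that badly aligned buffers are caught even on builds that never turn io_direct on.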

#15Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#14)
Re: Direct I/O

On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:

After a bit more copy-editing on docs and comments and a round of
automated indenting, I have now pushed this. I will now watch the
build farm. I tested on quite a few OSes that I have access to, but
this is obviously a very OS-sensitive kind of a thing.

Hmm. I see a strange "invalid page" failure on Andrew's machine crake
in 004_io_direct.pl. Let's see what else comes in.

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#14)
Re: Direct I/O

Thomas Munro <thomas.munro@gmail.com> writes:

I did some testing with non-default block sizes, and found a few minor
things that needed adjustment.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2023-04-08%2004%3A42%3A04

This seems like another thing that should not have been pushed mere
hours before feature freeze.

regards, tom lane

#17Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#15)
Re: Direct I/O

Hi,

On 2023-04-08 16:59:20 +1200, Thomas Munro wrote:

On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:

After a bit more copy-editing on docs and comments and a round of
automated indenting, I have now pushed this. I will now watch the
build farm. I tested on quite a few OSes that I have access to, but
this is obviously a very OS-sensitive kind of a thing.

Hmm. I see a strange "invalid page" failure on Andrew's machine crake
in 004_io_direct.pl. Let's see what else comes in.

There were some failures in CI (e.g. [1], and perhaps also bf, didn't yet
check), about "no unpinned buffers available". I was worried for a moment
that this could actually be related to the bulk extension patch.

But it looks like it's older - and not caused by direct_io support (except by
way of the test existing). I reproduced the issue locally by setting s_b even
lower, to 16 and made the ERROR a PANIC.

#4 0x00005624dfe90336 in errfinish (filename=0x5624df6867c0 "../../../../home/andres/src/postgresql/src/backend/storage/buffer/freelist.c", lineno=353,
funcname=0x5624df686900 <__func__.6> "StrategyGetBuffer") at ../../../../home/andres/src/postgresql/src/backend/utils/error/elog.c:604
#5 0x00005624dfc71dbe in StrategyGetBuffer (strategy=0x0, buf_state=0x7ffd4182137c, from_ring=0x7ffd4182137b)
at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/freelist.c:353
#6 0x00005624dfc6a922 in GetVictimBuffer (strategy=0x0, io_context=IOCONTEXT_NORMAL)
at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1601
#7 0x00005624dfc6a29f in BufferAlloc (smgr=0x5624e1ff27f8, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=16, strategy=0x0, foundPtr=0x7ffd418214a3,
io_context=IOCONTEXT_NORMAL) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1290
#8 0x00005624dfc69c0c in ReadBuffer_common (smgr=0x5624e1ff27f8, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=16, mode=RBM_NORMAL, strategy=0x0,
hit=0x7ffd4182156b) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1056
#9 0x00005624dfc69335 in ReadBufferExtended (reln=0x5624e1ee09f0, forkNum=MAIN_FORKNUM, blockNum=16, mode=RBM_NORMAL, strategy=0x0)
at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:776
#10 0x00005624df8eb78a in log_newpage_range (rel=0x5624e1ee09f0, forknum=MAIN_FORKNUM, startblk=0, endblk=45, page_std=false)
at ../../../../home/andres/src/postgresql/src/backend/access/transam/xloginsert.c:1290
#11 0x00005624df9567e7 in smgrDoPendingSyncs (isCommit=true, isParallelWorker=false)
at ../../../../home/andres/src/postgresql/src/backend/catalog/storage.c:837
#12 0x00005624df8d1dd2 in CommitTransaction () at ../../../../home/andres/src/postgresql/src/backend/access/transam/xact.c:2225
#13 0x00005624df8d2da2 in CommitTransactionCommand () at ../../../../home/andres/src/postgresql/src/backend/access/transam/xact.c:3060
#14 0x00005624dfcbe0a1 in finish_xact_command () at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:2779
#15 0x00005624dfcbb867 in exec_simple_query (query_string=0x5624e1eacd98 "create table t1 as select 1 as i from generate_series(1, 10000)")
at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:1299
#16 0x00005624dfcc09c4 in PostgresMain (dbname=0x5624e1ee40e8 "postgres", username=0x5624e1e6c5f8 "andres")
at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4623
#17 0x00005624dfbecc03 in BackendRun (port=0x5624e1ed8250) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4461
#18 0x00005624dfbec48e in BackendStartup (port=0x5624e1ed8250) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4189
#19 0x00005624dfbe8541 in ServerLoop () at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1779
#20 0x00005624dfbe7e56 in PostmasterMain (argc=4, argv=0x5624e1e6a520) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1463
#21 0x00005624dfad538b in main (argc=4, argv=0x5624e1e6a520) at ../../../../home/andres/src/postgresql/src/backend/main/main.c:200

If you look at log_newpage_range(), it's not surprising that we get this error
- it pins up to 32 buffers at once.

Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
c6b92041d385.

It doesn't really seem OK to me to unconditionally pin 32 buffers. For the
relation extension patch I introduced LimitAdditionalPins() to deal with this
concern. Perhaps it needs to be exposed and log_newpage_range() should use
it?

Do we care about fixing this in the backbranches? Probably not, given there
haven't been user complaints?

Greetings,

Andres Freund

[1]: https://cirrus-ci.com/task/4519721039560704

#18Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#15)
Re: Direct I/O

On Sat, Apr 8, 2023 at 4:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:

After a bit more copy-editing on docs and comments and a round of
automated indenting, I have now pushed this. I will now watch the
build farm. I tested on quite a few OSes that I have access to, but
this is obviously a very OS-sensitive kind of a thing.

Hmm. I see a strange "invalid page" failure on Andrew's machine crake
in 004_io_direct.pl. Let's see what else comes in.

No particular ideas about what happened there yet. It *looks* like we
wrote out a page, and then read it back in very soon afterwards, all
via the usual locked bufmgr/smgr pathways, and it failed basic page
validation. The reader was a parallel worker, because of the
debug_parallel_query setting on that box. The page number looks
reasonable (I guess it's reading a page created by the UPDATE full of
new tuples, but I don't know which process wrote it). It's also not
immediately obvious how this could be connected to the 32 pinned
buffer problem (all that would have happened in the CREATE TABLE
process which ended already before the UPDATE and then the SELECT
backends even started).

Andrew, what file system and type of disk is that machine using?

#19Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#17)
1 attachment(s)
Re: Direct I/O

Hi,

On 2023-04-07 23:04:08 -0700, Andres Freund wrote:

There were some failures in CI (e.g. [1], and perhaps also bf, didn't yet
check), about "no unpinned buffers available". I was worried for a moment
that this could actually be related to the bulk extension patch.

But it looks like it's older - and not caused by direct_io support (except by
way of the test existing). I reproduced the issue locally by setting s_b even
lower, to 16 and made the ERROR a PANIC.

[backtrace]

If you look at log_newpage_range(), it's not surprising that we get this error
- it pins up to 32 buffers at once.

Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
c6b92041d385.

It doesn't really seem OK to me to unconditionally pin 32 buffers. For the
relation extension patch I introduced LimitAdditionalPins() to deal with this
concern. Perhaps it needs to be exposed and log_newpage_range() should use
it?

Do we care about fixing this in the backbranches? Probably not, given there
haven't been user complaints?

Here's a quick prototype of this approach. If we expose LimitAdditionalPins(),
we'd probably want to add "Buffer" to the name, and pass it a relation, so
that it can hand off LimitAdditionalLocalPins() when appropriate? The callsite
in question doesn't need it, but ...

Without the limiting of pins the modified 004_io_direct.pl fails 100% of the
time for me.

Presumably the reason it fails occasionally with 256kB of shared buffers
(i.e. NBuffers=32) is that autovacuum or checkpointer briefly pins a single
buffer. As log_newpage_range() thinks it can just pin 32 buffers
unconditionally, it fails in that case.

Greetings,

Andres Freund

Attachments:

limit-pins.diff (text/x-diff; charset=us-ascii)
diff --git i/src/include/storage/bufmgr.h w/src/include/storage/bufmgr.h
index 6ab00daa2ea..e5788309c86 100644
--- i/src/include/storage/bufmgr.h
+++ w/src/include/storage/bufmgr.h
@@ -223,6 +223,7 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
 extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
 									int nlocators);
 extern void DropDatabaseBuffers(Oid dbid);
+extern void LimitAdditionalPins(uint32 *additional_pins);
 
 #define RelationGetNumberOfBlocks(reln) \
 	RelationGetNumberOfBlocksInFork(reln, MAIN_FORKNUM)
diff --git i/src/backend/access/transam/xloginsert.c w/src/backend/access/transam/xloginsert.c
index e2a5a3d13ba..2189fc9f71f 100644
--- i/src/backend/access/transam/xloginsert.c
+++ w/src/backend/access/transam/xloginsert.c
@@ -1268,8 +1268,8 @@ log_newpage_range(Relation rel, ForkNumber forknum,
 
 	/*
 	 * Iterate over all the pages in the range. They are collected into
-	 * batches of XLR_MAX_BLOCK_ID pages, and a single WAL-record is written
-	 * for each batch.
+	 * batches of up to XLR_MAX_BLOCK_ID pages, and a single WAL-record is
+	 * written for each batch.
 	 */
 	XLogEnsureRecordSpace(XLR_MAX_BLOCK_ID - 1, 0);
 
@@ -1278,14 +1278,18 @@ log_newpage_range(Relation rel, ForkNumber forknum,
 	{
 		Buffer		bufpack[XLR_MAX_BLOCK_ID];
 		XLogRecPtr	recptr;
-		int			nbufs;
+		uint32		limit = XLR_MAX_BLOCK_ID;
+		int		nbufs;
 		int			i;
 
 		CHECK_FOR_INTERRUPTS();
 
+		/* avoid running out pinnable buffers */
+		LimitAdditionalPins(&limit);
+
 		/* Collect a batch of blocks. */
 		nbufs = 0;
-		while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
+		while (nbufs < limit && blkno < endblk)
 		{
 			Buffer		buf = ReadBufferExtended(rel, forknum, blkno,
 												 RBM_NORMAL, NULL);
diff --git i/src/backend/storage/buffer/bufmgr.c w/src/backend/storage/buffer/bufmgr.c
index 7778dde3e57..31c75d6240e 100644
--- i/src/backend/storage/buffer/bufmgr.c
+++ w/src/backend/storage/buffer/bufmgr.c
@@ -1742,7 +1742,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
diff --git i/src/test/modules/test_misc/t/004_io_direct.pl w/src/test/modules/test_misc/t/004_io_direct.pl
index f5bf0b11e4e..5791c2ab7bd 100644
--- i/src/test/modules/test_misc/t/004_io_direct.pl
+++ w/src/test/modules/test_misc/t/004_io_direct.pl
@@ -22,7 +22,7 @@ $node->init;
 $node->append_conf(
 	'postgresql.conf', qq{
 io_direct = 'data,wal,wal_init'
-shared_buffers = '256kB' # tiny to force I/O
+shared_buffers = '16' # tiny to force I/O
 });
 $node->start;
 
#20Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#19)
Re: Direct I/O

Hi,

Given the frequency of failures on this in the buildfarm, I propose using the
temporary workaround of using wal_level=replica. That avoids the use of the
over-eager log_newpage_range().

Greetings,

Andres Freund

#21Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#20)
Re: Direct I/O

On Sun, Apr 9, 2023 at 6:55 AM Andres Freund <andres@anarazel.de> wrote:

Given the frequency of failures on this in the buildfarm, I propose using the
temporary workaround of using wal_level=replica. That avoids the use of the
over-eager log_newpage_range().

Will do.

#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#21)
Re: Direct I/O

Thomas Munro <thomas.munro@gmail.com> writes:

On Sun, Apr 9, 2023 at 6:55 AM Andres Freund <andres@anarazel.de> wrote:

Given the frequency of failures on this in the buildfarm, I propose using the
temporary workaround of using wal_level=replica. That avoids the use of the
over-eager log_newpage_range().

Will do.

Now crake is doing this:

2023-04-08 16:50:03.177 EDT [2023-04-08 16:50:03 EDT 3257645:3] 004_io_direct.pl LOG: statement: select count(*) from t1
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:1] ERROR: invalid page in block 56 of relation base/5/16384
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:2] STATEMENT: select count(*) from t1
2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:4] 004_io_direct.pl ERROR: invalid page in block 56 of relation base/5/16384
2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:5] 004_io_direct.pl STATEMENT: select count(*) from t1
2023-04-08 16:50:03.319 EDT [2023-04-08 16:50:02 EDT 3257591:4] LOG: background worker "parallel worker" (PID 3257646) exited with exit code 1

The fact that the error is happening in a parallel worker seems
interesting ...

(BTW, why are the log lines doubly timestamped?)

regards, tom lane

#23Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#22)
Re: Direct I/O

On Sun, Apr 9, 2023 at 9:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

2023-04-08 16:50:03.177 EDT [2023-04-08 16:50:03 EDT 3257645:3] 004_io_direct.pl LOG: statement: select count(*) from t1
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:1] ERROR: invalid page in block 56 of relation base/5/16384

The fact that the error is happening in a parallel worker seems
interesting ...

That's because it's running with debug_parallel_query=regress. I've
been trying to repro that but no luck... A different kind of failure
also showed up, where it counted the wrong number of tuples:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2023-04-08%2015%3A52%3A03

A paranoid explanation would be that this system is failing to provide
basic I/O coherency, we're writing pages out and not reading them back
in. Or of course there is a dumb bug... but why only here? Can of
course be timing-sensitive and it's interesting that crake suffers
from the "no unpinned buffers available" thing (which should now be
gone) with higher frequency; I'm keen to see if the dodgy-read problem
continues with a similar frequency now.

#24Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#22)
Re: Direct I/O

Hi,

On 2023-04-08 17:10:19 -0400, Tom Lane wrote:

Thomas Munro <thomas.munro@gmail.com> writes:
Now crake is doing this:

2023-04-08 16:50:03.177 EDT [2023-04-08 16:50:03 EDT 3257645:3] 004_io_direct.pl LOG: statement: select count(*) from t1
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:1] ERROR: invalid page in block 56 of relation base/5/16384
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:2] STATEMENT: select count(*) from t1
2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:4] 004_io_direct.pl ERROR: invalid page in block 56 of relation base/5/16384
2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:5] 004_io_direct.pl STATEMENT: select count(*) from t1
2023-04-08 16:50:03.319 EDT [2023-04-08 16:50:02 EDT 3257591:4] LOG: background worker "parallel worker" (PID 3257646) exited with exit code 1

The fact that the error is happening in a parallel worker seems
interesting ...

There were a few prior instances of that error. One that I hadn't seen before
is this:

[11:35:07.190](0.001s) # Failed test 'read back from shared'
# at /home/andrew/bf/root/HEAD/pgsql/src/test/modules/test_misc/t/004_io_direct.pl line 43.
[11:35:07.190](0.000s) # got: '10000'
# expected: '10098'

For one it points to the arguments to is() being switched around, but that's a
sideshow.

(BTW, why are the log lines doubly timestamped?)

It's odd.

It's also odd that it's just crake having the issue. It's just a linux host,
afaics. Andrew, is there any chance you can run that test in isolation and see
whether it reproduces? If so, does the problem vanish, if you comment out the
io_direct= in the test? Curious whether this is actually an O_DIRECT issue, or
whether it's an independent issue exposed by the new test.

I wonder if we should make the test use data checksum - if we continue to see
the wrong query results, the corruption is more likely to be in memory.

Greetings,

Andres Freund

#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#24)
Re: Direct I/O

Andres Freund <andres@anarazel.de> writes:

On 2023-04-08 17:10:19 -0400, Tom Lane wrote:

(BTW, why are the log lines doubly timestamped?)

It's odd.

Oh, I guess that's intentional, because crake has

'log_line_prefix = \'%m [%s %p:%l] %q%a \'',

It's also odd that it's just crake having the issue. It's just a linux host,
afaics.

Indeed. I'm guessing from the compiler version that it's Fedora 37 now
(the lack of such basic information in the meson configuration output
is pretty annoying). I've been trying to repro it here on an F37 box,
with no success, suggesting that it's very timing sensitive. Or maybe
it's inside a VM and that matters?

regards, tom lane

#26Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#25)
Re: Direct I/O

Hi,

On 2023-04-08 17:31:02 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
It's also odd that it's just crake having the issue. It's just a linux host,
afaics.

Indeed. I'm guessing from the compiler version that it's Fedora 37 now

The 15 branch says:

hostname = neoemma
uname -m = x86_64
uname -r = 6.2.8-100.fc36.x86_64
uname -s = Linux
uname -v = #1 SMP PREEMPT_DYNAMIC Wed Mar 22 19:14:19 UTC 2023

So at least the kernel claims to be 36...

(the lack of such basic information in the meson configuration output
is pretty annoying).

Yea, I was thinking yesterday that we should add uname output to meson's
configure (if available). I'm sure we can figure out a reasonably fast windows
command for the version, too.

I've been trying to repro it here on an F37 box, with no success, suggesting
that it's very timing sensitive. Or maybe it's inside a VM and that
matters?

Could also be filesystem specific?

Greetings,

Andres Freund

#27Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#26)
Re: Direct I/O

On 2023-04-08 Sa 17:42, Andres Freund wrote:

Hi,

On 2023-04-08 17:31:02 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
It's also odd that it's just crake having the issue. It's just a linux host,
afaics.

Indeed. I'm guessing from the compiler version that it's Fedora 37 now

The 15 branch says:

hostname = neoemma
uname -m = x86_64
uname -r = 6.2.8-100.fc36.x86_64
uname -s = Linux
uname -v = #1 SMP PREEMPT_DYNAMIC Wed Mar 22 19:14:19 UTC 2023

So at least the kernel claims to be 36...

(the lack of such basic information in the meson configuration output
is pretty annoying).

Yea, I was thinking yesterday that we should add uname output to meson's
configure (if available). I'm sure we can figure out a reasonably fast windows
command for the version, too.

I've been trying to repro it here on an F37 box, with no success, suggesting
that it's very timing sensitive. Or maybe it's inside a VM and that
matters?

Could also be filesystem specific?

I migrated it in February from a VM to a non-virtual instance. Almost
nothing else runs on the machine. The personality info shown on the BF
server is correct.

andrew@neoemma:~ $ cat /etc/fedora-release
Fedora release 36 (Thirty Six)
andrew@neoemma:~ $ uname -a
Linux neoemma 6.2.8-100.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 22
19:14:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
andrew@neoemma:~ $ gcc --version
gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4)
andrew@neoemma:~ $ mount | grep home
/dev/mapper/luks-xxxxxxx on /home type btrfs
(rw,relatime,seclabel,compress=zstd:1,ssd,discard=async,space_cache,subvolid=256,subvol=/home)

I guess it could be btrfs-specific. I'll be somewhat annoyed if I have
to re-init the machine to use something else.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#28Thomas Munro
thomas.munro@gmail.com
In reply to: Andrew Dunstan (#27)
Re: Direct I/O

On Sun, Apr 9, 2023 at 10:08 AM Andrew Dunstan <andrew@dunslane.net> wrote:

btrfs

Aha!

#29Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#24)
Re: Direct I/O

On 2023-04-08 Sa 17:23, Andres Freund wrote:

Hi,

On 2023-04-08 17:10:19 -0400, Tom Lane wrote:

Thomas Munro <thomas.munro@gmail.com> writes:
Now crake is doing this:

2023-04-08 16:50:03.177 EDT [2023-04-08 16:50:03 EDT 3257645:3] 004_io_direct.pl LOG: statement: select count(*) from t1
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:1] ERROR: invalid page in block 56 of relation base/5/16384
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:2] STATEMENT: select count(*) from t1
2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:4] 004_io_direct.pl ERROR: invalid page in block 56 of relation base/5/16384
2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:5] 004_io_direct.pl STATEMENT: select count(*) from t1
2023-04-08 16:50:03.319 EDT [2023-04-08 16:50:02 EDT 3257591:4] LOG: background worker "parallel worker" (PID 3257646) exited with exit code 1

The fact that the error is happening in a parallel worker seems
interesting ...

There were a few prior instances of that error. One that I hadn't seen before
is this:

[11:35:07.190](0.001s) # Failed test 'read back from shared'
# at /home/andrew/bf/root/HEAD/pgsql/src/test/modules/test_misc/t/004_io_direct.pl line 43.
[11:35:07.190](0.000s) # got: '10000'
# expected: '10098'

For one it points to the arguments to is() being switched around, but that's a
sideshow.

It's also odd that it's just crake having the issue. It's just a linux host,
afaics. Andrew, is there any chance you can run that test in isolation and see
whether it reproduces? If so, does the problem vanish, if you comment out the
io_direct= in the test? Curious whether this is actually an O_DIRECT issue, or
whether it's an independent issue exposed by the new test.

I wonder if we should make the test use data checksum - if we continue to see
the wrong query results, the corruption is more likely to be in memory.

I can run the test in isolation, and it gets an error reliably.

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

#30Thomas Munro
thomas.munro@gmail.com
In reply to: Andrew Dunstan (#29)
Re: Direct I/O

On Sun, Apr 9, 2023 at 10:17 AM Andrew Dunstan <andrew@dunslane.net> wrote:

I can run the test in isolation, and it gets an error reliably.

Random idea: it looks like you have compression enabled. What if you
turn it off in the directory where the test runs? Something like
btrfs property set <file> compression ... according to the
intergoogles. (I have never used btrfs before 6 minutes ago but I
can't seem to repro this with basic settings in a loopback btrfs
filesystem).

#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#28)
Re: Direct I/O

Thomas Munro <thomas.munro@gmail.com> writes:

On Sun, Apr 9, 2023 at 10:08 AM Andrew Dunstan <andrew@dunslane.net> wrote:

btrfs

Aha!

Googling finds a lot of suggestions that O_DIRECT doesn't play nice
with btrfs, for example

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg92824.html

It's not clear to me how much of that lore is still current,
but it's disturbing.

regards, tom lane

#32Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#31)
Re: Direct I/O

On Sun, Apr 9, 2023 at 11:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Googling finds a lot of suggestions that O_DIRECT doesn't play nice
with btrfs, for example

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg92824.html

It's not clear to me how much of that lore is still current,
but it's disturbing.

I think that particular thing might relate to modifications of the
user buffer while a write is in progress (breaking btrfs's internal
checksums). I don't think we should ever do that ourselves (not least
because it'd break our own checksums). We lock the page during the
write so no one can do that, and then we sleep in a synchronous
syscall.

Here's something recent. I guess it's probably not relevant (a fault
on our buffer that we recently touched sounds pretty unlikely), but
who knows... (developer lists for file systems are truly terrifying
places to drive through).

https://lore.kernel.org/linux-btrfs/20230315195231.GW10580@twin.jikos.cz/T/

It's odd, though, if it is their bug and not ours: I'd expect our
friends in other databases to have hit all that sort of thing years
ago, since many comparable systems have a direct I/O knob*. What are
we doing differently? Are our multiple processes a factor here,
breaking some coherency logic? Unsurprisingly, having compression on
as Andrew does actually involves buffering anyway[1] despite our
O_DIRECT flag, but maybe that's saying writes are buffered but reads
are still direct (?), which sounds like the sort of initial conditions
that might produce a coherency bug. I dunno.

I gather that btrfs is actually Fedora's default file system (or maybe
it's just "laptops and desktops"[2]?). I wonder if any of the several
green Fedora systems in the 'farm are using btrfs. I wonder if they
are using different mount options (thinking again of compression).

*Probably a good reason to add a more prominent warning that the
feature is developer-only, experimental and not for production use.
I'm thinking a warning at startup or something.

[1]: https://btrfs.readthedocs.io/en/latest/Compression.html
[2]: https://fedoraproject.org/wiki/Changes/BtrfsByDefault

#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#32)
Re: Direct I/O

Thomas Munro <thomas.munro@gmail.com> writes:

It's odd, though, if it is their bug and not ours: I'd expect our
friends in other databases to have hit all that sort of thing years
ago, since many comparable systems have a direct I/O knob*.

Yeah, it seems moderately likely that it's our own bug ... but this
code's all file-system-ignorant, so how? Maybe we are breaking some
POSIX rule that btrfs exploits but others don't?

I gather that btrfs is actually Fedora's default file system (or maybe
it's just "laptops and desktops"[2]?).

I have a ton of Fedora images laying about, and I doubt that any of them
use btrfs, mainly because that's not the default in the "server spin"
which is what I usually install from. It's hard to guess about the
buildfarm, but it wouldn't surprise me that most of them are on xfs.
(If we haven't figured this out pretty shortly, I'm probably going to
put together a btrfs-on-bare-metal machine to try to duplicate crake's
results.)

regards, tom lane

#34Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#32)
Re: Direct I/O

Hi,

On 2023-04-09 13:55:33 +1200, Thomas Munro wrote:

I think that particular thing might relate to modifications of the
user buffer while a write is in progress (breaking btrfs's internal
checksums). I don't think we should ever do that ourselves (not least
because it'd break our own checksums). We lock the page during the
write so no one can do that, and then we sleep in a synchronous
syscall.

Oh, but we actually *do* modify pages while IO is going on. I wonder if you
hit the jack pot here. The content lock doesn't prevent hint bit
writes. That's why we copy the page to temporary memory when computing
checksums.

I think we should modify the test to enable checksums - if the problem goes
away, then it's likely to be related to modifying pages while an O_DIRECT
write is ongoing...

Greetings,

Andres Freund
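
To make the hazard described above concrete, here is a small toy model, not PostgreSQL code. It assumes a checksum is computed over a buffer and that a hint bit is set in the shared buffer before the data is actually transferred. zlib.crc32 is only a stand-in for a real page or filesystem checksum, and the names write_shared, write_snapshot and demo are made up for this sketch.

```python
import zlib

PAGE_SIZE = 8192  # PostgreSQL's default block size


def write_shared(page, touch_hint_bit):
    """Checksum the shared buffer, then hand that same buffer to storage.

    touch_hint_bit() stands in for another backend setting a hint bit
    between the checksum calculation and the data transfer.
    """
    checksum = zlib.crc32(page)
    touch_hint_bit()
    return checksum, bytes(page)  # the bytes that reach storage


def write_snapshot(page, touch_hint_bit):
    """Copy the page first, then checksum and write only the private copy."""
    private = bytes(page)
    checksum = zlib.crc32(private)
    touch_hint_bit()  # still changes the shared page, but not the copy
    return checksum, private


def demo():
    """Return, for each writer, whether the stored bytes match the checksum."""
    results = []
    for writer in (write_shared, write_snapshot):
        page = bytearray(PAGE_SIZE)

        def set_hint_bit(page=page):
            page[100] |= 0x01  # flip one bit in the shared buffer

        checksum, stored = writer(page, set_hint_bit)
        results.append(zlib.crc32(stored) == checksum)
    return tuple(results)
```

In this model the first writer ends up with stored bytes that no longer match the checksum, while the writer that works from a private copy stays consistent. That is the same reason given above for copying the page to temporary memory when computing checksums.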

#35Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#34)
Re: Direct I/O

On Sun, Apr 9, 2023 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-04-09 13:55:33 +1200, Thomas Munro wrote:

I think that particular thing might relate to modifications of the
user buffer while a write is in progress (breaking btrfs's internal
checksums). I don't think we should ever do that ourselves (not least
because it'd break our own checksums). We lock the page during the
write so no one can do that, and then we sleep in a synchronous
syscall.

Oh, but we actually *do* modify pages while IO is going on. I wonder if you
hit the jack pot here. The content lock doesn't prevent hint bit
writes. That's why we copy the page to temporary memory when computing
checksums.

More like the jackpot hit me.

Woo, I can now reproduce this locally on a loop filesystem.
Previously I had missed a step, the parallel worker seems to be
necessary. More soon.

#36Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#19)
Re: Direct I/O

On Sat, Apr 08, 2023 at 11:08:16AM -0700, Andres Freund wrote:

On 2023-04-07 23:04:08 -0700, Andres Freund wrote:

There were some failures in CI (e.g. [1]) (and perhaps also bf, didn't yet
check), about "no unpinned buffers available". I was worried for a moment
that this could actually be related to the bulk extension patch.

But it looks like it's older - and not caused by direct_io support (except by
way of the test existing). I reproduced the issue locally by setting s_b even
lower, to 16 and made the ERROR a PANIC.

[backtrace]

I get an ERROR, not a PANIC:

$ git rev-parse HEAD
2e57ffe12f6b5c1498f29cb7c0d9e17c797d9da6
$ git diff -U0
diff --git a/src/test/modules/test_misc/t/004_io_direct.pl b/src/test/modules/test_misc/t/004_io_direct.pl
index f5bf0b1..8f0241b 100644
--- a/src/test/modules/test_misc/t/004_io_direct.pl
+++ b/src/test/modules/test_misc/t/004_io_direct.pl
@@ -25 +25 @@ io_direct = 'data,wal,wal_init'
-shared_buffers = '256kB' # tiny to force I/O
+shared_buffers = 16
$ ./configure -C --enable-debug --enable-cassert --enable-depend --enable-tap-tests --with-tcl --with-python --with-perl
$ make -C src/test/modules/test_misc check PROVE_TESTS=t/004_io_direct.pl
# +++ tap check in src/test/modules/test_misc +++
t/004_io_direct.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
No subtests run 

Test Summary Report
-------------------
t/004_io_direct.pl (Wstat: 7424 Tests: 0 Failed: 0)
Non-zero exit status: 29
Parse errors: No plan found in TAP output
Files=1, Tests=0, 1 wallclock secs ( 0.01 usr 0.00 sys + 0.41 cusr 0.14 csys = 0.56 CPU)
Result: FAIL
make: *** [../../../../src/makefiles/pgxs.mk:460: check] Error 1
$ grep pinned src/test/modules/test_misc/tmp_check/log/*
src/test/modules/test_misc/tmp_check/log/004_io_direct_main.log:2023-04-08 21:12:46.781 PDT [929628] 004_io_direct.pl ERROR: no unpinned buffers available
src/test/modules/test_misc/tmp_check/log/regress_log_004_io_direct:error running SQL: 'psql:<stdin>:1: ERROR: no unpinned buffers available'

No good reason to PANIC there, so the path to PANIC may be fixable.

If you look at log_newpage_range(), it's not surprising that we get this error
- it pins up to 32 buffers at once.

Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
c6b92041d385.

Do we care about fixing this in the backbranches? Probably not, given there
haven't been user complaints?

I would not. This is only going to come up where the user goes out of the way
to use near-minimum shared_buffers.

Here's a quick prototype of this approach.

This looks fine. I'm not enthusiastic about incurring post-startup cycles to
cater to allocating less than 512k*max_connections of shared buffers, but I
expect the cycles in question are negligible here.
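
As a rough illustration of why pinning a batch of up to 32 buffers fails when only 16 exist, here is a toy model. It is not PostgreSQL's buffer manager, and the clamp shown at the end is only an assumption for illustration, not the prototype discussed above. ToyBufferPool, NoUnpinnedBuffers and process_range are hypothetical names.

```python
class NoUnpinnedBuffers(Exception):
    """Toy stand-in for the 'no unpinned buffers available' error."""


class ToyBufferPool:
    """A pool that only counts pins; real buffers carry far more state."""

    def __init__(self, nbuffers):
        self.nbuffers = nbuffers
        self.pinned = 0

    def pin(self):
        if self.pinned >= self.nbuffers:
            raise NoUnpinnedBuffers("no unpinned buffers available")
        self.pinned += 1

    def unpin_all(self):
        self.pinned = 0


def process_range(pool, nblocks, batch_size):
    """Pin up to batch_size buffers at a time, then release them together."""
    done = 0
    while done < nblocks:
        batch = min(batch_size, nblocks - done)
        for _ in range(batch):
            pool.pin()
        pool.unpin_all()
        done += batch
    return done


pool = ToyBufferPool(16)  # like shared_buffers = 16
try:
    process_range(pool, 100, 32)  # batches of 32 cannot fit in 16 buffers
except NoUnpinnedBuffers as err:
    print("ERROR:", err)

pool.unpin_all()
# Clamping the batch to the pool size lets the same work complete.
print(process_range(pool, 100, min(32, pool.nbuffers)))
```

The point of the sketch is only the arithmetic: any caller that holds more pins at once than the pool has buffers must fail, however the pool chooses victims.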

#37Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#35)
Re: Direct I/O

Indeed, I can't reproduce this with (our) checksums on. I also can't
reproduce it with O_DIRECT off. I also can't reproduce it if I use
"mkdir pgdata && chattr +C pgdata && initdb -D pgdata" to have a
pgdata directory with copy-on-write and (their) checksums disabled.
But it reproduces quite easily with COW on (default behaviour) with
io_direct=data, debug_parallel_query=debug, create table as ...;
update ...; select count(*) ...; from that test.

Unfortunately my mental model of btrfs is extremely limited, basically
just "something a bit like ZFS". FWIW I've been casually following
along with OpenZFS's ongoing O_DIRECT project, and I know that the
plan there is to make a temporary stable copy if checksums and other
features are on (a bit like PostgreSQL does for the same reason, as
you reminded us). Time will tell how that works out but it *seems*
like all available modes would therefore work correctly for us, with
different tradeoffs (ie if you want the fastest zero-copy I/O, don't
use checksums, compression, etc).

Here, btrfs seems to be taking a different path that I can't quite
make out... I see no warning/error about a checksum failure like [1],
and we apparently managed to read something other than a mix of the
old and new page contents (which, based on your hypothesis, should
just leave it indeterminate whether the hint bit changes were captured
or not, and the rest of the page should be stable, right). It's like
the page time-travelled or got scrambled in some other way, but it
didn't tell us? I'll try to dig further...

[1]: https://archive.kernel.org/oldwiki/btrfs.wiki.kernel.org/index.php/Gotchas.html#Direct_IO_and_CRCs

#38Andrew Dunstan
andrew@dunslane.net
In reply to: Thomas Munro (#30)
Re: Direct I/O

On 2023-04-08 Sa 18:50, Thomas Munro wrote:

On Sun, Apr 9, 2023 at 10:17 AM Andrew Dunstan <andrew@dunslane.net> wrote:

I can run the test in isolation, and it gets an error reliably.

Random idea: it looks like you have compression enabled. What if you
turn it off in the directory where the test runs? Something like
btrfs property set <file> compression ... according to the
intergoogles. (I have never used btrfs before 6 minutes ago but I
can't seem to repro this with basic settings in a loopback btrfs
filesystem).

Didn't seem to make any difference.

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

#39Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#37)
2 attachment(s)
Re: Direct I/O

On Sun, Apr 9, 2023 at 4:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Here, btrfs seems to be taking a different path that I can't quite
make out... I see no warning/error about a checksum failure like [1],
and we apparently managed to read something other than a mix of the
old and new page contents (which, based on your hypothesis, should
just leave it indeterminate whether the hint bit changes were captured
or not, and the rest of the page should be stable, right). It's like
the page time-travelled or got scrambled in some other way, but it
didn't tell us? I'll try to dig further...

I think there are two separate bad phenomena.

1. A concurrent modification of the user space buffer while writing
breaks the checksum so you can't read the data back in. I can
reproduce that with a stand-alone program, attached. The "verifier"
process occasionally reports EIO while reading, unless you comment out
the "scribbler" process's active line. The system log/dmesg gets some
warnings.

2. The crake-style failure doesn't involve any reported checksum
failures or errors, and I'm not sure if another process is even
involved. I attach a complete syscall trace of a repro session. (I
tried to get strace to dump 8192 byte strings, but then it doesn't
repro, so we have only the start of the data transferred for each
page.) Working back from the error message,

ERROR: invalid page in block 78 of relation base/5/16384,

we have a page at offset 638976, and we can find all system calls that
touched that offset:

[pid 26031] 23:26:48.521123 pwritev(50,
[{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=8192}], 1, 638976) = 8192

[pid 26040] 23:26:48.568975 pwrite64(5,
"\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4
\0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192

[pid 26040] 23:26:48.593157 pread64(6,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
8192, 638976) = 8192
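
The offset follows from the block number in the error message: block 78 times the default 8192-byte block size is 638976. A small sketch of that arithmetic, plus a filter for trace lines touching a given offset, might look like this. It assumes default BLCKSZ and segment size and one system call per line in the log; block_offset and calls_at_offset are hypothetical helper names.

```python
import re

BLCKSZ = 8192        # default PostgreSQL block size
RELSEG_SIZE = 131072  # blocks per 1GB segment file with default settings


def block_offset(block, blcksz=BLCKSZ, relseg_size=RELSEG_SIZE):
    """Byte offset of a block within the segment file that contains it."""
    return (block % relseg_size) * blcksz


def calls_at_offset(trace_lines, offset):
    """Return positional read/write calls whose last argument is offset."""
    pattern = re.compile(
        r"\b(pread64|pwrite64|preadv|pwritev)\(.*,\s*%d\)\s*=" % offset
    )
    return [line for line in trace_lines if pattern.search(line)]


sample = [
    '[pid 26040] 23:26:48.568975 pwrite64(5, "\\0\\0"..., 8192, 638976) = 8192',
    '[pid 26040] 23:26:48.570000 pwrite64(5, "\\0\\0"..., 8192, 647168) = 8192',
    '[pid 26040] 23:26:48.593157 pread64(6, "\\0\\0"..., 8192, 638976) = 8192',
]
for line in calls_at_offset(sample, block_offset(78)):
    print(line)
```

The sample lines are shortened stand-ins for trace output, with the middle one pointing at the next block so that the filter has something to exclude.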

In between the write of non-zeros and the read of zeros, nothing seems
to happen that could justify that, that I can grok, but perhaps
someone else will see something that I'm missing. We pretty much just
have the parallel worker scanning the table, and writing stuff out as
it does it. This was obtained with:

strace -f --absolute-timestamps=time,us ~/install/bin/postgres -D
pgdata -c io_direct=data -c shared_buffers=256kB -c wal_level=minimal
-c max_wal_senders=0 2>&1 | tee trace.log

The repro is just:

set debug_parallel_query=regress;
drop table if exists t;
create table t as select generate_series(1, 10000);
update t set generate_series = 1;
select count(*) from t;

Occasionally it fails in a different way: after create table t, later
references to t can't find it in the catalogs but there is no invalid
page error. Perhaps the freaky zeros are happening one 4k page at a
time but perhaps if you get two in a row it might look like an empty
catalog page and pass validation.

Attachments:

repro-strace.log.gz (application/gzip)
�i��CA�%hK�	�m�z(�������8,��t#*A�%hK�M�%N��Px	�R{x	�R{x	�R{x�����A�%N�� ���%^�t��!�mi^-��%�^�Z�!��K���7����=� ����l�Z1��KRX�\/A�|�-���@��V��(G��n�����8��y� ���d+�f��Gx�KA5�����w'��K\���m%��9��8C��A�%�\��t��FA�%iV�m%��9��:o$^b����Kd�Q�%��z��t#��Q�%�|F��$���V����%^bB/��KL��|�x�	�l%��N[�x����Q�%��ZF��$gH�V�-9����n���N�x�U
=W��*C��n���Q�%��y�t�G7J���O�(����>J���<��/I�������/�t�9J�����(��t�;�J��H�+�F~���K,��/�t.3J�����(��t�+�J������Kl�,(J�����(��tH �J��H��n[)J��RL�(�Kq*��KZ�E��K���/i���/�Eo�X�%�x�mH�%�b9D����6k!�G����K��=��x	.d+�f��4J���.J�E&[I7�1%^�\fAQ�%��OF���cs=�x�/$^� �q��K���a��K�_D��8���/q!s�(���H����>�����D��8��%^�"����V�'�hh��x	44nJ���t�x	�|�"J��C�\A7P4nJ�(�N�x	���(��y�2J�4�S%^F�����(�04��x	�^s�x	�^s�x	��(��3�Q�%8���I��;/�~G�%��G���;E�����H��K��F�x	�o�(���4Q�%@g}��K *_I7b�Q�%3kK�(���f^���j������lM�����]�^����������]�*[N��Gg����������v���~���_l�����W�����U�>�����x�����������5_������������������������jp|z5<]\��{^
�<{�������a~�J���w���jx>:���o]��b�Ge����������O���������F>�_]<:=���=�~x5|tvqtx6|4<?�����^n��n>��}tyq=zw5��R�����WL�S+&�s�*���t4��15�O_6�K��������|��������a�����}5`��F����')��������O�������`��:������������_����T��~�<������R��<��^_�8�r}z�f���v�5���lm��}��������x�����������/��������o��'�N�~��O��x~py8z�q��O�W��>�?:�>:~4I��?R	5����H��u��C�3�J��e�H9��������Y��5
��T���E�n��4j?P^�����������Q�O��#Ug���c6g�}��h����t�\iVX��PA��J�'�o��`����f�J%�:lI���{5���]�\�aF��A;���R�u������zncJ����(�1I��������|�cT,ZER����prF{�	����m6f��,9L�uG�c
}c����c�kz��x� _��^	X�h���%�t�/z����o�Qb>��.c#�|�9L�u1�P@�>0V��F�A�^���^Fs�ufT�u���LYG)��f�HO�M�G�D��H�����-��5���]<c:��HV��.*��ul,�v�y�v���K�N��}"x�����/�����K���(H
sl�<�s0E���������*+)W�e���=Vo�\�r����3F��"��vGc���	��������,Ru�3��^�Ac5�T
�U-<@�������Q[�`�JOTO$:����0���L�������m
,���<��i�]���r"�������%���T����U!�t��+FFU�a����B�����3c����w�^:x���htSZ���.��D%	��a�m+���$�"�����}-��v���D#&��i�����uV ��k%���@1����c�IL	��\��!<6d�X\#T��a��K����b��*�����BFb(P�b��(c��"��aOt���
`�������
�,"��������i[|��xM�������x\??�be�����F7T+�������b���h��q�G|�<��e=^G�sB���3���8�s0E��C��%�wlbSD�O~����k������F]����gt�Y�bv	M����T����Q�b�^����_�*�B��zVB������9W����vtkB�Jx.*DP�>���_����e�
m����l�E���R(�2L���!L�uI4}���;��4m�c@[�C�O_;��	�P���������%�?c�'4�_.gE�����K�w#:��o����c�Y���jJ�0��O�������(��U�p��@is��|l���K��P@�nwM�r��&�����Rz����)��9�������������u�|���
@mUIEQg*���$"5� ���X��K�JA���r��5�Jd Z����o�pN���c�{
���z��R�Ot�P.��1�������������f�e������G��|~��+@!@�u��Q�{�����1�#W%C ��w1vL��nH����LU2I9Ww�cm����?n�E��!���{Gs�ok����O�	���)��
\���Y�1*e�0E��'�.
r�'�Ac�R&���?�M	u�1O
y��+�t	-LR7���oJE����Dm���Zi@Dg'���cLQ��t1X�Lt����;��`��}>�B!��^Xc�����.�A�Jh�!�e����#h�m?r����PB�0o�;�����.�����JA����KB���)�yy��K�I>���x��cl�q~\��.nh�EY�(�"
�we0�^�/
����TFF;��!Q��x�ow�Q�h�r�1�]�?���iP)�~�	x���LR:�5dSt] �B�����T� ��k�����R#x������.���%�8��x�;�"erQ��.�:�����O�(�q��W����T�x�S���ak1&1��\vwvnFwI������^����������E�zR}��t}o#�.a��}�����)?�v�$oyD	�Bf����O��.�EGVy�A+,P=�_��f���onek;������I6�\|��>�>;�%��<�������|��rnF�\W�0�u����w�T�{v^���� ����TT|�o��p39�O������������l��@h^<0��IL�����XM�2����%�FI1����*qw�"W]u�YJ&Ub������\�[ F�,�f���� k=rN6A�W��D���`(G��b ��Z�k��p���I�����,��(��L�Ra��QaZ��H�B�����e8���s�?�%����:�K��7RCH���b�n��t�y��K)=L���-s���je99�����Jbo����05��@ heSS�`��C%|5��:�����
�b�)_�6�!�,Ka`
�X��=���zF�Nz�e��&J��X�t�$	�Hw
�_K�r�V�
�0������S�@���q�l���X�Q����vG-uC�����|��xJ��Xm���
*@q�,C��to�)�jqc�Q���M�@d�L��M��f��O5�A6�����_�x3�jE�\9��������a3�B_��]I�R���fx�U��!H���~�9�T�0������?����I
�~�&��N�T�Fe��&��<J,L=N�z�s�7��3�+�0�uZU���P�u��i*��j_*�E�����z�R�S����eoZ���qx0�������xrf�DG��f?f�DuoS:�J0~�n)�O~~xsz�p��f��z����_q�2����y����Bf��������_�Z���B���
�*��n�����������/o>���q����b}�����|��W�{��������v��������v���7�$��.{�rk/�}��������P0�[�c��A���N������5E���lQ]�?T|@�l��!��..?��O���g�;/��Os�^�l�m��^�o�>[��x�Qg'�|�'�X)��L�<��������E�a��s�O�������6��~~�C#
P=�.$3{���[{;����tr}}�i
���/7��Ev�uAZe+�2(
�|5x5����~�9���z4��~�_����v7v���G��-��h�_��
�E�
s�>F��z2���\����#��;���nt�{����������a������p��tv�X'r��6W
9,���%�o��.)��e�Y,�f=��������������������G��=9?�p�������g����X���|�[�K�x�<{��fz}x<��l�1���9���^ �����L���^?�q~t�>����=�*"YO�E��� ���}3�\_��L��n�ke�&5�����E-�
�s~��9@q�5�*A�$i;?�x�[�\��(j	LN�LL�>ci�-����D&Sq�����&.w�I�����C������P4�U*���%�������jK;��+��,F���:x(������0+����I�>N�)}�,�Z�Sh�J����iy�����lO��_���Y���k��G�A+j��K�<F��#L�Z�t�>�D��BG�xk�k�,�K0�^��
h)�1|"����)�������`&���G���s'����j��u�-�m������3�<ZQ ��>���f(�O���t�����ia�����^n����'�"���FJ@U�A?�O't�Qm��J��P���w<=wGh6z����S�8@6��������N�����.�#(6(%2�XS���U1p3�������lUv5��d�W��gW����G��9<8�}>A���]����?����9W�w�����{������3��� ����d�e�?��'�����D����Y�Z.#j%N��r�����;��9z�^`��j6������� �]1�A��A�l?�X�{�u�*,+<^P��_�x85_����`2��  s�fc\�
PF�ge�Y��2s����j���?�
��D�U-�T�v*9�m���c�
�8�����rb�	���YbH"�#����<���]h�$����>T���:�6T5��T�:�7��`�B�!/yc���=����_%9�����M��mo����~�e[&��8��� '�����U�����om�?n�|��)�e{������.S��bU����B�*����%m j��a �dvt�qm�g>�3�f��L���E�gei}�9�K��6���v���7w��������?Nn�`����{����wp�����7���; 	�������W_}�07^l��=��'�8��R�2|/]�F�j�0)L�5C����z&�[��_�����d�!���u����o���/:����}&y��S)�`�h���p2-t����Xe��7�kZ����������>���O.G_�������6Jv�*���Y|��je��������I�����Y�_N0��KW�eW*+�D��.Pp��-���_��LSq\Ztg����3X&��'��,��������L��[O�������6���)VgB�}{g�������������/^��E�Wc��h \�qgp����e����?;���=���:%����U�e��gE����Y*���Oj�1���U���
�F�SQ\�u���>����f��'�S���0�d-��W��^�bdi���H��!q
RE��I]���	5�M����L����tTi9���������x�Zf|��:���K&��>|Ax[8�,���GS�����R��H��PJ�5:�h>�o`y��:$L����f�x_�I��i����rU�\�e�����U�\���ve�7����L����jeB��kotbMK�� ��y�����h����_N'�A'A���)vI��9���h$I�Y��p�n���.�7(7�To�z5���X����Y�����,/�������E�SA�v5�V}�%)e��8���tB}����TEW���i��G,o�O�!km����� ���X����dN&�����R�z:�L8��<��;���0C{ba|_r�Q\n�/z�B�Td��#x[����G��z#�2��`(�����|�(� 	���b�h��<O:�����������f@|O�b��n8�fX���Jqp���0�R�nW����r��(&��wF�d��A�����6S��a���o�F�lO�����|�����r����
��|��QFy�U���c�vn�sP�Y2��ZRDSQ^Oj<��������%�g��}VP�|�.�����Ju����R����H�K�XP��%�J�������(��>����\Qq�bV������@�����7a���t{q{3�n�k�9)We
�4������&���d����S��33"�Y2s��R'����]������a^Ni������?
��u����0RU,������~����v�TP�����,:H}=�/.�X���=�V������,���,���u����l���Z�$h��.��$��`��������~�yF�*��K��?�	���@��e�����i�f���VF^��e�{�&u��0�f�(��@z�����J����:������qbt�V&�$+�����}�YUEXT��Qr'���9k���l~r~x��W�[���{�RYp�.t�YQ���U�C��./��.O'�g��,��QZ��k���>]_]�5�l�����9P`mv��T����4�|�`�������.���pu�������U��������8"I3��A��t�����$A^e�-T���;�����
��4�d����E/�����"���`�E;�N�
�����4+8��8u�\P�!�A���nK��YS�SS��\��D7�F���v:�\T��?9��@_���xs�z���Be����B`0�=P5�b���t�^g��q}������E���&2N���o/����ht*����_T[���m�%+��4�������YO����@Dk��"�^�]��������y�F��w�K�%�����(���G��F%J^���?WLS�_*)��-����`�L�FwK�8�l���6C�,���M����H�\p���gw���T�����L�E�����up�����A�'+���G�B���O}!f�Bb��!bab	��E�#�*��QR=
���#�G���XFtq��Q��a^�eY)��������Q���4�:�}��~���������J:����/��P0
�]�����f��5�h@]�m�
7���'5��?-�Jo���I��(v�H��q�����i�M�1N�Ym`C�e�Y�H������q�`�����=(T(��[�`�g�X`��Qf��y���M����A"`V���	JM;��mYEwMy��.Y}���u�D�����2�.5c������rsU�e��EI3vg�6���g��%���ho�_A����h���E���m���6�_�Pa���#�h�
�S��0�*Fu��R
j�,�`��f,&�X>��K���	�m��=6�������E��b�]C�B��j;�� V��4�k�H���'z�tV�?���`�'�\!`��n���=0NMdX��h
�iuxKSr��uvV�w(�1l]������Bg�I��J�t�M�Oq$�������7dt����*���j`�{��$/uAd5�
5�U�!kQ2"*@�:���~<
����ZE����!���-��b��l	�~�)����$�������Opc��<J�><TR�#���0u�J�^��q}�rGLC9�y7Cc��k���l��Y?�^a�����7�����F�Y���h���@��w�BK�@�������
jBV��JX����>cy��H�j1?��s�&�,���YM��F�����-�i���T�XPxC��x�����Q���u�"0�����>14m���1Q�x*Fmq�����k�X��X%?��@����O%?H�Q�^�5:��n�y2U)��+�2.��T`���A�r3��MoP���f��9_k����%���Q��f��
�r����w�w�"Q�5�1a��i57����%/������X���,v�����u�~m���������Z���{�i|��"���g{�-AC�2��p���`;F���q��l'[]]8�a�i�R�{�%N�uKxG�'�������'�w��A{;I�B"����7����wu5���~^���yxzu1y8�����zxvy3=<?xszx=y�)/�t����g�^]N����������z"'��|���������'rb�y�;��B���	��n��������.O�>�^b}�Ss����:�e�N�^|y������������_L:�Y	Uzs�{{q1��_�:��:�L'0���=|�z�7�y(������������n|�����f3�du0�]^M���ls}k�]#'�7�~�`��fmJo>�:� z�������'�dO�U�'7���m�^��m@��L��a�����G���E'�F����n�?�t�YFRk���J@�%���`�����	Y�V�w?��[��6{O�����K9H��Jk��N���h�U�r�r)6|�Ta^0�,��d)�d����!K!hW�+	J���6��.8c��b_��?�,/������g�0P{��bm,�����7���1�ZXZ$��geq���>;o��b0H�x��Q9��~%��{��=K��FI�v$����p&�nI�����^��{�����Br>�v1��\��;����&c���s�O�{<��7�l������$������FLW�Z��{�jL���MU��r�x_s��z�WZ��W��� Z�%�x-;T�v�?^�*��(��g~��O������/k�Q�\Z.<��9/��O�@3�V1�g?���]�`���:k�F����k�U�5N��?���e%��gX6zC�(�\��{l��q���� 5���
zX9��w�#��^1���'�y�[���t���a�Z���k��0��h?��@Ja�x9F�1���1��W�+e^5�GM)��g�.wF����3�d�Z�*�~n>�O+a��������)D!���ZWq^����E*j����T	^�,]`�z�3HEm@�V���5���)���W;��� L�������<�	��'�u��E�lK��P� ��������	jtq������#T���|����%��
&��X��:�����~������,C��UQa��nW�G�����v��~�W�b���#�{�c2S���6����
x�V2��0Ts��q�!�T�@���s5t�;�@|F��k�K�}��4�����R�^���$��2����K�����H�ZAH14���r(�@Ys�(��`��9mtx����~/����-�����TL�����b^����c�\|��Z�������vo�<��*�|����5D��Hd�#���-1]����f�;����H�x�-���7���Y ��a��w,'��4i���.X�!F��Gn�������RS�U�6�#�lX����&��/�J!��yeV`��+I�D'��1������5������b�b�Ok�5����E���v�U����2�����cGI|�|. C�J�Y��-��^%�=����y�
TN�0�G�y�x)����;Mq~��1<�T!y�s��g�Kiw1d���F����m\��7};��2D����o���y��H�
(�2�Q5�eV^$���g���}�)�������{�x:���	�6���m3^m��ru�`�6���������X�\s�X���
0>��--��������E�4��S�9�U���2�?f�JK�Me`N��b��}u��K�rU�p�����12�LDF�;�>�����1JG����R!��-��
�!��-]�V�A�#{��QfX��yq�0 ��a�,n�&�Gx��YI������A��(
c��6aP����K��T�������� �DoY'�&���-t�H�d��A�K�O� ���(sa���6���'�cq�(se��%����=�g)�l���P�*�/`�$���F�r
�q�HP�;KXOb�'X��D!��m���#$i����C���5�Pm�l������	�w�N����Z�������0�V�4T��{G{�l���IJ�.�k�{B�6zj���G�c�&��������)E����<N�r^��8�	�x4T�(.=���m2;Z��1FVN`jeh��5����D�)���=��m	���D���-m��T;)�����d���F:#�bY�������
���K`������>�;'�������]Op�(�(X�����������f��q�)f�xd���Y��P�{��pw�jK���G�*���8S}!�t��d����i����*�u�N������%��r��d��w�Z�f$�c�$�}��O5P���A�����H��o	���#iK=��5DM_�a��/���p��O5�
��x������|���A��G�aVAu���#��_�Xc�z�����v����N���1�� K�������%�w������0���c;����Z>��m���3v�$
v)�G�z�m#�h�F��.�T��p����]����
|�)m���
����m8��n�C@���K�F)q�e��CO�f�j�5k,D���!D%xV����_�3f�k��������`!�ko�jM�����v��������X� ��[���Bc�2vN�mW@����V��^�Y�"b���r���D�n����}����D��I���h�����2l'm�������p�1����
����z�e�����y�,�S�EB#f�����~�t}u|qx�q
�������������������6�?��v�������e}��}���}�_�n|�]����^���r;�}��������P�����&�d�bG�`|jj��0U���f�1��"[y(����^�=S3�G���n�����;��eg�?����?L�L������L�������"w���$�~wo}o����Lq39�O����������N��.���K�w��V�u��$#���T��%[��xw����:� "������[4�]���#JP5�"�������t�7��������~y�h����9v��7^�n��/�m�x��[�
+P�����.
K���9*[Ox^���������r�W|�#������%�����Ng��BI�
�lX��Ij�oCcH�������ioL+��gp^����"3	��������aJ,Yo�z:o�F��Z�M�m}�>}�?H�(��S)4K�Kt�2���Wv�R�X��jl���Z��2]�V����
H�G�k���~��Y����P����h0�T���.w��F��d�R*!�����p�4�d�0��u *9�=�*5�t��:�N
k��% ����j�b6�^�f\%u�mgz�Y<�u���������P��I�]�6;A�t�����L���&GC��*��U�`�s�C#��f��0���H�4���.��A��C���156����\>o$���D6��D�T��b�����a�L������Z�@��;���Lz@������lm�r:����4�E��r3�\d����_���uD������A�4�4��I��%�A;���e,��<?��������b�^:�� �������a=<�S��J�������Q����G���n���J	�����"�h��Z|�J��M�j���T��O-}�$���2�j�g0Q��*�l�b�	��r�Y�y"�RS]aq��t��U})����/{�.�O
<s��D������������5�?���i�����/�x�?����o|���>�:�����=y��i��7�#������7�{O�������<�I�j��f�����n�?=�=�����R�m�s��5�4V���[F�,�&IF:p0la���a�P
B���@������:����/�?�1{�����j�yG����������uiWs��?�jv��T�0�2N���dbJN���^��AS������we���-�Gl�����o�8=:<���������uv���5`dr>��^����4��/*�?g�r��
�X����8��kiR��@b��������>Y5��.�a��W��<�M�����=�@����y��B�7��.�5����1��#IrE;���k2���M/7������B�������g�!�7,O�`i�y���u��F�������Ln�n��'���������_@�?�������|�e���$�[�9\JT��b�"����"m�u�}�`j�k�
fU��V4�;�"�fm���0�����3��nv��b���Z����O	e�j�ny�����!i�IV���SZ �}@�Z��	:���� i�FV����T����H��tZ{���[L6*�xP��d�C�^P'��F-��0BK��q�;(F�,���+�����%��b���5����}8S����5$
�C��
G��:�`F9Yj(XV5�{�PA86!Xf�+M��C���Oq$�
Attachment: modify-while-writing.c (text/x-csrc; charset=US-ASCII)
#40Thomas Munro
thomas.munro@gmail.com
In reply to: Andrew Dunstan (#38)
Re: Direct I/O

On Sun, Apr 9, 2023 at 11:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:

Didn't seem to make any difference.

Thanks for testing. I think it's COW (and I think that implies also
checksums?) that needs to be turned off, at least based on experiments
here.

#41Andrew Dunstan
andrew@dunslane.net
In reply to: Thomas Munro (#40)
Re: Direct I/O

On 2023-04-09 Su 08:39, Thomas Munro wrote:

On Sun, Apr 9, 2023 at 11:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:

Didn't seem to make any difference.

Thanks for testing. I think it's COW (and I think that implies also
checksums?) that needs to be turned off, at least based on experiments
here.

Googling agrees with you about checksums.  But I'm still wondering if we
shouldn't disable COW for the build directory etc. It is suggested at [1]:

Recommend to set nodatacow – turn cow off – for the files that
require fast IO and tend to get very big and get alot of random
writes: such VMDK (vm disks) files and the like.

I'll give it a whirl.

cheers

andrew

[1]: http://www.infotinks.com/btrfs-disabling-cow-file-directory-nodatacow/

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#42Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#41)
Re: Direct I/O

On 2023-04-09 Su 09:14, Andrew Dunstan wrote:

On 2023-04-09 Su 08:39, Thomas Munro wrote:

On Sun, Apr 9, 2023 at 11:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:

Didn't seem to make any difference.

Thanks for testing. I think it's COW (and I think that implies also
checksums?) that needs to be turned off, at least based on experiments
here.

Googling agrees with you about checksums.  But I'm still wondering if
we shouldn't disable COW for the build directory etc. It is suggested
at [1]:

Recommend to set nodatacow – turn cow off – for the files that
require fast IO and tend to get very big and get alot of random
writes: such VMDK (vm disks) files and the like.

I'll give it a whirl.

With COW disabled, I can no longer generate a failure with the test.
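
For anyone wanting to try the same thing, here is a minimal sketch of turning COW off for a build directory. It assumes a btrfs mount and the standard chattr utility; the helper names and the example path are made up for illustration. As far as I know the C attribute only takes effect for files created after it is set on the directory, so it is best applied to an empty directory before building.

```python
import subprocess

def nocow_command(directory):
    """Build the chattr invocation that marks a directory No_COW (+C)."""
    return ["chattr", "+C", directory]

def disable_cow(directory):
    """Run chattr +C; raises CalledProcessError if the filesystem refuses."""
    subprocess.run(nocow_command(directory), check=True)

if __name__ == "__main__":
    # Hypothetical path: an empty directory on the btrfs mount used for the build.
    print(" ".join(nocow_command("/mnt/tmp/build")))
```

Only nocow_command() is exercised when the file is run directly; disable_cow() needs a filesystem that understands the attribute.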

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#39)
Re: Direct I/O

Thomas Munro <thomas.munro@gmail.com> writes:

we have a page at offset 638976, and we can find all system calls that
touched that offset:

[pid 26031] 23:26:48.521123 pwritev(50,
[{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=8192}], 1, 638976) = 8192

[pid 26040] 23:26:48.568975 pwrite64(5,
"\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4
\0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192

[pid 26040] 23:26:48.593157 pread64(6,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
8192, 638976) = 8192
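
Finding every call that touched one offset can be done mechanically. The following is a rough sketch, assuming the log was produced by something like strace -f -tt so that each entry starts with "[pid N] time"; the regular expression and function name are my own, and unfinished or resumed calls are not handled.

```python
import re

# Matches positional read/write syscalls as printed by strace -f -tt; the
# last argument before ")" is the file offset. Entries may span lines.
CALL = re.compile(
    r"\[pid\s+(\d+)\]\s+(\S+)\s+(p(?:read|write)(?:64|v))"
    r"\((\d+),.*?,\s*(\d+)\)\s*=\s*(-?\d+)",
    re.DOTALL,
)

def calls_at_offset(log_text, offset):
    """Return (pid, time, syscall, fd, result) for calls at the given offset."""
    hits = []
    for pid, when, name, fd, off, result in CALL.findall(log_text):
        if int(off) == offset:
            hits.append((int(pid), when, name, int(fd), int(result)))
    return hits

# Shortened copies of the three entries quoted above.
SAMPLE = r'''
[pid 26031] 23:26:48.521123 pwritev(50,
[{iov_base="\0\0\0\0"..., iov_len=8192}], 1, 638976) = 8192
[pid 26040] 23:26:48.568975 pwrite64(5,
"\0\0\0\0\0Nj\1"..., 8192, 638976) = 8192
[pid 26040] 23:26:48.593157 pread64(6,
"\0\0\0\0"..., 8192, 638976) = 8192
'''

for hit in calls_at_offset(SAMPLE, 638976):
    print(hit)
```

The fd column makes it easy to spot that the pwrite64 and pread64 from pid 26040 use different descriptors.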

Boy, it's hard to look at that trace and not call it a filesystem bug.
Given the apparent dependency on COW, I wonder if this has something
to do with getting confused about which copy is current?

Another thing that struck me is that the two calls from pid 26040
are issued on different FDs. I checked the strace log and verified
that these do both refer to "base/5/16384". It looks like there was
a cache flush at about 23:26:48.575023 that caused 26040 to close
and later reopen all its database relation FDs. Maybe that is
somehow contributing to the filesystem's confusion? And more to the
point, could that explain why other O_DIRECT users aren't up in arms
over this bug? Maybe they don't switch FDs as readily as we do.

regards, tom lane

#44Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#36)
Re: Direct I/O

Hi,

On 2023-04-08 21:29:54 -0700, Noah Misch wrote:

On Sat, Apr 08, 2023 at 11:08:16AM -0700, Andres Freund wrote:

On 2023-04-07 23:04:08 -0700, Andres Freund wrote:

There were some failures in CI (e.g. [1], and perhaps also bf, didn't yet
check) about "no unpinned buffers available". I was worried for a moment
that this could actually be related to the bulk extension patch.

But it looks like it's older - and not caused by direct_io support (except by
way of the test existing). I reproduced the issue locally by setting s_b even
lower, to 16 and made the ERROR a PANIC.

[backtrace]

I get an ERROR, not a PANIC:

What I meant is that I changed the code to use PANIC, to make it easier to get
a backtrace.

If you look at log_newpage_range(), it's not surprising that we get this error
- it pins up to 32 buffers at once.

Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
c6b92041d385.

Do we care about fixing this in the backbranches? Probably not, given there
haven't been user complaints?

I would not. This is only going to come up where the user goes out of the way
to use near-minimum shared_buffers.

It's not *just* that scenario. With a few concurrent connections you can get
into problematic territory even with halfway reasonable shared buffers.

Here's a quick prototype of this approach.

This looks fine. I'm not enthusiastic about incurring post-startup cycles to
cater to allocating less than 512k*max_connections of shared buffers, but I
expect the cycles in question are negligible here.

Yea, I can't imagine it'd matter, compared to the other costs. Arguably it'd
allow us to crank up the maximum batch size further, even.
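
The prototype itself is not reproduced here. As a rough illustration of the general idea of not letting one backend pin more than its share of a small buffer pool, the sketch below caps a requested batch of pins in proportion to the pool size. The function name, the parameters and the exact formula are assumptions made for this example and may well differ from the patch.

```python
def limit_additional_pins(wanted, n_buffers, max_backends, already_pinned=0):
    """Cap a batch of new buffer pins to a proportional share of the pool.

    Hypothetical helper: each backend may hold roughly
    n_buffers / max_backends pins, but is always allowed at least one new
    pin so that it can make progress.
    """
    share = n_buffers // max_backends
    allowed = max(share - already_pinned, 1)
    return min(wanted, allowed)

# With a 16-page pool (as in the local repro) and, say, 4 backends,
# a 32-page batch is trimmed to the backend's share.
print(limit_additional_pins(32, n_buffers=16, max_backends=4))
# With a large pool the requested batch size is left alone.
print(limit_additional_pins(32, n_buffers=16384, max_backends=100))
```

The division happens once per batch, which is the kind of post-startup cost being discussed.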

Greetings,

Andres Freund

#45Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#44)
Re: Direct I/O

On Sun, Apr 09, 2023 at 02:45:16PM -0700, Andres Freund wrote:

On 2023-04-08 21:29:54 -0700, Noah Misch wrote:

On Sat, Apr 08, 2023 at 11:08:16AM -0700, Andres Freund wrote:

On 2023-04-07 23:04:08 -0700, Andres Freund wrote:

If you look at log_newpage_range(), it's not surprising that we get this error
- it pins up to 32 buffers at once.

Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
c6b92041d385.

Do we care about fixing this in the backbranches? Probably not, given there
haven't been user complaints?

I would not. This is only going to come up where the user goes out of the way
to use near-minimum shared_buffers.

It's not *just* that scenario. With a few concurrent connections you can get
into problematic territory even with halfway reasonable shared buffers.

I am not familiar with such cases. You could get there with 64MB shared
buffers and 256 simultaneous commits of new-relfilenode-creating transactions,
but I'd still file that under going out of one's way to use tiny shared
buffers relative to the write activity. What combination did you envision?
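
The figures in that example can be checked with a little arithmetic, assuming the default 8kB block size and the 32-buffer batch mentioned upthread for log_newpage_range(). The variable names are only for illustration.

```python
BLCKSZ = 8192                 # default PostgreSQL block size in bytes
PINS_PER_CALL = 32            # batch size of log_newpage_range(), per upthread

shared_buffers_bytes = 64 * 1024 * 1024
n_buffers = shared_buffers_bytes // BLCKSZ      # buffers in a 64MB pool
backends = 256
pins_needed = backends * PINS_PER_CALL          # pins if all batch at once

print(n_buffers, pins_needed, pins_needed >= n_buffers)
```

Under these assumptions the two numbers come out equal, so 256 backends each pinning a full batch would account for every buffer in the pool.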

#46Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#43)
Re: Direct I/O

On Mon, Apr 10, 2023 at 8:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Boy, it's hard to look at that trace and not call it a filesystem bug.

Agreed.

Given the apparent dependency on COW, I wonder if this has something
to do with getting confused about which copy is current?

Yeah, I suppose it would require bogus old page versions (or I guess
alternatively completely mixed up page offsets) rather than bogus
zeroed pages to explain the too-high count observed in one of crake's
failed runs: I guess it counted some pre-updated tuples that were
supposed to be deleted and then counted the post-updated tuples on
later pages (insert joke about the Easter variant of the Halloween
problem). It's just that in the runs I've managed to observe and
analyse, the previous version always happened to be zeros.

#47Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#39)
Re: Direct I/O

Hi,

On 2023-04-10 00:17:12 +1200, Thomas Munro wrote:

I think there are two separate bad phenomena.

1. A concurrent modification of the user space buffer while writing
breaks the checksum so you can't read the data back in. I can
reproduce that with a stand-alone program, attached. The "verifier"
process occasionally reports EIO while reading, unless you comment out
the "scribbler" process's active line. The system log/dmesg gets some
warnings.
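
To make the mechanism concrete without needing a filesystem, here is a small in-memory model of it. It is not the attached program: it assumes the checksum is computed before the data is copied out, and it uses a callback in place of the "scribbler" process. All names are invented for the illustration.

```python
import zlib

PAGE = 8192

def checksummed_write(buf, scribble=None):
    """Simulate a checksumming filesystem writing from a user buffer.

    The checksum is taken first, then the data is copied "to disk". If
    scribble is given it runs in between, standing in for another process
    that modifies the buffer while the write is in flight.
    """
    checksum = zlib.crc32(buf)
    if scribble is not None:
        scribble(buf)
    on_disk = bytes(buf)
    return on_disk, checksum

def verify(on_disk, checksum):
    """Return True if the stored data still matches the stored checksum."""
    return zlib.crc32(on_disk) == checksum

def flip_first_byte(buf):
    buf[0] ^= 0xFF

page = bytearray(PAGE)
print(verify(*checksummed_write(page)))                   # quiet buffer
print(verify(*checksummed_write(page, flip_first_byte)))  # scribbled buffer
```

A failed verify() here corresponds to the EIO that the "verifier" process sees on read.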

I think we really need to think about whether we eventually want to do
something to avoid modifying pages while IO is in progress. The only
alternative is for filesystems to make copies of everything in the IO path,
which is far from free (and obviously prevents using DMA for the whole
IO). The copy we do to avoid the same problem when checksums are enabled
shows up quite prominently in write-heavy profiles, so there's a "purely
postgres" reason to avoid these issues too.
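
For illustration, the copy-before-write pattern looks roughly like the sketch below: take a private snapshot of the page, checksum the snapshot, and hand only the snapshot to the kernel. This is a simplified model with invented names, using CRC32 as a stand-in checksum, and is not PostgreSQL's actual code.

```python
import os
import tempfile
import zlib

PAGE = 8192

def write_stable_copy(fd, shared_page, offset):
    """Write a private snapshot of a page that others may still modify.

    Returns the checksum of exactly the bytes handed to the kernel, so the
    checksum and the written data cannot disagree.
    """
    snapshot = bytes(shared_page)          # the extra copy discussed above
    checksum = zlib.crc32(snapshot)
    os.pwrite(fd, snapshot, offset)
    return checksum

with tempfile.TemporaryFile() as f:
    page = bytearray(b"x" * PAGE)
    checksum = write_stable_copy(f.fileno(), page, 0)
    page[0] = 0                            # later hint-bit style change
    stored = os.pread(f.fileno(), PAGE, 0)
    print(zlib.crc32(stored) == checksum)
```

The bytes() call is the memcpy-sized cost that shows up in profiles; avoiding concurrent modification would make it unnecessary.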

2. The crake-style failure doesn't involve any reported checksum
failures or errors, and I'm not sure if another process is even
involved. I attach a complete syscall trace of a repro session. (I
tried to get strace to dump 8192 byte strings, but then it doesn't
repro, so we have only the start of the data transferred for each
page.) Working back from the error message,

ERROR: invalid page in block 78 of relation base/5/16384,

we have a page at offset 638976, and we can find all system calls that
touched that offset:

[pid 26031] 23:26:48.521123 pwritev(50,
[{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=8192}], 1, 638976) = 8192

[pid 26040] 23:26:48.568975 pwrite64(5,
"\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4
\0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192

[pid 26040] 23:26:48.593157 pread64(6,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
8192, 638976) = 8192
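
The offset in these calls follows directly from the block number in the error message. A small sketch of the mapping, assuming the default 8kB block size and 1GB segment files (the function name is just for illustration):

```python
BLCKSZ = 8192          # default block size in bytes
RELSEG_SIZE = 131072   # blocks per 1GB segment file with default settings

def block_location(block_number):
    """Return (segment number, byte offset within that segment file)."""
    segment, block_in_segment = divmod(block_number, RELSEG_SIZE)
    return segment, block_in_segment * BLCKSZ

print(block_location(78))
```

Block 78 lands in the first segment, base/5/16384, at byte 78 * 8192 = 638976.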

In between the write of non-zeros and the read of zeros, nothing seems
to happen that could justify that, that I can grok, but perhaps
someone else will see something that I'm missing. We pretty much just
have the parallel worker scanning the table, and writing stuff out as
it does it. This was obtained with:

Have you tried to write a reproducer for this that doesn't involve postgres?
It'd certainly be interesting to know the precise conditions for this. E.g.,
can this also happen without O_DIRECT, if cache pressure is high enough for
the page to get evicted soon after (potentially simulated with fadvise or
such)?
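
As a starting point for the buffered-I/O variant of that question, something along these lines could be looped under load. It is only a skeleton with invented names: it writes a page, flushes it, asks the kernel to drop it from the page cache with posix_fadvise, and reads it back. POSIX_FADV_DONTNEED is advisory, so eviction is not guaranteed, and I make no claim that this reproduces the problem.

```python
import os
import tempfile

PAGE = 8192

def write_evict_read(path, offset, payload):
    """Write a page, try to push it out of the page cache, read it back."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.pwrite(fd, payload, offset)
        os.fsync(fd)  # clean pages can be dropped; dirty ones cannot
        os.posix_fadvise(fd, offset, len(payload), os.POSIX_FADV_DONTNEED)
        return os.pread(fd, len(payload), offset)
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    page = bytes([1]) * PAGE
    back = write_evict_read(os.path.join(d, "probe"), 78 * PAGE, page)
    print("match" if back == page else "MISMATCH")
```

To test a particular filesystem, point the temporary directory at a path on that mount.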

We should definitely let the btrfs folks know of this issue... It's possible
that this bug was recently introduced even. What kernel version did you repro
this on, Thomas?

I wonder if we should have a postgres-io-torture program in our tree for some
of these things. We've found issues with our assumptions on several operating
systems and filesystems, without systematically looking. Or even stressing IO
all that hard in our tests.

Greetings,

Andres Freund

#48Andrea Gelmini
andrea.gelmini@gmail.com
In reply to: Andres Freund (#47)
Re: Direct I/O

On Mon, 10 Apr 2023 at 04:58, Andres Freund <andres@anarazel.de> wrote:

We should definitely let the btrfs folks know of this issue... It's possible
that this bug was recently introduced even. What kernel version did you repro
this on, Thomas?

The BTRFS mailing list is currently discussing a Direct I/O data
corruption issue. There is no patch at the moment; they are still working
out how to address it:
https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/

Ciao,
Gelma

#49Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#47)
Re: Direct I/O

On Mon, Apr 10, 2023 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:

Have you tried to write a reproducer for this that doesn't involve postgres?

I tried a bit. I'll try harder soon.

... What kernel version did you repro
this on Thomas?

Debian's 6.0.10-2 kernel (Debian 12 on a random laptop). Here's how I
set up a test btrfs in case someone else wants a head start:

truncate -s2G 2GB.img
sudo losetup --show --find 2GB.img
sudo mkfs -t btrfs /dev/loop0 # the device name shown by losetup
sudo mkdir /mnt/tmp
sudo mount /dev/loop0 /mnt/tmp
sudo chown $(whoami) /mnt/tmp

cd /mnt/tmp
/path/to/initdb -D pgdata
... (see instructions further up for postgres command line + queries to run)

#50Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#49)
Re: Direct I/O

On Mon, Apr 10, 2023 at 7:27 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Debian's 6.0.10-2 kernel (Debian 12 on a random laptop).

Realising I hadn't updated for a bit, I did so and it still reproduces on:

$ uname -a
Linux x1 6.1.0-7-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.20-1
(2023-03-19) x86_64 GNU/Linux

#51Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#49)
Re: Direct I/O

Hi,

On 2023-04-10 19:27:27 +1200, Thomas Munro wrote:

On Mon, Apr 10, 2023 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:

Have you tried to write a reproducer for this that doesn't involve postgres?

I tried a bit. I'll try harder soon.

... What kernel version did you repro
this on Thomas?

Debian's 6.0.10-2 kernel (Debian 12 on a random laptop). Here's how I
set up a test btrfs in case someone else wants a head start:

truncate -s2G 2GB.img
sudo losetup --show --find 2GB.img
sudo mkfs -t btrfs /dev/loop0 # the device name shown by losetup
sudo mkdir /mnt/tmp
sudo mount /dev/loop0 /mnt/tmp
sudo chown $(whoami) /mnt/tmp

cd /mnt/tmp
/path/to/initdb -D pgdata
... (see instructions further up for postgres command line + queries to run)

I initially failed to repro the issue with these instructions. Turns out that
the problem does not happen if huge pages are in use - I'd configured huge
pages, so the default huge_pages=try succeeded. As soon as I disable
huge_pages explicitly, it reproduces.

Another interesting bit is that if checksums are enabled, I also can't
reproduce the issue. Together with the huge_page issue, it does suggest that
this is somehow related to page faults. Which fits with the thread Andrea
referenced...

Greetings,

Andres Freund

#52Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#51)
Re: Direct I/O

Hi,

On 2023-04-10 18:55:26 -0700, Andres Freund wrote:

On 2023-04-10 19:27:27 +1200, Thomas Munro wrote:

On Mon, Apr 10, 2023 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:

Have you tried to write a reproducer for this that doesn't involve postgres?

I tried a bit. I'll try harder soon.

... What kernel version did you repro
this on Thomas?

Debian's 6.0.10-2 kernel (Debian 12 on a random laptop). Here's how I
set up a test btrfs in case someone else wants a head start:

truncate -s2G 2GB.img
sudo losetup --show --find 2GB.img
sudo mkfs -t btrfs /dev/loop0 # the device name shown by losetup
sudo mkdir /mnt/tmp
sudo mount /dev/loop0 /mnt/tmp
sudo chown $(whoami) /mnt/tmp

cd /mnt/tmp
/path/to/initdb -D pgdata
... (see instructions further up for postgres command line + queries to run)

I initially failed to repro the issue with these instructions. Turns out that
the problem does not happen if huge pages are in use - I'd configured huge
pages, so the default huge_pages=try succeeded. As soon as I disable
huge_pages explicitly, it reproduces.

Another interesting bit is that if checksums are enabled, I also can't
reproduce the issue. Together with the huge_page issue, it does suggest that
this is somehow related to page faults. Which fits with the thread Andrea
referenced...

The last iteration of the fix that I could find is:
https://lore.kernel.org/linux-btrfs/20230328051957.1161316-1-hch@lst.de/T/#m1afdc3fe77e10a97302e7d80fed3efeaa297f0f7

And the fix has been merged into
https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/log/?h=for-next

I think that means it'll have to wait for 6.4 development to open (in a few
weeks), and then will be merged into the stable branches from there.

Greetings,

Andres Freund

#53Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#52)
Re: Direct I/O

On Tue, Apr 11, 2023 at 2:15 PM Andres Freund <andres@anarazel.de> wrote:

And the fix has been merged into
https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/log/?h=for-next

I think that means it'll have to wait for 6.4 development to open (in a few
weeks), and then will be merged into the stable branches from there.

Great! Let's hope/assume for now that that'll fix phenomenon #2.
That still leaves the checksum-vs-concurrent-modification thing that I
called phenomenon #1, which we've not actually hit with PostgreSQL yet
but is clearly possible and can be seen with the stand-alone
repro-program I posted upthread. You wrote:

On Mon, Apr 10, 2023 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:

I think we really need to think about whether we eventually want to do
something to avoid modifying pages while IO is in progress. The only
alternative is for filesystems to make copies of everything in the IO path,
which is far from free (and obviously prevents using DMA for the whole
IO). The copy we do to avoid the same problem when checksums are enabled
shows up quite prominently in write-heavy profiles, so there's a "purely
postgres" reason to avoid these issues too.

+1

I wonder what the other file systems that maintain checksums (see list
at [1]) do when the data changes underneath a write. ZFS's policy is
conservative[2], while BTRFS took the demons-will-fly-out-of-your-nose
route. I can see arguments for both approaches (ZFS can only reach
zero-copy optimum by turning off checksums completely, while BTRFS is
happy to assume that if you break this programming rule that is not
written down anywhere then you must never want to see your data ever
again). What about ReFS? CephFS?

I tried to find out what POSIX says about this WRT synchronous
pwrite() (as Tom suggested, maybe we're doing something POSIX doesn't
allow), but couldn't find it in my first attempt. It *does* say it's
undefined for aio_write() (which means that my prototype
io_method=posix_aio code that uses that stuff is undefined in the presence
of hint bit modifications). I don't really see why it should vary
between synchronous and asynchronous interfaces (considering the
existence of threads, shared memory etc, the synchronous interface
only removes one thread from the list of possible suspects that could flip
some bits).

But yeah, in any case, it doesn't seem great that we do that.

[1]: https://en.wikipedia.org/wiki/Comparison_of_file_systems#Block_capabilities
[2]: https://openzfs.topicbox.com/groups/developer/T950b02acdf392290/odirect-semantics-in-zfs

#54Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#53)
Re: Direct I/O

On Tue, Apr 11, 2023 at 2:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I tried to find out what POSIX says about this

(But of course whatever it might say is of especially limited value
when O_DIRECT is in the picture, being completely unstandardised.
Really I guess all they meant was "if you *copy* something that's
moving, who knows which bits you'll copy"... not "your data might be
incinerated with lasers".)

#55Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#45)
Re: Direct I/O

Hi,

On 2023-04-09 16:40:54 -0700, Noah Misch wrote:

On Sun, Apr 09, 2023 at 02:45:16PM -0700, Andres Freund wrote:

It's not *just* that scenario. With a few concurrent connections you can get
into problematic territory even with halfway reasonable shared buffers.

I am not familiar with such cases. You could get there with 64MB shared
buffers and 256 simultaneous commits of new-relfilenode-creating transactions,
but I'd still file that under going out of one's way to use tiny shared
buffers relative to the write activity. What combination did you envision?

I'd not say it's common, but it's less crazy than running with 128kB of s_b...

There's also the issue that log_newpage_range() is used in a number of places
where we could have a lot of pre-existing buffer pins. So pinning another 64
buffers could tip us over.

Greetings,

Andres Freund

#56Christoph Berg
myon@debian.org
In reply to: Thomas Munro (#1)
Re: Direct I/O

Hi,

I'm hitting a panic in t_004_io_direct. The build is running on
overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
combination but has worked well for building everything over the last
decade. On Debian unstable:

PANIC: could not open file "pg_wal/000000010000000000000001": Invalid argument

16:21:16 Bailout called. Further testing stopped: pg_ctl start failed
16:21:16 t/004_io_direct.pl ..............
16:21:16 Dubious, test returned 255 (wstat 65280, 0xff00)
16:21:16 No subtests run
16:21:16
16:21:16 Test Summary Report
16:21:16 -------------------
16:21:16 t/004_io_direct.pl (Wstat: 65280 (exited 255) Tests: 0 Failed: 0)
16:21:16 Non-zero exit status: 255
16:21:16 Parse errors: No plan found in TAP output
16:21:16 Files=4, Tests=65, 9 wallclock secs ( 0.03 usr 0.02 sys + 3.78 cusr 1.48 csys = 5.31 CPU)
16:21:16 Result: FAIL

16:21:16 ******** build/src/test/modules/test_misc/tmp_check/log/004_io_direct_main.log ********
16:21:16 2023-04-11 23:21:16.431 UTC [25991] LOG: starting PostgreSQL 16devel (Debian 16~~devel-1.pgdg+~20230411.2256.gc03c2ea) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
16:21:16 2023-04-11 23:21:16.431 UTC [25991] LOG: listening on Unix socket "/tmp/s0C4hWQq82/.s.PGSQL.54693"
16:21:16 2023-04-11 23:21:16.433 UTC [25994] LOG: database system was shut down at 2023-04-11 23:21:16 UTC
16:21:16 2023-04-11 23:21:16.434 UTC [25994] PANIC: could not open file "pg_wal/000000010000000000000001": Invalid argument
16:21:16 2023-04-11 23:21:16.525 UTC [25991] LOG: startup process (PID 25994) was terminated by signal 6: Aborted
16:21:16 2023-04-11 23:21:16.525 UTC [25991] LOG: aborting startup due to startup process failure
16:21:16 2023-04-11 23:21:16.526 UTC [25991] LOG: database system is shut down

16:21:16 ******** build/src/test/modules/test_misc/tmp_check/t_004_io_direct_main_data/pgdata/core ********
16:21:17
16:21:17 warning: Can't open file /dev/shm/PostgreSQL.3457641370 during file-backed mapping note processing
16:21:17
16:21:17 warning: Can't open file /dev/shm/PostgreSQL.2391834648 during file-backed mapping note processing
16:21:17
16:21:17 warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing
16:21:17
16:21:17 warning: Can't open file /SYSV00000dea (deleted) during file-backed mapping note processing
16:21:17 [New LWP 25994]
16:21:17 [Thread debugging using libthread_db enabled]
16:21:17 Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
16:21:17 Core was generated by `postgres: main: startup '.
16:21:17 Program terminated with signal SIGABRT, Aborted.
16:21:17 #0 0x00007f176c591ccc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
16:21:17 #0 0x00007f176c591ccc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
16:21:17 No symbol table info available.
16:21:17 #1 0x00007f176c542ef2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
16:21:17 No symbol table info available.
16:21:17 #2 0x00007f176c52d472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
16:21:17 No symbol table info available.
16:21:17 #3 0x000055a7ba7978a1 in errfinish (filename=<optimized out>, lineno=<optimized out>, funcname=0x55a7ba810560 <__func__.47> "XLogFileInitInternal") at ./build/../src/backend/utils/error/elog.c:604
16:21:17 edata = 0x55a7baae3e20 <errordata>
16:21:17 elevel = 23
16:21:17 oldcontext = 0x55a7bb471590
16:21:17 econtext = 0x0
16:21:17 __func__ = "errfinish"
16:21:17 #4 0x000055a7ba21759c in XLogFileInitInternal (logsegno=1, logtli=logtli@entry=1, added=added@entry=0x7ffebc6c8a3f, path=path@entry=0x7ffebc6c8a40 "pg_wal/00000001", '0' <repeats 15 times>, "1") at ./build/../src/backend/access/transam/xlog.c:2944
16:21:17 __errno_location = <optimized out>
16:21:17 tmppath = "0\214l\274\376\177\000\000\321\330~\272\247U\000\000\005Q\223\272\247U\000\000p\214l\274\376\177\000\000`\214l\274\376\177\000\000\212\335~\000\v", '\000' <repeats 31 times>, "\247U\000\000\000\000\000\000\000\177\000\000*O\202\272\247U\000\000\254\206l\274\376\177\000\000\000\000\000\000\v", '\000' <repeats 23 times>, "0\000\000\000\000\000\000\000\247U\000\000\000\000\000\000\000\000\000\000\001Q\223\272\247U\000\000\240\215l\274\376\177\000\000\376\377\377\377\000\000\000\0000\207l\274\376\177\000\000[\326~\272\247U\000\0000\207l\274\376\177\000\000"...
16:21:17 installed_segno = 0
16:21:17 max_segno = <optimized out>
16:21:17 fd = <optimized out>
16:21:17 save_errno = <optimized out>
16:21:17 open_flags = 194
16:21:17 __func__ = "XLogFileInitInternal"
16:21:17 #5 0x000055a7ba35a1d5 in XLogFileInit (logsegno=<optimized out>, logtli=logtli@entry=1) at ./build/../src/backend/access/transam/xlog.c:3099
16:21:17 ignore_added = false
16:21:17 path = "pg_wal/00000001", '0' <repeats 15 times>, "1\000\220\312P\273\247U\000\000/\375Yl\027\177\000\000\220\252P\273\247U\000\000\001", '\000' <repeats 15 times>, "\220\252P\273\247U\000\000\300\212l\274\376\177\000\000\002\261{\272\247U\000\000\220\252P\273\247U\000\000\220\252P\273\247U\000\000\001", '\000' <repeats 15 times>, "\340\212l\274\376\177\000\000\021\032|\272\247U\000\000\000\000\000\000\000\000\000\000\240\312P\273\247U\000\0000\213l\274\376\177\000\000\350\262{\272\247U\000\000\001", '\000' <repeats 16 times>, "\256\023i\027\177\000\000"...
16:21:17 fd = <optimized out>
16:21:17 __func__ = "XLogFileInit"
16:21:17 #6 0x000055a7ba35bab3 in XLogWrite (WriteRqst=..., tli=tli@entry=1, flexible=flexible@entry=false) at ./build/../src/backend/access/transam/xlog.c:2137
16:21:17 EndPtr = 21954560
16:21:17 ispartialpage = true
16:21:17 last_iteration = <optimized out>
16:21:17 finishing_seg = <optimized out>
16:21:17 curridx = 7
16:21:17 npages = 0
16:21:17 startidx = 0
16:21:17 startoffset = 0
16:21:17 __func__ = "XLogWrite"
16:21:17 #7 0x000055a7ba35c8e0 in XLogFlush (record=21949600) at ./build/../src/backend/access/transam/xlog.c:2638
16:21:17 insertpos = 21949600
16:21:17 WriteRqstPtr = 21949600
16:21:17 WriteRqst = <optimized out>
16:21:17 insertTLI = 1
16:21:17 __func__ = "XLogFlush"
16:21:17 #8 0x000055a7ba36118e in XLogReportParameters () at ./build/../src/backend/access/transam/xlog.c:7620
16:21:17 xlrec = {MaxConnections = 100, max_worker_processes = 8, max_wal_senders = 0, max_prepared_xacts = 0, max_locks_per_xact = 64, wal_level = 1, wal_log_hints = false, track_commit_timestamp = false}
16:21:17 recptr = <optimized out>
16:21:17 #9 StartupXLOG () at ./build/../src/backend/access/transam/xlog.c:5726
16:21:17 Insert = <optimized out>
16:21:17 checkPoint = <optimized out>
16:21:17 wasShutdown = true
16:21:17 didCrash = <optimized out>
16:21:17 haveTblspcMap = false
16:21:17 haveBackupLabel = false
16:21:17 EndOfLog = 21949544
16:21:17 EndOfLogTLI = <optimized out>
16:21:17 newTLI = 1
16:21:17 performedWalRecovery = <optimized out>
16:21:17 endOfRecoveryInfo = <optimized out>
16:21:17 abortedRecPtr = <optimized out>
16:21:17 missingContrecPtr = 0
16:21:17 oldestActiveXID = <optimized out>
16:21:17 promoted = false
16:21:17 __func__ = "StartupXLOG"
16:21:17 #10 0x000055a7ba5b4d00 in StartupProcessMain () at ./build/../src/backend/postmaster/startup.c:267
16:21:17 No locals.
16:21:17 #11 0x000055a7ba5ab0cf in AuxiliaryProcessMain (auxtype=auxtype@entry=StartupProcess) at ./build/../src/backend/postmaster/auxprocess.c:141
16:21:17 __func__ = "AuxiliaryProcessMain"
16:21:17 #12 0x000055a7ba5b0aa3 in StartChildProcess (type=StartupProcess) at ./build/../src/backend/postmaster/postmaster.c:5369
16:21:17 pid = <optimized out>
16:21:17 __func__ = "StartChildProcess"
16:21:17 save_errno = <optimized out>
16:21:17 __errno_location = <optimized out>
16:21:17 __errno_location = <optimized out>
16:21:17 __errno_location = <optimized out>
16:21:17 __errno_location = <optimized out>
16:21:17 __errno_location = <optimized out>
16:21:17 __errno_location = <optimized out>
16:21:17 __errno_location = <optimized out>
16:21:17 #13 0x000055a7ba5b45d6 in PostmasterMain (argc=argc@entry=4, argv=argv@entry=0x55a7bb471450) at ./build/../src/backend/postmaster/postmaster.c:1455
16:21:17 opt = <optimized out>
16:21:17 status = <optimized out>
16:21:17 userDoption = <optimized out>
16:21:17 listen_addr_saved = <optimized out>
16:21:17 i = <optimized out>
16:21:17 output_config_variable = <optimized out>
16:21:17 __func__ = "PostmasterMain"
16:21:17 #14 0x000055a7ba29fd62 in main (argc=4, argv=0x55a7bb471450) at ./build/../src/backend/main/main.c:200
16:21:17 do_check_root = <optimized out>

Apologies if this was already reported elsewhere in the thread, I
skimmed it but the problems looked different.

Christoph

#57Thomas Munro
thomas.munro@gmail.com
In reply to: Christoph Berg (#56)
Re: Direct I/O

On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg <myon@debian.org> wrote:

I'm hitting a panic in t_004_io_direct. The build is running on
overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
combination but has worked well for building everything over the last
decade. On Debian unstable:

PANIC: could not open file "pg_wal/000000010000000000000001": Invalid argument

Hi Christoph,

That's an interesting one. I was half expecting to see that on some
unusual systems, which is why I made the test check which OS it is and
exclude those that are known to fail with EINVAL or ENOTSUPP on their
common/typical file systems. But if it's going to be Linux, that's
not going to work. I have a new idea: perhaps it is possible to try
to open a file with O_DIRECT from perl, and if it fails like that,
skip the test. Looking into that now.
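
The pre-flight idea can also be sketched in C, the language of the backend itself. This is only an illustration under stated assumptions: probe_o_direct() is a hypothetical helper, not a PostgreSQL function, and the test itself would do its probing from Perl. The sketch first checks at compile time whether <fcntl.h> defines O_DIRECT at all, then tries to create a scratch file with it, so that a file system that rejects the flag (EINVAL in the report above) is detected up front rather than through a later PANIC.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* lets glibc's <fcntl.h> expose O_DIRECT */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of PostgreSQL: try to create a scratch file
 * with O_DIRECT at the given path.
 *
 * Returns 0 if the open succeeded, the (positive) errno value if the open
 * failed, or -1 if this platform's <fcntl.h> does not define O_DIRECT.
 */
int
probe_o_direct(const char *path)
{
#ifdef O_DIRECT
    int         fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0600);

    if (fd < 0)
        return errno;           /* e.g. EINVAL when the file system refuses */

    /* The probe worked; clean up the scratch file again. */
    close(fd);
    unlink(path);
    return 0;
#else
    (void) path;                /* no O_DIRECT on this platform */
    return -1;
#endif
}
```

A caller would skip the direct I/O test whenever the result is non-zero, and could pass a positive result to strerror() to log why.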

#58Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#57)
1 attachment(s)
Re: Direct I/O

On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg <myon@debian.org> wrote:

I'm hitting a panic in t_004_io_direct. The build is running on
overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
combination but has worked well for building everything over the last
decade. On Debian unstable:

PANIC: could not open file "pg_wal/000000010000000000000001": Invalid argument

... I have a new idea: perhaps it is possible to try
to open a file with O_DIRECT from perl, and if it fails like that,
skip the test. Looking into that now.

I think I have that working OK. Any Perl hackers want to comment on
my use of IO::File (copied from examples on the internet that showed
how to use O_DIRECT)? I am not much of a perl hacker but according to
my package manager, IO/File.pm came with perl itself. And the Fcntl
eval trick that I copied from File::stat, and the perl-critic
suppression that requires?

I tested this on OpenBSD, which has no O_DIRECT, so that tests the
first reason to skip.

Does it skip OK on your system, for the second reason? Should we be
more specific about the errno?

As far as I know, the only systems on the build farm that should be
affected by this change are the illumos boxen (they have O_DIRECT,
unlike Solaris, but perl's $^O couldn't tell the difference between
Solaris and illumos, so they didn't previously run this test).

One thing I resisted the urge to do is invent PG_TEST_SKIP, a sort of
anti-PG_TEST_EXTRA. I think I'd rather hear about it if there is a
system out there that passes the pre-flight check, but fails later on,
because we'd better investigate why. That's basically the point of
shipping this "developer only" feature long before serious use of it.

Attachments:

0001-Skip-the-004_io_direct.pl-test-if-a-pre-flight-check.patch (text/x-patch; charset=US-ASCII)
From 72e865835bcf1c9dce2090de0da66839908133c6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 12 Apr 2023 16:26:54 +1200
Subject: [PATCH] Skip the 004_io_direct.pl test if a pre-flight check fails.

Previously the test had a list of OSes that direct I/O was expected to
work on, with the idea that we could easily adjust that if problems
showed up on less common systems.  That didn't take account of the
possibility of running the tests on an OS where it works on typical
filesystems (eg all the systems in our build farm), but doesn't work for
unusual systems like overlayfs, as Christoph discovered.

The new approach is to try to create a file with O_DIRECT from perl.  If
that fails, we'll log the errno and skip the whole test.

Reported-by: Christoph Berg <myon@debian.org>
Discussion: https://postgr.es/m/ZDYd4A78cT2ULxZZ%40msg.df7cb.de

diff --git a/src/test/modules/test_misc/t/004_io_direct.pl b/src/test/modules/test_misc/t/004_io_direct.pl
index 5a2dd0d288..467b3bfc02 100644
--- a/src/test/modules/test_misc/t/004_io_direct.pl
+++ b/src/test/modules/test_misc/t/004_io_direct.pl
@@ -2,19 +2,44 @@
 
 use strict;
 use warnings;
+use Fcntl;
+use IO::File;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
-# Systems that we know to have direct I/O support, and whose typical local
-# filesystems support it or at least won't fail with an error.  (illumos should
-# probably be in this list, but perl reports it as solaris.  Solaris should not
-# be in the list because we don't support its way of turning on direct I/O, and
-# even if we did, its version of ZFS rejects it, and OpenBSD just doesn't have
-# it.)
-if (!grep { $^O eq $_ } qw(aix darwin dragonfly freebsd linux MSWin32 netbsd))
+# We know that macOS has F_NOCACHE, and we know that Windows has
+# FILE_FLAG_NO_BUFFERING, and we assume that their typical file systems will
+# accept those flags.  For every other system, we'll probe for O_DIRECT
+# support.
+
+if ($^O ne 'darwin' && $^O ne 'MSWin32')
 {
-	plan skip_all => "no direct I/O support";
+	# Perl's Fcntl module knows if this system's <fcntl.h> has O_DIRECT but can
+	# only tell us by reporting an error, so we copy a trick from File/stat.pm
+	# and probe for the definition with eval.
+	no strict 'refs';    ## no critic (ProhibitNoStrict)
+	my $val = eval { &{'Fcntl::O_DIRECT'} };
+	if (defined $val)
+	{
+		use Fcntl qw(O_DIRECT);
+
+		# Next we want to find out if we can successfully open a file using
+		# that on the present filesystem.
+		my $f = IO::File->new(
+			"${PostgreSQL::Test::Utils::tmp_check}/test_o_direct_file",
+			O_RDWR | O_DIRECT | O_CREAT);
+		if (!$f)
+		{
+			plan skip_all =>
+			  "pre-flight attempt to open file with O_DIRECT fails with errno $!";
+		}
+		$f->close;
+	}
+	else
+	{
+		plan skip_all => "no O_DIRECT";
+	}
 }
 
 my $node = PostgreSQL::Test::Cluster->new('main');
-- 
2.39.2

#59Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#58)
Re: Direct I/O

On Wed, Apr 12, 2023 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg <myon@debian.org> wrote:

I'm hitting a panic in t_004_io_direct. The build is running on
overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
combination but has worked well for building everything over the last
decade. On Debian unstable:

After trying a couple of things and doing some googling, it looks like
it's tmpfs that rejects it, not overlayfs, so I'd adjust that commit
message slightly. Of course it's a completely reasonable thing to
expect the tests to pass (or in this case be skipped) in a tmpfs, eg
/tmp on some distributions. (It's strange to contemplate what
O_DIRECT means for tmpfs, considering that it *is* the page cache,
kinda, and I see people have been arguing about that for a couple of
decades since O_DIRECT was added to Linux; doesn't seem that helpful
to me that it rejects it, but 🤷).

#60Andrew Dunstan
andrew@dunslane.net
In reply to: Thomas Munro (#58)
Re: Direct I/O

On 2023-04-12 We 01:48, Thomas Munro wrote:

On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro<thomas.munro@gmail.com> wrote:

On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg<myon@debian.org> wrote:

I'm hitting a panic in t_004_io_direct. The build is running on
overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
combination but has worked well for building everything over the last
decade. On Debian unstable:

PANIC: could not open file "pg_wal/000000010000000000000001": Invalid argument

... I have a new idea: perhaps it is possible to try
to open a file with O_DIRECT from perl, and if it fails like that,
skip the test. Looking into that now.

I think I have that working OK. Any Perl hackers want to comment on
my use of IO::File (copied from examples on the internet that showed
how to use O_DIRECT)? I am not much of a perl hacker but according to
my package manager, IO/File.pm came with perl itself. And the Fcntl
eval trick that I copied from File::stat, and the perl-critic
suppression that requires?

I think you can probably replace a lot of the magic here by simply saying

if (Fcntl->can("O_DIRECT")) ...

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#61Dagfinn Ilmari Mannsåker
In reply to: Thomas Munro (#58)
Re: Direct I/O

Thomas Munro <thomas.munro@gmail.com> writes:

I think I have that working OK. Any Perl hackers want to comment on
my use of IO::File (copied from examples on the internet that showed
how to use O_DIRECT)? I am not much of a perl hacker but according to
my package manager, IO/File.pm came with perl itself.

Indeed, and it has been since perl 5.003_07, released in 1996. And Fcntl
has known about O_DIRECT since perl 5.6.0, released in 2000, so we can
safely use both.

And the Fcntl eval trick that I copied from File::stat, and the
perl-critic suppression that requires?

[…]

+	no strict 'refs';    ## no critic (ProhibitNoStrict)
+	my $val = eval { &{'Fcntl::O_DIRECT'} };
+	if (defined $val)

This trick is only needed in File::stat because it's constructing the
symbol name dynamically. And because Fcntl by default exports all the
O_* and F_* constants it knows about, we can simply do:

if (defined &O_DIRECT)

+ {
+ use Fcntl qw(O_DIRECT);

The `use Fcntl;` above will already have imported this, so this is
redundant.

- ilmari

#62Dagfinn Ilmari Mannsåker
In reply to: Andrew Dunstan (#60)
Re: Direct I/O

Andrew Dunstan <andrew@dunslane.net> writes:

On 2023-04-12 We 01:48, Thomas Munro wrote:

On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro<thomas.munro@gmail.com> wrote:

On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg<myon@debian.org> wrote:

I'm hitting a panic in t_004_io_direct. The build is running on
overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
combination but has worked well for building everything over the last
decade. On Debian unstable:

PANIC: could not open file "pg_wal/000000010000000000000001": Invalid argument

... I have a new idea: perhaps it is possible to try
to open a file with O_DIRECT from perl, and if it fails like that,
skip the test. Looking into that now.

I think I have that working OK. Any Perl hackers want to comment on
my use of IO::File (copied from examples on the internet that showed
how to use O_DIRECT)? I am not much of a perl hacker but according to
my package manager, IO/File.pm came with perl itself. And the Fcntl
eval trick that I copied from File::stat, and the perl-critic
suppression that requires?

I think you can probably replace a lot of the magic here by simply saying

if (Fcntl->can("O_DIRECT")) ...

Fcntl->can() is true for all constants that Fcntl knows about, whether
or not they are defined for your OS. `defined &O_DIRECT` is the simplest
check, see my other reply to Thomas.

cheers

andrew

- ilmari

#63Andrew Dunstan
andrew@dunslane.net
In reply to: Dagfinn Ilmari Mannsåker (#62)
Re: Direct I/O

On 2023-04-12 We 10:23, Dagfinn Ilmari Mannsåker wrote:

Andrew Dunstan<andrew@dunslane.net> writes:

On 2023-04-12 We 01:48, Thomas Munro wrote:

On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro<thomas.munro@gmail.com> wrote:

On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg<myon@debian.org> wrote:

I'm hitting a panic in t_004_io_direct. The build is running on
overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
combination but has worked well for building everything over the last
decade. On Debian unstable:

PANIC: could not open file "pg_wal/000000010000000000000001": Invalid argument

... I have a new idea: perhaps it is possible to try
to open a file with O_DIRECT from perl, and if it fails like that,
skip the test. Looking into that now.

I think I have that working OK. Any Perl hackers want to comment on
my use of IO::File (copied from examples on the internet that showed
how to use O_DIRECT)? I am not much of a perl hacker but according to
my package manager, IO/File.pm came with perl itself. And the Fcntl
eval trick that I copied from File::stat, and the perl-critic
suppression that requires?

I think you can probably replace a lot of the magic here by simply saying

if (Fcntl->can("O_DIRECT")) ...

Fcntl->can() is true for all constants that Fcntl knows about, whether
or not they are defined for your OS. `defined &O_DIRECT` is the simplest
check, see my other reply to Thomas.

My understanding was that Fcntl only exported constants known to the OS.
That's certainly what its docco suggests, e.g.:

    By default your system's F_* and O_* constants (eg, F_DUPFD and O_CREAT)
    and the FD_CLOEXEC constant are exported into your namespace.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#64Dagfinn Ilmari Mannsåker
In reply to: Andrew Dunstan (#63)
Re: Direct I/O

Andrew Dunstan <andrew@dunslane.net> writes:

On 2023-04-12 We 10:23, Dagfinn Ilmari Mannsåker wrote:

Andrew Dunstan<andrew@dunslane.net> writes:

On 2023-04-12 We 01:48, Thomas Munro wrote:

On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro<thomas.munro@gmail.com> wrote:

On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg<myon@debian.org> wrote:

I'm hitting a panic in t_004_io_direct. The build is running on
overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
combination but has worked well for building everything over the last
decade. On Debian unstable:

PANIC: could not open file "pg_wal/000000010000000000000001": Invalid argument

... I have a new idea: perhaps it is possible to try
to open a file with O_DIRECT from perl, and if it fails like that,
skip the test. Looking into that now.

I think I have that working OK. Any Perl hackers want to comment on
my use of IO::File (copied from examples on the internet that showed
how to use O_DIRECT)? I am not much of a perl hacker but according to
my package manager, IO/File.pm came with perl itself. And the Fcntl
eval trick that I copied from File::stat, and the perl-critic
suppression that requires?

I think you can probably replace a lot of the magic here by simply saying

if (Fcntl->can("O_DIRECT")) ...

Fcntl->can() is true for all constants that Fcntl knows about, whether
or not they are defined for your OS. `defined &O_DIRECT` is the simplest
check, see my other reply to Thomas.

My understanding was that Fcntl only exported constants known to the
OS. That's certainly what its docco suggests, e.g.:

    By default your system's F_* and O_* constants (eg, F_DUPFD and O_CREAT)
    and the FD_CLOEXEC constant are exported into your namespace.

It's a bit more magical than that (this is Perl after all). They are
all exported (which implicitly creates stubs visible to `->can()`,
similarly to forward declarations like `sub O_FOO;`), but only the
defined ones (`#ifdef O_FOO` is true) are defined (`defined &O_FOO` is
true). The rest fall through to an AUTOLOAD¹ function that throws an
exception for undefined ones.

Here's an example (Fcntl knows O_RAW, but Linux does not define it):

$ perl -E '
use strict; use Fcntl;
say "can", main->can("O_RAW") ? "" : "not";
say defined &O_RAW ? "" : "not ", "defined";
say O_RAW;'
can
not defined
Your vendor has not defined Fcntl macro O_RAW, used at -e line 4

While O_DIRECT is defined:

$ perl -E '
use strict; use Fcntl;
say "can", main->can("O_DIRECT") ? "" : "not";
say defined &O_DIRECT ? "" : "not ", "defined";
say O_DIRECT;'
can
defined
16384

And O_FOO is unknown to Fcntl (the parens on `O_FOO()` are to make it
not a bareword, which would be a compile error under `use strict;` when
the sub doesn't exist at all):

$ perl -E '
use strict; use Fcntl;
say "can", main->can("O_FOO") ? "" : "not";
say defined &O_FOO ? "" : "not ", "defined";
say O_FOO();'
cannot
not defined
Undefined subroutine &main::O_FOO called at -e line 4.

cheers

andrew

- ilmari

[1]: https://perldoc.perl.org/perlsub#Autoloading

#65Andrew Dunstan
andrew@dunslane.net
In reply to: Dagfinn Ilmari Mannsåker (#64)
Re: Direct I/O

On 2023-04-12 We 12:38, Dagfinn Ilmari Mannsåker wrote:

Andrew Dunstan<andrew@dunslane.net> writes:

On 2023-04-12 We 10:23, Dagfinn Ilmari Mannsåker wrote:

Andrew Dunstan<andrew@dunslane.net> writes:

On 2023-04-12 We 01:48, Thomas Munro wrote:

On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro<thomas.munro@gmail.com> wrote:

On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg<myon@debian.org> wrote:

I'm hitting a panic in t_004_io_direct. The build is running on
overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
combination but has worked well for building everything over the last
decade. On Debian unstable:

PANIC: could not open file "pg_wal/000000010000000000000001": Invalid argument

... I have a new idea: perhaps it is possible to try
to open a file with O_DIRECT from perl, and if it fails like that,
skip the test. Looking into that now.

I think I have that working OK. Any Perl hackers want to comment on
my use of IO::File (copied from examples on the internet that showed
how to use O_DIRECT)? I am not much of a perl hacker but according to
my package manager, IO/File.pm came with perl itself. And the Fcntl
eval trick that I copied from File::stat, and the perl-critic
suppression that requires?

I think you can probably replace a lot of the magic here by simply saying

if (Fcntl->can("O_DIRECT")) ...

Fcntl->can() is true for all constants that Fcntl knows about, whether
or not they are defined for your OS. `defined &O_DIRECT` is the simplest
check, see my other reply to Thomas.

My understanding was that Fcntl only exported constants known to the
OS. That's certainly what its docco suggests, e.g.:

    By default your system's F_* and O_* constants (eg, F_DUPFD and O_CREAT)
    and the FD_CLOEXEC constant are exported into your namespace.

It's a bit more magical than that (this is Perl after all). They are
all exported (which implicitly creates stubs visible to `->can()`,
similarly to forward declarations like `sub O_FOO;`), but only the
defined ones (`#ifdef O_FOO` is true) are defined (`defined &O_FOO` is
true). The rest fall through to an AUTOLOAD¹ function that throws an
exception for undefined ones.

Here's an example (Fcntl knows O_RAW, but Linux does not define it):

$ perl -E '
use strict; use Fcntl;
say "can", main->can("O_RAW") ? "" : "not";
say defined &O_RAW ? "" : "not ", "defined";
say O_RAW;'
can
not defined
Your vendor has not defined Fcntl macro O_RAW, used at -e line 4

While O_DIRECT is defined:

$ perl -E '
use strict; use Fcntl;
say "can", main->can("O_DIRECT") ? "" : "not";
say defined &O_DIRECT ? "" : "not ", "defined";
say O_DIRECT;'
can
defined
16384

And O_FOO is unknown to Fcntl (the parens on `O_FOO()` are to make it
not a bareword, which would be a compile error under `use strict;` when
the sub doesn't exist at all):

$ perl -E '
use strict; use Fcntl;
say "can", main->can("O_FOO") ? "" : "not";
say defined &O_FOO ? "" : "not ", "defined";
say O_FOO();'
cannot
not defined
Undefined subroutine &main::O_FOO called at -e line 4.

*grumble* a bit too magical for my taste. Thanks for the correction.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#66Thomas Munro
thomas.munro@gmail.com
In reply to: Andrew Dunstan (#65)
Re: Direct I/O

Thanks both for looking, and thanks for the explanation Ilmari.
Pushed with your improvements. The 4 CI systems run the tests
(Windows and Mac by special always-expected-to-work case, Linux and
FreeBSD by successful pre-flight perl test of O_DIRECT), and I also
tested three unusual systems, two that skip for different reasons and
one that will henceforth run this test on the build farm so I wanted
to confirm locally first:

Linux/tmpfs: 1..0 # SKIP pre-flight test if we can open a file with O_DIRECT failed: Invalid argument
OpenBSD: t/004_io_direct.pl .............. skipped: no O_DIRECT
illumos: t/004_io_direct.pl .............. ok

(Format different because those last two are autoconf, no meson on my
collection of Vagrant images yet...)

#67Christoph Berg
myon@debian.org
In reply to: Thomas Munro (#66)
Re: Direct I/O

Re: Thomas Munro

Linux/tmpfs: 1..0 # SKIP pre-flight test if we can open a file with O_DIRECT failed: Invalid argument

I confirm it's working now:

t/004_io_direct.pl .............. skipped: pre-flight test if we can open a file with O_DIRECT failed: Invalid argument
All tests successful.

Thanks,
Christoph

#68Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#66)
Re: Direct I/O

Since the direct I/O commit went in, buildfarm animals
curculio and morepork have been issuing warnings like

hashpage.c: In function '_hash_expandtable':
hashpage.c:995: warning: ignoring alignment for stack allocated 'zerobuf'

in places where there's a local variable of type PGIOAlignedBlock
or PGAlignedXLogBlock. I'm not sure why only those two animals
are unhappy, but I think they have a point: typical ABIs don't
guarantee alignment of function stack frames to better than
16 bytes or so. In principle the compiler could support a 4K
alignment request anyway by doing the equivalent of alloca(3),
but I do not think we can count on that to happen.

regards, tom lane
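
The concern about over-aligned stack variables can be checked empirically. The sketch below is an illustration under stated assumptions, not PostgreSQL code: AlignedBlock is a stand-in for a PGIOAlignedBlock-style type, IO_ALIGN is assumed to be 4096 to match the 4K figure above, the 8192-byte payload is an arbitrary choice, and C11 alignas is used in place of whatever compiler-specific attribute the real headers rely on. It declares an over-aligned local variable and reports whether its address really is a multiple of the requested alignment.

```c
#include <stdalign.h>
#include <stdint.h>

#define IO_ALIGN 4096           /* assumed, matching the "4K" request above */

/* Illustrative stand-in for an I/O-aligned block type (hypothetical). */
typedef union AlignedBlock
{
    alignas(IO_ALIGN) char data[8192];
    double      force_align_d;
} AlignedBlock;

/* Returns 1 if ptr is a multiple of IO_ALIGN, else 0. */
int
is_io_aligned(const void *ptr)
{
    return ((uintptr_t) ptr % IO_ALIGN) == 0;
}

/*
 * Declares an over-aligned local, as the functions named in the warning do,
 * and reports whether the compiler honoured the alignment request.
 */
int
stack_block_is_aligned(void)
{
    AlignedBlock zerobuf;

    zerobuf.data[0] = 0;        /* touch the buffer so it is really allocated */
    return is_io_aligned(&zerobuf);
}
```

A compiler that realigns the stack frame for such locals should make stack_block_is_aligned() return 1. One that prints the "ignoring alignment" warning may return 0, and a buffer like that may be rejected by the kernel when used for O_DIRECT I/O.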

#69Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#68)
Re: Direct I/O

Hi,

On 2023-04-14 13:21:33 -0400, Tom Lane wrote:

Since the direct I/O commit went in, buildfarm animals
curculio and morepork have been issuing warnings like

hashpage.c: In function '_hash_expandtable':
hashpage.c:995: warning: ignoring alignment for stack allocated 'zerobuf'

in places where there's a local variable of type PGIOAlignedBlock
or PGAlignedXLogBlock. I'm not sure why only those two animals
are unhappy, but I think they have a point: typical ABIs don't
guarantee alignment of function stack frames to better than
16 bytes or so. In principle the compiler could support a 4K
alignment request anyway by doing the equivalent of alloca(3),
but I do not think we can count on that to happen.

Hm. New-ish compilers seem to be ok with it. Perhaps we should have a
configure check whether the compiler is OK with that, and disable direct IO
support if not?

Greetings,

Andres Freund

#70Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#69)
Re: Direct I/O

Andres Freund <andres@anarazel.de> writes:

On 2023-04-14 13:21:33 -0400, Tom Lane wrote:

... I'm not sure why only those two animals
are unhappy, but I think they have a point: typical ABIs don't
guarantee alignment of function stack frames to better than
16 bytes or so. In principle the compiler could support a 4K
alignment request anyway by doing the equivalent of alloca(3),
but I do not think we can count on that to happen.

Hm. New-ish compilers seem to be ok with it.

Oh! I was misled by the buildfarm label on morepork, which claims
it's running clang 10.0.1. But actually, per its configure report,
it's running

configure: using compiler=gcc (GCC) 4.2.1 20070719

which is the same as curculio. So that explains why nothing else is
complaining. I agree we needn't let 15-year-old compilers force us
into the mess that would be entailed by not treating these variables
as simple locals.

Perhaps we should have a
configure check whether the compiler is OK with that, and disable direct IO
support if not?

+1 for that, though. (Also, the fact that these animals aren't
actually failing suggests that 004_io_direct.pl needs expansion.)

regards, tom lane

#71Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#70)
Re: Direct I/O

Hi,

On 2023-04-14 15:21:18 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2023-04-14 13:21:33 -0400, Tom Lane wrote:

... I'm not sure why only those two animals
are unhappy, but I think they have a point: typical ABIs don't
guarantee alignment of function stack frames to better than
16 bytes or so. In principle the compiler could support a 4K
alignment request anyway by doing the equivalent of alloca(3),
but I do not think we can count on that to happen.

Hm. New-ish compilers seem to be ok with it.

Oh! I was misled by the buildfarm label on morepork, which claims
it's running clang 10.0.1. But actually, per its configure report,
it's running

configure: using compiler=gcc (GCC) 4.2.1 20070719

Huh. I wonder if that was an accident in the BF setup.

Perhaps we should have a
configure check whether the compiler is OK with that, and disable direct IO
support if not?

+1 for that, though. (Also, the fact that these animals aren't
actually failing suggests that 004_io_direct.pl needs expansion.)

It's skipped, due to lack of O_DIRECT:
[20:50:22] t/004_io_direct.pl .............. skipped: no O_DIRECT

So perhaps we don't even need a configure test, just a bit of ifdef'ery? It's
a bit annoying structurally, because the PG*Aligned structs are defined in
c.h, but the different ways of spelling O_DIRECT are dealt with in fd.h.

I wonder if we should try to move those structs to fd.h as well...
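
As a rough illustration of the "bit of ifdef'ery" idea, the sketch below requests the large alignment only when the platform defines O_DIRECT and the compiler understands the GCC-style attribute. All SKETCH_* names and the SketchIOAlignedBlock type are hypothetical stand-ins, not the real c.h or fd.h definitions, and testing only plain O_DIRECT is a simplification (the other spellings handled in fd.h are ignored here).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1			/* glibc only exposes O_DIRECT with this */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>

#define SKETCH_BLCKSZ	8192	/* hypothetical stand-in for BLCKSZ */
#define SKETCH_IO_ALIGN 4096	/* hypothetical stand-in for PG_IO_ALIGN_SIZE */

/* Ask for the big alignment only where direct I/O could be used at all. */
#if defined(O_DIRECT) && (defined(__GNUC__) || defined(__clang__))
#define SKETCH_IO_ALIGNED __attribute__((aligned(SKETCH_IO_ALIGN)))
#define SKETCH_HAVE_IO_ALIGN 1
#else
#define SKETCH_IO_ALIGNED		/* no O_DIRECT: plain locals, no warning */
#define SKETCH_HAVE_IO_ALIGN 0
#endif

typedef union SketchIOAlignedBlock
{
	SKETCH_IO_ALIGNED char data[SKETCH_BLCKSZ];
	double		force_align_d;	/* still guarantees ordinary alignment */
	long long	force_align_i64;
} SketchIOAlignedBlock;
```

On a platform without O_DIRECT the union degrades to an ordinarily aligned buffer, which is the behaviour that would silence the warning. It does nothing for a compiler that accepts the attribute but ignores it for stack objects.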

Greetings,

Andres Freund

#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#71)
Re: Direct I/O

Andres Freund <andres@anarazel.de> writes:

On 2023-04-14 15:21:18 -0400, Tom Lane wrote:

+1 for that, though. (Also, the fact that these animals aren't
actually failing suggests that 004_io_direct.pl needs expansion.)

It's skipped, due to lack of O_DIRECT:
[20:50:22] t/004_io_direct.pl .............. skipped: no O_DIRECT

Hmm, I'd say that might be just luck. Whether the compiler honors weird
alignment of locals seems independent of whether the OS has O_DIRECT.

So perhaps we don't even need a configure test, just a bit of ifdef'ery? It's
a bit annoying structurally, because the PG*Aligned structs are defined in
c.h, but the different ways of spelling O_DIRECT are dealt with in fd.h.

I wonder if we should try to move those structs to fd.h as well...

I doubt they belong in c.h, so that could be plausible; except
I'm not convinced that testing O_DIRECT is sufficient.

regards, tom lane

#73Mikael Kjellström
mikael.kjellstrom@gmail.com
In reply to: Andres Freund (#71)
Re: Direct I/O

On 2023-04-14 21:33, Andres Freund wrote:

Oh! I was misled by the buildfarm label on morepork, which claims
it's running clang 10.0.1. But actually, per its configure report,
it's running

configure: using compiler=gcc (GCC) 4.2.1 20070719

Huh. I wonder if that was an accident in the BF setup.

It might have been when I reinstalled it a while ago.

I have the following gcc and clang installed:

openbsd_6_9-pgbf$ gcc --version
gcc (GCC) 4.2.1 20070719
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

openbsd_6_9-pgbf$ clang --version
OpenBSD clang version 10.0.1
Target: amd64-unknown-openbsd6.9
Thread model: posix
InstalledDir: /usr/bin

want me to switch to clang instead?

/Mikael

#74Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#72)
Re: Direct I/O

On Sat, Apr 15, 2023 at 7:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

On 2023-04-14 15:21:18 -0400, Tom Lane wrote:

+1 for that, though. (Also, the fact that these animals aren't
actually failing suggests that 004_io_direct.pl needs expansion.)

It's skipped, due to lack of O_DIRECT:
[20:50:22] t/004_io_direct.pl .............. skipped: no O_DIRECT

Hmm, I'd say that might be just luck. Whether the compiler honors weird
alignment of locals seems independent of whether the OS has O_DIRECT.

So perhaps we don't even need a configure test, just a bit of ifdef'ery? It's
a bit annoying structurally, because the PG*Aligned structs are defined in
c.h, but the different ways of spelling O_DIRECT are dealt with in fd.h.

I wonder if we should try to move those structs to fd.h as well...

I doubt they belong in c.h, so that could be plausible; except
I'm not convinced that testing O_DIRECT is sufficient.

As far as I can tell, the failure to honour large alignment attributes
even though the compiler passes our configure check that you can do
that was considered to be approximately a bug[1] or at least a thing
to be improved in fairly old GCC times but the fix wasn't back-patched
that far. Unfortunately the projects that were allergic to the GPL3
change but wanted to ship a compiler (or some motivation related to
that) got stuck on 4.2 for a while before they flipped to Clang (as
OpenBSD has now done). It seems hard to get excited about doing
anything about that on our side, and those systems are also spewing
other warnings. But if we're going to do it, it looks like the right
place would indeed be a new compiler check that the attribute exists
*and* generates no warnings with alignment > 16, something like that?

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=16660

#75Thomas Munro
thomas.munro@gmail.com
In reply to: Mikael Kjellström (#73)
Re: Direct I/O

On Sat, Apr 15, 2023 at 7:50 AM Mikael Kjellström
<mikael.kjellstrom@gmail.com> wrote:

want me to switch to clang instead?

I vote +1, that's the system compiler in modern OpenBSD.

https://www.cambus.net/the-state-of-toolchains-in-openbsd/

As for curculio, I don't understand the motivation for maintaining
that machine. I'd rather know if OpenBSD 7.3 works.

#76Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#75)
Re: Direct I/O

Thomas Munro <thomas.munro@gmail.com> writes:

On Sat, Apr 15, 2023 at 7:50 AM Mikael Kjellström
<mikael.kjellstrom@gmail.com> wrote:

want me to switch to clang instead?

I vote +1, that's the system compiler in modern OpenBSD.

Ditto, we need coverage of that.

As for curculio, I don't understand the motivation for maintaining
that machine. I'd rather know if OpenBSD 7.3 works.

Those aren't necessarily mutually exclusive :-). But I do agree
that recent OpenBSD is more important to cover than ancient OpenBSD.

regards, tom lane

#77Mikael Kjellström
mikael.kjellstrom@mksoft.nu
In reply to: Tom Lane (#76)
Re: Direct I/O

On 2023-04-15 05:22, Tom Lane wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Sat, Apr 15, 2023 at 7:50 AM Mikael Kjellström
<mikael.kjellstrom@gmail.com> wrote:

want me to switch to clang instead?

I vote +1, that's the system compiler in modern OpenBSD.

Ditto, we need coverage of that.

OK. I switched to clang on morepork now.

As for curculio, I don't understand the motivation for maintaining
that machine. I'd rather know if OpenBSD 7.3 works.

Those aren't necessarily mutually exclusive :-). But I do agree
that recent OpenBSD is more important to cover than ancient OpenBSD.

So do you want me to switch that machine to OpenBSD 7.3 instead?

/Mikael

#78Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#74)
Re: Direct I/O

Thomas Munro <thomas.munro@gmail.com> writes:

As far as I can tell, the failure to honour large alignment attributes
even though the compiler passes our configure check that you can do
that was considered to be approximately a bug[1] or at least a thing
to be improved in fairly old GCC times but the fix wasn't back-patched
that far. Unfortunately the projects that were allergic to the GPL3
change but wanted to ship a compiler (or some motivation related to
that) got stuck on 4.2 for a while before they flipped to Clang (as
OpenBSD has now done). It seems hard to get excited about doing
anything about that on our side, and those systems are also spewing
other warnings. But if we're going to do it, it looks like the right
place would indeed be a new compiler check that the attribute exists
*and* generates no warnings with alignment > 16, something like that?

I tested this by building gcc 4.2.1 from source on modern Linux
(which was a bit more painful than it ought to be, perhaps)
and building PG with that. It generates no warnings, but nonetheless
gets an exception in CREATE DATABASE:

#2 0x0000000000b64522 in ExceptionalCondition (
conditionName=0xd4fca0 "(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)", fileName=0xd4fbe0 "md.c", lineNumber=468) at assert.c:66
#3 0x00000000009a6b53 in mdextend (reln=0x1dcaf68, forknum=MAIN_FORKNUM,
blocknum=18, buffer=0x7ffcaf8e1af0, skipFsync=true) at md.c:468
#4 0x00000000009a9075 in smgrextend (reln=0x1dcaf68, forknum=MAIN_FORKNUM,
blocknum=18, buffer=0x7ffcaf8e1af0, skipFsync=true) at smgr.c:500
#5 0x000000000096739c in RelationCopyStorageUsingBuffer (srclocator=...,
dstlocator=..., forkNum=MAIN_FORKNUM, permanent=true) at bufmgr.c:4286
#6 0x0000000000967584 in CreateAndCopyRelationData (src_rlocator=...,
dst_rlocator=..., permanent=true) at bufmgr.c:4361
#7 0x000000000068898e in CreateDatabaseUsingWalLog (src_dboid=1,
dst_dboid=24576, src_tsid=1663, dst_tsid=1663) at dbcommands.c:217
#8 0x000000000068b594 in createdb (pstate=0x1d4a6a8, stmt=0x1d20ec8)
at dbcommands.c:1441

Sure enough, that buffer is a stack local in
RelationCopyStorageUsingBuffer, and it's visibly got a
not-very-well-aligned address.

So apparently, the fact that you even get a warning about the
alignment not being honored is something OpenBSD patched in
after-the-fact; it's not there in genuine vintage gcc.

I get the impression that we are going to need an actual runtime
test if we want to defend against this. Not entirely convinced
it's worth the trouble. Who, other than our deliberately rear-guard
buildfarm animals, is going to be building modern PG with such old
compilers? (And more especially to the point, on platforms new
enough to have working O_DIRECT?)
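
For illustration only, a runtime test of the kind mentioned above might look roughly like the sketch below. The function names are hypothetical, the 4096-byte alignment and 8192-byte buffer are assumed values, and the attribute syntax assumes a GCC-compatible compiler. The pointer is read back through a volatile variable so that the compiler cannot simply assume the alignment it was asked for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if ptr is a multiple of align; align must be a power of two. */
static bool
sketch_is_aligned(const void *ptr, uintptr_t align)
{
	return ((uintptr_t) ptr & (align - 1)) == 0;
}

/*
 * Declare an over-aligned stack local and report whether the compiler
 * really honored the request.
 */
static bool
sketch_stack_alignment_works(void)
{
	union
	{
		__attribute__((aligned(4096))) char data[8192];
		double		force_align_d;
	}			buf;
	void	   *volatile p = buf.data;	/* defeat constant folding */

	buf.data[0] = 0;			/* keep the buffer alive */
	return sketch_is_aligned(p, 4096);
}
```

Applied to the address in the backtrace, sketch_is_aligned((void *) 0x7ffcaf8e1af0, 4096) is false, because the low twelve bits (0xaf0) are not zero.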

At this point I agree with Andres that it'd be good enough to
silence the warning by getting rid of these alignment pragmas
when the platform lacks O_DIRECT.

regards, tom lane

PS: I don't quite understand how it managed to get through initdb
when CREATE DATABASE doesn't work. Maybe there is a different
code path taken in standalone mode?

#79Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#78)
Re: Direct I/O

On Sun, Apr 16, 2023 at 6:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

So apparently, the fact that you even get a warning about the
alignment not being honored is something OpenBSD patched in
after-the-fact; it's not there in genuine vintage gcc.

Ah, that is an interesting discovery, and indeed kills the configure check idea.

At this point I agree with Andres that it'd be good enough to
silence the warning by getting rid of these alignment pragmas
when the platform lacks O_DIRECT.

Hmm. My preferred choice would be: accept Mikael's kind offer to
upgrade curculio to a live version, forget about GCC 4.2.1 forever,
and do nothing. It is a dead parrot.

But if we really want to do something about this, my next preferred
option would be to modify c.h's test to add more conditions, here:

/* GCC, Sunpro and XLC support aligned, packed and noreturn */
#if defined(__GNUC__) || defined(__SUNPRO_C) || defined(__IBMC__)
#define pg_attribute_aligned(a) __attribute__((aligned(a)))
...

Full GCC support including stack objects actually began in 4.6, it
seems. It might require a bit of research because the GCC-workalikes
including Clang also claim to be certain versions of GCC (for example
I think Clang 7 would be excluded if you excluded GCC 4.2, even though
this particular thing apparently worked fine in Clang 7). That's my
best idea, ie to actually model the feature history accurately, if we
are suspending disbelief and pretending that it is a reasonable
target.
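
A minimal sketch of what such a version test could look like follows. The SKETCH_* and sketch_* names are hypothetical, the 4.6 threshold is simply the one quoted above, and treating every Clang release as fine is an assumption that would need the research mentioned. Clang is tested first because it reports itself as GCC 4.2 through __GNUC__ and __GNUC_MINOR__. Other GCC workalikes are not handled.

```c
#include <assert.h>

/* Clang claims to be GCC 4.2, so it has to be recognised before GCC. */
#if defined(__clang__)
#define SKETCH_STACK_ALIGN_OK 1
#elif defined(__GNUC__) && \
	(__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6))
#define SKETCH_STACK_ALIGN_OK 1
#else
#define SKETCH_STACK_ALIGN_OK 0
#endif

/* Only offer the attribute where stack objects are believed to honor it. */
#if SKETCH_STACK_ALIGN_OK
#define sketch_attribute_aligned(a) __attribute__((aligned(a)))
#endif

/* Lets callers (or a test) ask at run time what was decided at build time. */
static int
sketch_stack_align_supported(void)
{
	return SKETCH_STACK_ALIGN_OK;
}
```

With a test like this, genuine GCC 4.2.1 would be left without an alignment macro at all, rather than with one that is silently ignored for stack objects.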

#80Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#79)
Re: Direct I/O

Thomas Munro <thomas.munro@gmail.com> writes:

Full GCC support including stack objects actually began in 4.6, it
seems.

Hmm. The oldest gcc versions remaining in the buildfarm seem to be

curculio | configure: using compiler=gcc (GCC) 4.2.1 20070719
frogfish | configure: using compiler=gcc (Debian 4.6.3-14) 4.6.3
lapwing | configure: using compiler=gcc (Debian 4.7.2-5) 4.7.2
skate | configure: using compiler=gcc-4.7 (Debian 4.7.2-5) 4.7.2
snapper | configure: using compiler=gcc-4.7 (Debian 4.7.2-5) 4.7.2
buri | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
chub | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
dhole | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
mantid | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
prion | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
rhinoceros | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
siskin | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
shelduck | configure: using compiler=gcc (SUSE Linux) 4.8.5
topminnow | configure: using compiler=gcc (Debian 4.9.2-10+deb8u1) 4.9.2

so curculio should be the only one that's at risk here.
Maybe just upgrading it is the right answer.

regards, tom lane

#81Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#78)
Re: Direct I/O

On Sat, Apr 15, 2023 at 02:19:35PM -0400, Tom Lane wrote:

PS: I don't quite understand how it managed to get through initdb
when CREATE DATABASE doesn't work. Maybe there is a different
code path taken in standalone mode?

ad43a413c4f7f5d024a5b2f51e00d280a22f1874
initdb: When running CREATE DATABASE, use STRATEGY = WAL_COPY.

#82Mikael Kjellström
mikael.kjellstrom@mksoft.nu
In reply to: Tom Lane (#80)
Re: Direct I/O

On 2023-04-16 00:10, Tom Lane wrote:

so curculio should be the only one that's at risk here.
Maybe just upgrading it is the right answer.

Just let me know if I should switch curculio to OpenBSD 7.3.

I already have a new server setup so only need to switch the "animal"
and "secret" and enable the cron job to get it running.

/Mikael

#83Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikael Kjellström (#82)
Re: Direct I/O

Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes:

On 2023-04-16 00:10, Tom Lane wrote:

so curculio should be the only one that's at risk here.
Maybe just upgrading it is the right answer.

Just let me know if I should switch curculio to OpenBSD 7.3.

Yes please.

I already have a new server setup so only need to switch the "animal"
and "secret" and enable the cron job to get it running.

Actually, as long as it's still OpenBSD I think you can keep using
the same animal name ... Andrew, what's the policy on that?

regards, tom lane

#84Mikael Kjellström
mikael.kjellstrom@mksoft.nu
In reply to: Tom Lane (#83)
Re: Direct I/O

On 2023-04-16 16:18, Tom Lane wrote:

Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes:

On 2023-04-16 00:10, Tom Lane wrote:

so curculio should be the only one that's at risk here.
Maybe just upgrading it is the right answer.

Just let me know if I should switch curculio to OpenBSD 7.3.

Yes please.

Ok.

I already have a new server setup so only need to switch the "animal"
and "secret" and enable the cron job to get it running.

Actually, as long as it's still OpenBSD I think you can keep using
the same animal name ... Andrew, what's the policy on that?

That is what I meant above.

I just use the same animal name and secret and then run
"update_personality.pl".

That should be enough I think?

/Mikael

#85Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#83)
Re: Direct I/O

On 2023-04-16 Su 10:18, Tom Lane wrote:

Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes:

On 2023-04-16 00:10, Tom Lane wrote:

so curculio should be the only one that's at risk here.
Maybe just upgrading it is the right answer.

Just let me know if I should switch curculio to OpenBSD 7.3.

Yes please.

I already have a new server setup so only need to switch the "animal"
and "secret" and enable the cron job to get it running.

Actually, as long as it's still OpenBSD I think you can keep using
the same animal name ... Andrew, what's the policy on that?

update_personality.pl lets you update the OS version / compiler version
/ owner-name / owner-email

I am in fact about to perform this exact operation for prion.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#86Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#85)
Re: Direct I/O

Andrew Dunstan <andrew@dunslane.net> writes:

On 2023-04-16 Su 10:18, Tom Lane wrote:

Actually, as long as it's still OpenBSD I think you can keep using
the same animal name ... Andrew, what's the policy on that?

update_personality.pl lets you update the OS version / compiler version
/ owner-name / owner-email

Oh wait ... this involves a switch from gcc in OpenBSD 5.9 to clang
in OpenBSD 7.3, doesn't it? That isn't something update_personality
will handle; you need a new animal if the compiler product is changing.

regards, tom lane

#87Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#86)
Re: Direct I/O

On Apr 16, 2023, at 12:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

On 2023-04-16 Su 10:18, Tom Lane wrote:
Actually, as long as it's still OpenBSD I think you can keep using
the same animal name ... Andrew, what's the policy on that?

update_personality.pl lets you update the OS version / compiler version
/ owner-name / owner-email

Oh wait ... this involves a switch from gcc in OpenBSD 5.9 to clang
in OpenBSD 7.3, doesn't it? That isn't something update_personality
will handle; you need a new animal if the compiler product is changing.

Correct.

Cheers

Andrew

#88Mikael Kjellström
mikael.kjellstrom@mksoft.nu
In reply to: Andrew Dunstan (#87)
Re: Direct I/O

On 2023-04-16 19:59, Andrew Dunstan wrote:

On Apr 16, 2023, at 12:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

On 2023-04-16 Su 10:18, Tom Lane wrote:
Actually, as long as it's still OpenBSD I think you can keep using
the same animal name ... Andrew, what's the policy on that?

update_personality.pl lets you update the OS version / compiler version
/ owner-name / owner-email

Oh wait ... this involves a switch from gcc in OpenBSD 5.9 to clang
in OpenBSD 7.3, doesn't it? That isn't something update_personality
will handle; you need a new animal if the compiler product is changing.

Correct.

OK. I registered a new animal for this then.

So if someone could look at that and give me an animal name + secret I
can set this up.

/Mikael

#89Michael Paquier
michael@paquier.xyz
In reply to: Mikael Kjellström (#84)
Re: Direct I/O

On Sun, Apr 16, 2023 at 04:51:04PM +0200, Mikael Kjellström wrote:

That is what I meant above.

I just use the same animal name and secret and then run
"update_personality.pl".

That should be enough I think?

Yes, that should be enough as far as I recall. This has been
mentioned a couple of weeks ago here:
/messages/by-id/CA+hUKGK0jJ+G+bxLUZqpBsxpvEg7Lvt1v8LBxFkZbrvtFTSghw@mail.gmail.com

I have also used setnotes.pl to reflect my animals' CFLAGS on the
website.
--
Michael

#90Mikael Kjellström
mikael.kjellstrom@mksoft.nu
In reply to: Mikael Kjellström (#88)
Re: Direct I/O

On 2023-04-16 20:05, Mikael Kjellström wrote:

Oh wait ... this involves a switch from gcc in OpenBSD 5.9 to clang
in OpenBSD 7.3, doesn't it?  That isn't something update_personality
will handle; you need a new animal if the compiler product is changing.

Correct.

OK. I registered a new animal for this then.

So if someone could look at that and give me an animal name + secret I
can set this up.

I have set up a new animal "schnauzer" (thanks Andrew!).

That should report in a little while.

/Mikael

#91Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#78)
Re: Direct I/O

On Sat, Apr 15, 2023 at 2:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I get the impression that we are going to need an actual runtime
test if we want to defend against this. Not entirely convinced
it's worth the trouble. Who, other than our deliberately rear-guard
buildfarm animals, is going to be building modern PG with such old
compilers? (And more especially to the point, on platforms new
enough to have working O_DIRECT?)

I don't think that I fully understand everything under discussion
here, but I would just like to throw in a vote for trying to make
failures as comprehensible as we reasonably can. It makes me a bit
nervous to rely on things like "anybody who has O_DIRECT will also
have working alignment pragmas," because there's no relationship
between those things other than when we think they got implemented on
the platforms that are popular today. If somebody ships me a brand new
Deathstation 9000 that has O_DIRECT but NOT alignment pragmas, how
badly are things going to break and how hard is it going to be for me
to understand why it's not working?

I understand that nobody (including me) wants the code cluttered with
a bunch of useless cruft that caters only to hypothetical systems, and
I don't want us to spend a lot of effort building untestable
infrastructure that caters only to such machines. I just don't want us
to do things that are more magical than they need to be. If and when
something fails, it's real nice if you can easily understand why it
failed.

--
Robert Haas
EDB: http://www.enterprisedb.com

#92Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#91)
Re: Direct I/O

Robert Haas <robertmhaas@gmail.com> writes:

On Sat, Apr 15, 2023 at 2:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I get the impression that we are going to need an actual runtime
test if we want to defend against this. Not entirely convinced
it's worth the trouble. Who, other than our deliberately rear-guard
buildfarm animals, is going to be building modern PG with such old
compilers? (And more especially to the point, on platforms new
enough to have working O_DIRECT?)

I don't think that I fully understand everything under discussion
here, but I would just like to throw in a vote for trying to make
failures as comprehensible as we reasonably can.

I'm not hugely concerned about this yet. I think the reason for
slipping this into v16 as developer-only code is exactly that we need
to get a feeling for where the portability dragons live. When (and
if) we try to make O_DIRECT mainstream, yes we'd better be sure that
any known failure cases are reported well. But we need the data
about which those are, first.

regards, tom lane

#93Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#92)
Re: Direct I/O

On Tue, Apr 18, 2023 at 4:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Sat, Apr 15, 2023 at 2:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I get the impression that we are going to need an actual runtime
test if we want to defend against this. Not entirely convinced
it's worth the trouble. Who, other than our deliberately rear-guard
buildfarm animals, is going to be building modern PG with such old
compilers? (And more especially to the point, on platforms new
enough to have working O_DIRECT?)

I don't think that I fully understand everything under discussion
here, but I would just like to throw in a vote for trying to make
failures as comprehensible as we reasonably can.

I'm not hugely concerned about this yet. I think the reason for
slipping this into v16 as developer-only code is exactly that we need
to get a feeling for where the portability dragons live. When (and
if) we try to make O_DIRECT mainstream, yes we'd better be sure that
any known failure cases are reported well. But we need the data
about which those are, first.

+1

A couple more things I wanted to note:

* We have no plans to turn this on by default even when the later
asynchronous machinery is proposed, and direct I/O starts to make more
economic sense (think: your stream of small reads and writes will be
converted to larger preadv/pwritev or moral equivalent and performed
ahead of time in the background). Reasons: (1) There will always be a
few file systems that refuse O_DIRECT (Linux tmpfs is one such, as we
learned in this thread; it fails with EINVAL at open() time), and (2)
without a page cache, you really need to size your shared_buffers
adequately and we can't do that automatically. It's something you'd
opt into for a dedicated database server along with other carefully
considered settings. It seems acceptable to me that if you set
io_direct to a non-default setting on an unusual-for-a-database-server
filesystem you might get errors screaming about inability to open
files -- you'll just have to turn it back off again if it doesn't work
for you.

* For the alignment part, C11 has "alignas(x)" in <stdalign.h>, so I
very much doubt that a hypothetical new Deathstation C compiler would
not know how to align stack objects arbitrarily, even though for now
as a C99 program we have to use the non-standard incantations defined
in our c.h. I assume we'll eventually switch to that. In the
meantime, if someone manages to build PostgreSQL on a hypothetical C
compiler that our c.h doesn't recognise, we just won't let you turn
the io_direct GUC on (because we set PG_O_DIRECT to 0 if we don't have
an alignment macro, see commit faeedbce's message for rationale). If
the alignment trick from c.h appears to be available but is actually
broken (GCC 4.2.1), then those assertions I added into smgrread() et
al will fail as Tom showed (yay! they did their job), or in a
non-assert build you'll probably get EINVAL when you try to read or
write from your badly aligned buffers depending on how picky your OS
is, but that's just an old bug in a defunct compiler that we have by
now written more about than they ever did in their bug tracker.

* I guess it's unlikely at this point that POSIX will ever standardise
O_DIRECT if they didn't already in the 90s (I didn't find any
discussion of it in their issue tracker). There is really only one OS
on our target list that truly can't do direct I/O at all: OpenBSD. It
seems a reasonable bet that if they or a hypothetical totally new
Unixoid system ever implemented it they'd spell it the same IRIX way
for practical reasons, but if not we just won't use it until someone
writes a patch *shrug*. There is also one system that's been rocking
direct I/O since the 90s for Oracle etc, but PostgreSQL still doesn't
know how to turn it on: Solaris has a directio() system call. I
posted a (trivial) patch for that once in the thread where I added
Apple F_NOCACHE, but there is probably nobody on this list who can
test it successfully (as Tom discovered, wrasse's host is not
configured right for it, you'd need an admin/root to help set up a UFS
file system, or perhaps modern (closed) ZFS can do it but that system
is old and unpatched), and I have no desire to commit a "blind" patch
for an untested niche setup; I really only considered it because I
realised I was so close to covering the complete set of OSes. That's
cool, we just won't let you turn the GUC on if we don't know how and
the error message is clear about that if you try.
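
To make the C11 point in the alignment bullet above concrete, here is a minimal sketch of the standard spelling applied to a stack buffer. It assumes a C11 compiler that supports 4096-byte extended alignment for automatic variables (the standard leaves support for extended alignments to the implementation); the function name and the sizes are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Returns 1 if a 4096-byte aligned stack buffer really is aligned. */
static int
sketch_alignas_stack_ok(void)
{
	alignas(4096) char buf[8192];	/* C11 spelling, no compiler extension */
	void	   *volatile p = buf;	/* defeat constant folding */

	buf[0] = 0;						/* keep the buffer alive */
	return ((uintptr_t) p % 4096) == 0;
}
```

This is the same check as the assertions mentioned above, just written against the standard keyword instead of the c.h macro.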

#94Greg Stark
stark@mit.edu
In reply to: Thomas Munro (#93)
Re: Direct I/O

On Mon, 17 Apr 2023 at 17:45, Thomas Munro <thomas.munro@gmail.com> wrote:

Reasons: (1) There will always be a
few file systems that refuse O_DIRECT (Linux tmpfs is one such, as we
learned in this thread; it fails with EINVAL at open() time), and

So why wouldn't we just automatically turn it off (globally or for
that tablespace) and keep operating without it afterward?
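
Purely as an illustration of the fallback being asked about, and not of anything PostgreSQL does, the sketch below tries O_DIRECT first and retries without it if open() reports EINVAL. The function name and the used_direct out-parameter are hypothetical. It covers only the open()-time failure described for tmpfs, not EINVAL returned later by reads or writes.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1			/* glibc only exposes O_DIRECT with this */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open path with O_DIRECT if possible, else fall back to buffered I/O.
 * *used_direct tells the caller which mode it ended up with.
 */
static int
sketch_open_maybe_direct(const char *path, int flags, mode_t mode,
						 bool *used_direct)
{
	*used_direct = false;
#ifdef O_DIRECT
	{
		int			fd = open(path, flags | O_DIRECT, mode);

		if (fd >= 0)
		{
			*used_direct = true;
			return fd;
		}
		if (errno != EINVAL)
			return -1;			/* a real error, not a refusal of O_DIRECT */
	}
#endif
	return open(path, flags, mode);	/* buffered fallback */
}
```

A caller would still have to decide what to do when used_direct comes back false, for example log it once per tablespace.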

(2) without a page cache, you really need to size your shared_buffers
adequately and we can't do that automatically.

Well.... I'm more optimistic... That may not always be impossible.
We've already added the ability to add more shared memory after
startup. We could implement the ability to add or remove shared buffer
segments after startup. And it wouldn't be crazy to imagine a kernel
interface that lets us judge whether the kernel memory pressure makes
it reasonable for us to take more shared buffers or makes it necessary
to release shared memory to the kernel. You could hack something
together using /proc/meminfo today but I imagine an interface intended
for this kind of thing would be better.
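
The /proc/meminfo hack could start from something like the sketch below, which extracts the Linux MemAvailable figure in kB. The function names are hypothetical, and what to do with the number (when to grow or shrink anything) is deliberately left out.

```c
#include <assert.h>
#include <stdio.h>

/* Parse one line such as "MemAvailable:  16281400 kB"; -1 if no match. */
static long
sketch_parse_mem_available_line(const char *line)
{
	long		kb;

	if (sscanf(line, "MemAvailable: %ld kB", &kb) == 1)
		return kb;
	return -1;
}

/* Scan a meminfo-style file for MemAvailable; -1 if absent or unreadable. */
static long
sketch_read_mem_available_kb(const char *path)
{
	FILE	   *f = fopen(path, "r");
	char		line[256];
	long		kb = -1;

	if (f == NULL)
		return -1;
	while (fgets(line, sizeof line, f) != NULL)
	{
		kb = sketch_parse_mem_available_line(line);
		if (kb >= 0)
			break;
	}
	fclose(f);
	return kb;
}
```

On Linux it would be called as sketch_read_mem_available_kb("/proc/meminfo"); polling a text file like this is exactly the kind of hack a purpose-built interface would improve on.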

It's something you'd
opt into for a dedicated database server along with other carefully
considered settings. It seems acceptable to me that if you set
io_direct to a non-default setting on an unusual-for-a-database-server
filesystem you might get errors screaming about inability to open
files -- you'll just have to turn it back off again if it doesn't work
for you.

If the only solution is to turn it off perhaps the server should just
turn it off? I guess the problem is that the shared_buffers might be
set assuming it would be on?

--
greg

#95Robert Haas
robertmhaas@gmail.com
In reply to: Greg Stark (#94)
Re: Direct I/O

On Tue, Apr 18, 2023 at 3:35 PM Greg Stark <stark@mit.edu> wrote:

Well.... I'm more optimistic... That may not always be impossible.
We've already added the ability to add more shared memory after
startup. We could implement the ability to add or remove shared buffer
segments after startup. And it wouldn't be crazy to imagine a kernel
interface that lets us judge whether the kernel memory pressure makes
it reasonable for us to take more shared buffers or makes it necessary
to release shared memory to the kernel.

On this point specifically, one fairly large problem that we have
currently is that our buffer replacement algorithm is terrible. In
workloads I've examined, either almost all buffers end up with a usage
count of 5 or almost all buffers end up with a usage count of 0 or 1.
Either way, we lose all or nearly all information about which buffers
are actually hot, and we are not especially unlikely to evict some
extremely hot buffer. This is quite bad for performance as it is, and
it would be a lot worse if recovering from a bad eviction decision
routinely required rereading from disk instead of only rereading from
the OS buffer cache.
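
The saturation effect described above can be seen in a toy clock sweep with a usage count capped at 5. This is a deliberately simplified sketch with hypothetical names; it leaves out pinning, locking and everything else the real buffer manager does.

```c
#include <assert.h>

#define SKETCH_NBUFFERS		4
#define SKETCH_MAX_USAGE	5	/* the cap of 5 mentioned above */

typedef struct SketchPool
{
	int			usage[SKETCH_NBUFFERS];	/* per-buffer usage count */
	int			hand;					/* next buffer the sweep inspects */
} SketchPool;

/* An access bumps the usage count, but never past the cap. */
static void
sketch_touch(SketchPool *pool, int buf)
{
	if (pool->usage[buf] < SKETCH_MAX_USAGE)
		pool->usage[buf]++;
}

/* Sweep: decrement nonzero counts, evict the first buffer found at zero. */
static int
sketch_evict(SketchPool *pool)
{
	for (;;)
	{
		int			buf = pool->hand;

		pool->hand = (pool->hand + 1) % SKETCH_NBUFFERS;
		if (pool->usage[buf] == 0)
			return buf;
		pool->usage[buf]--;
	}
}
```

A buffer touched a thousand times and one touched five times both sit at a count of 5, so the sweep treats them identically. That is the loss of information about which buffers are actually hot.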

I've sometimes wondered whether our current algorithm is just a more
expensive version of random eviction. I suspect that's a bit too
pessimistic, but I don't really know.

I'm not saying that it isn't possible to fix this. I bet it is, and I
hope someone does. I'm just making the point that even if we knew the
amount of kernel memory pressure and even if we also had the ability
to add and remove shared_buffers at will, it probably wouldn't help
much as things stand today, because we're not in a good position to
judge how large the cache would need to be in order to be useful, or
what we ought to be storing in it.

--
Robert Haas
EDB: http://www.enterprisedb.com

#96Joe Conway
mail@joeconway.com
In reply to: Robert Haas (#95)
Re: Direct I/O

On 4/19/23 10:11, Robert Haas wrote:

On Tue, Apr 18, 2023 at 3:35 PM Greg Stark <stark@mit.edu> wrote:

Well.... I'm more optimistic... That may not always be impossible.
We've already added the ability to add more shared memory after
startup. We could implement the ability to add or remove shared buffer
segments after startup. And it wouldn't be crazy to imagine a kernel
interface that lets us judge whether the kernel memory pressure makes
it reasonable for us to take more shared buffers or makes it necessary
to release shared memory to the kernel.

On this point specifically, one fairly large problem that we have
currently is that our buffer replacement algorithm is terrible. In
workloads I've examined, either almost all buffers end up with a usage
count of 5 or almost all buffers end up with a usage count of 0 or 1.
Either way, we lose all or nearly all information about which buffers
are actually hot, and we are not especially unlikely to evict some
extremely hot buffer.

That has been my experience as well, although admittedly I have not
looked in quite a while.

I'm not saying that it isn't possible to fix this. I bet it is, and I
hope someone does.

I keep looking at this blog post about Transparent Memory Offloading and
thinking that we could learn from it:

https://engineering.fb.com/2022/06/20/data-infrastructure/transparent-memory-offloading-more-memory-at-a-fraction-of-the-cost-and-power/

Unfortunately, it is very Linux specific and requires a really up to
date OS -- cgroup v2, kernel >= 5.19

I'm just making the point that even if we knew the amount of kernel
memory pressure and even if we also had the ability to add and remove
shared_buffers at will, it probably wouldn't help much as things
stand today, because we're not in a good position to judge how large
the cache would need to be in order to be useful, or what we ought to
be storing in it.

The tactic TMO uses is basically to tune the available memory to get a
target memory pressure. That seems like it could work.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#97Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#95)
Re: Direct I/O

Hi,

On 2023-04-19 10:11:32 -0400, Robert Haas wrote:

On this point specifically, one fairly large problem that we have
currently is that our buffer replacement algorithm is terrible. In
workloads I've examined, either almost all buffers end up with a usage
count of 5 or almost all buffers end up with a usage count of 0 or 1.
Either way, we lose all or nearly all information about which buffers
are actually hot, and we are not especially unlikely to evict some
extremely hot buffer. This is quite bad for performance as it is, and
it would be a lot worse if recovering from a bad eviction decision
routinely required rereading from disk instead of only rereading from
the OS buffer cache.

Interestingly, I haven't seen that as much in more recent benchmarks as I
used to. Partially I think because common s_b settings have gotten bigger, I
guess. But I wonder if we also accidentally improved something else, e.g. by
pin/unpin-ing the same buffer a bit less frequently.

I've sometimes wondered whether our current algorithm is just a more
expensive version of random eviction. I suspect that's a bit too
pessimistic, but I don't really know.

I am quite certain that it's better than that. If you e.g. have pkey lookup
workload >> RAM you can actually end up seeing inner index pages staying
reliably in s_b. But clearly we can do better.

I do think we likely should (as IIRC Peter Geoghegan suggested) provide more
information to the buffer replacement layer:

Independent of the concrete buffer replacement algorithm, the recency
information we do provide is somewhat lacking. In some places we do repeated
ReadBuffer() calls for the same buffer, leading to over-inflating usagecount.

We should seriously consider taking the cost of the IO into account, basically
making it more likely that s_b is increased when we need to synchronously wait
for IO. The cost of a miss is much lower for e.g. a sequential scan or a
bitmap heap scan, because both can do some form of prefetching. Whereas index
pages and the heap fetch for plain index scans aren't prefetchable (which
could be improved some, but not generally).

I'm not saying that it isn't possible to fix this. I bet it is, and I
hope someone does. I'm just making the point that even if we knew the
amount of kernel memory pressure and even if we also had the ability
to add and remove shared_buffers at will, it probably wouldn't help
much as things stand today, because we're not in a good position to
judge how large the cache would need to be in order to be useful, or
what we ought to be storing in it.

FWIW, my experience is that Linux's page replacement doesn't work very well
either. Partially because we "hide" a lot of the recency information from
it. But also just because it doesn't scale all that well to large amounts of
memory (there's ongoing work on that though). So I am not really convinced by
this argument - for plenty of workloads just caching in PG will be far better
than caching both in the kernel and in PG, as long as some adaptiveness to
memory pressure avoids running into OOMs.

Some forms of adaptive s_b sizing aren't particularly hard, I think. Instead
of actually changing the s_b shmem allocation - which would be very hard in a
process based model - we can tell the kernel that some parts of that memory
aren't currently in use with madvise(MADV_REMOVE). It's not quite as trivial
as it sounds, because we'd have to free in multiples of huge_page_size.
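
To make the huge-page constraint concrete, here is a minimal C sketch of the
rounding involved. It is not taken from any patch in this thread: the function
names are hypothetical, it assumes Linux (MADV_REMOVE is Linux-specific), and
whether the kernel accepts the call depends on how the memory was mapped, which
this sketch does not check. Only the whole huge pages inside the requested
range are handed back; the partial pages at either end stay in use.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Shrink [off, off + len) to the largest sub-range whose start and end are
 * both multiples of huge_page_size.  Returns false if no whole huge page
 * fits inside the requested range.  (Hypothetical helper.)
 */
bool
sketch_align_range(size_t off, size_t len, size_t huge_page_size,
                   size_t *out_off, size_t *out_len)
{
    assert(huge_page_size > 0);

    size_t start = (off + huge_page_size - 1) / huge_page_size * huge_page_size;
    size_t end = (off + len) / huge_page_size * huge_page_size;

    if (end <= start)
        return false;

    *out_off = start;
    *out_len = end - start;
    return true;
}

/*
 * Tell the kernel that the aligned part of the range is currently unused.
 * Returns 0 on success or when there is nothing to release, -1 on failure.
 */
int
sketch_release_unused(void *base, size_t off, size_t len, size_t huge_page_size)
{
    size_t aligned_off;
    size_t aligned_len;

    if (!sketch_align_range(off, len, huge_page_size, &aligned_off, &aligned_len))
        return 0;

#ifdef MADV_REMOVE
    return madvise((char *) base + aligned_off, aligned_len, MADV_REMOVE);
#else
    (void) base;
    return -1;                  /* not available on this platform */
#endif
}
```

With 2MB huge pages, a request to release 5MB starting at offset 1MB would
only release the 4MB between offsets 2MB and 6MB.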

Greetings,

Andres Freund

#98Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#93)
Re: Direct I/O

Hi,

On 2023-04-18 09:44:10 +1200, Thomas Munro wrote:

* We have no plans to turn this on by default even when the later
asynchronous machinery is proposed, and direct I/O starts to make more
economic sense (think: your stream of small reads and writes will be
converted to larger preadv/pwritev or moral equivalent and performed
ahead of time in the background). Reasons: (1) There will always be a
few file systems that refuse O_DIRECT (Linux tmpfs is one such, as we
learned in this thread; it fails with EINVAL at open() time), and (2)
without a page cache, you really need to size your shared_buffers
adequately and we can't do that automatically. It's something you'd
opt into for a dedicated database server along with other carefully
considered settings. It seems acceptable to me that if you set
io_direct to a non-default setting on an unusual-for-a-database-server
filesystem you might get errors screaming about inability to open
files -- you'll just have to turn it back off again if it doesn't work
for you.

FWIW, *long* term I think it might make sense to turn DIO on automatically for a
small subset of operations, if supported. Examples:

1) Once we have the ability to "feed" walsenders from wal_buffers, instead of
going to disk, automatically using DIO for WAL might be beneficial. The
increase in IO concurrency and reduction in latency one can get is
substantial.

2) If we make base backups use s_b if pages are in s_b, and do locking via s_b
for non-existing pages, it might be worth automatically using DIO for the
reads of the non-resident data, to avoid swamping the kernel page cache
with data that won't be read again soon (and to utilize DMA etc).

3) When writing back dirty data that we don't expect to be dirtied again soon,
e.g. from vacuum ringbuffers or potentially even checkpoints, it could make
sense to use DIO, to avoid the kernel keeping such pages in the page cache.

But for the main s_b, I agree, I can't foresee us turning on DIO by
default. Unless somebody has tuned s_b at least some for the workload, that's
not going to go well. And even if somebody has, it's quite reasonable to use
the same host also for other programs (including other PG instances), in which
case it's likely desirable to be adaptive to the current load when deciding
what to cache - which the kernel is in the best position to do.

If the alignment trick from c.h appears to be available but is actually
broken (GCC 4.2.1), then those assertions I added into smgrread() et
al will fail as Tom showed (yay! they did their job), or in a
non-assert build you'll probably get EINVAL when you try to read or
write from your badly aligned buffers depending on how picky your OS
is, but that's just an old bug in a defunct compiler that we have by
now written more about than they ever did in their bug tracker.

Agreed. If we ever find such issues in a postmordial compiler, we'll just need
to beef up our configure test to detect that it doesn't actually fully support
specifying alignment.
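
For readers who have not followed the earlier part of the thread, the kind of
check being discussed can be sketched in a few lines of C. This is only an
illustration: the 4096-byte figure, the type and the function names are
assumptions made for the sketch, not the definitions in c.h or the assertions
in smgrread(). It declares a block with a requested alignment as a local
variable and verifies at run time that the compiler honoured the request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical I/O alignment requirement, chosen for this sketch only. */
#define SKETCH_IO_ALIGN 4096

/* A block whose storage asks the compiler for SKETCH_IO_ALIGN alignment. */
typedef struct SketchAlignedBlock
{
    _Alignas(SKETCH_IO_ALIGN) char data[8192];
} SketchAlignedBlock;

/* True if ptr is suitably aligned for direct I/O under the assumed rule. */
bool
sketch_is_io_aligned(const void *ptr)
{
    return ((uintptr_t) ptr % SKETCH_IO_ALIGN) == 0;
}

/*
 * Run-time probe: place an aligned block in a local variable next to an
 * odd-sized neighbour and report whether the requested alignment was
 * actually delivered.
 */
bool
sketch_stack_alignment_works(void)
{
    volatile char pad = 0;
    SketchAlignedBlock block;

    block.data[0] = pad;
    assert(sizeof(block) % SKETCH_IO_ALIGN == 0);
    return sketch_is_io_aligned(block.data);
}
```

A configure-time variant would compile and run something like
sketch_stack_alignment_works() and disable the alignment macro when it
returns false.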

Greetings,

Andres Freund

#99Andres Freund
andres@anarazel.de
In reply to: Greg Stark (#94)
Re: Direct I/O

Hi,

On 2023-04-18 15:35:09 -0400, Greg Stark wrote:

On Mon, 17 Apr 2023 at 17:45, Thomas Munro <thomas.munro@gmail.com> wrote:

It's something you'd
opt into for a dedicated database server along with other carefully
considered settings. It seems acceptable to me that if you set
io_direct to a non-default setting on an unusual-for-a-database-server
filesystem you might get errors screaming about inability to open
files -- you'll just have to turn it back off again if it doesn't work
for you.

If the only solution is to turn it off perhaps the server should just
turn it off? I guess the problem is that the shared_buffers might be
set assuming it would be on?

I am quite strongly opposed to that - silently (or with a log message, which
practically is the same as silently) disabling performance relevant options
like DIO is much more likely to cause problems, due to the drastically
different performance characteristics you get. I can see us making it
configurable to try using DIO though, but I am not convinced it's worth
bothering with that. But we'll see.

Greetings,

Andres Freund

#100Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#97)
Re: Direct I/O

On Wed, Apr 19, 2023 at 12:54 PM Andres Freund <andres@anarazel.de> wrote:

Interestingly, I haven't seen that as much in more recent benchmarks as I
used to. Partially I think because common s_b settings have gotten bigger, I
guess. But I wonder if we also accidentally improved something else, e.g. by
pin/unpin-ing the same buffer a bit less frequently.

I think the problem with the algorithm is pretty fundamental. The rate
of usage count increase is tied to how often we access buffers, and
the rate of usage count decrease is tied to buffer eviction. But a
given workload can have no eviction at all (in which case the usage
counts must converge to 5) or it can evict on every buffer access (in
which case the usage counts must mostly converge to 0, because we'll
be decreasing usage counts at least once per buffer and generally
more). ISTM that the only way that you can end up with a good mix of
usage counts is if you have a workload that falls into some kind of a
sweet spot where the rate of usage count bumps and the rate of usage
count de-bumps are close enough together that things don't skew all
the way to one end or the other. Bigger s_b might make that more
likely to happen in practice, but it seems like bad algorithm design
on a theoretical level. We should be comparing the frequency of access
of buffers to the frequency of access of other buffers, not to the
frequency of buffer eviction. Or to put the same thing another way,
the average value of the usage count shouldn't suddenly change from 5
to 1 when the server evicts 1 buffer.
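
The dynamics described above can be reproduced with a toy clock sweep. The
following C sketch is a simplification written for illustration, not the
actual bufmgr code: the names are hypothetical, there is no pinning or
locking, and the only details carried over from the discussion are the usage
count cap of 5 and the rule that the sweep decrements counts until it finds a
buffer at 0.

```c
#include <assert.h>
#include <stddef.h>

/* Cap on the usage count, matching the value of 5 cited in the thread. */
#define SKETCH_MAX_USAGE 5

typedef struct SketchPool
{
    int    *usage;              /* usage count per buffer */
    size_t  nbuffers;           /* number of buffers in the pool */
    size_t  hand;               /* current position of the clock hand */
} SketchPool;

/* A buffer access bumps the usage count, saturating at the cap. */
void
sketch_access(SketchPool *pool, size_t buf)
{
    assert(buf < pool->nbuffers);
    if (pool->usage[buf] < SKETCH_MAX_USAGE)
        pool->usage[buf]++;
}

/*
 * Clock sweep: advance the hand, decrementing every non-zero usage count it
 * passes, until a buffer with count 0 is found.  That buffer is the victim;
 * it is reused for a new page and restarts with a usage count of 1.
 */
size_t
sketch_evict(SketchPool *pool)
{
    assert(pool->nbuffers > 0);
    for (;;)
    {
        size_t cur = pool->hand;

        pool->hand = (pool->hand + 1) % pool->nbuffers;
        if (pool->usage[cur] == 0)
        {
            pool->usage[cur] = 1;
            return cur;
        }
        pool->usage[cur]--;
    }
}
```

In a four-buffer pool where every count has saturated at 5, a single call to
sketch_evict() has to sweep the whole pool five times before it finds a
victim, and it leaves the other three buffers at 0. That is the sudden
collapse of the average usage count described in the paragraph above.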

I do think we likely should (as IIRC Peter Geoghegan suggested) provide more
information to the buffer replacement layer:

Independent of the concrete buffer replacement algorithm, the recency
information we do provide is somewhat lacking. In some places we do repeated
ReadBuffer() calls for the same buffer, leading to over-inflating usagecount.

Yeah, that would be good to fix. I don't think it solves the whole
problem by itself, but it seems like a good step.

We should seriously consider taking the cost of the IO into account, basically
making it more likely that s_b is increased when we need to synchronously wait
for IO. The cost of a miss is much lower for e.g. a sequential scan or a
bitmap heap scan, because both can do some form of prefetching. Whereas index
pages and the heap fetch for plain index scans aren't prefetchable (which
could be improved some, but not generally).

I guess that's reasonable if we can pass the information around well
enough, but I still think the algorithm ought to get some fundamental
improvement first.

FWIW, my experience is that Linux's page replacement doesn't work very well
either. Partially because we "hide" a lot of the recency information from
it. But also just because it doesn't scale all that well to large amounts of
memory (there's ongoing work on that though). So I am not really convinced by
this argument - for plenty of workloads just caching in PG will be far better
than caching both in the kernel and in PG, as long as some adaptiveness to
memory pressure avoids running into OOMs.

Even if the Linux algorithm is bad, and it may well be, the Linux
cache is often a lot bigger than our cache. Which can cover a
multitude of problems.

--
Robert Haas
EDB: http://www.enterprisedb.com

#101Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#100)
Re: Direct I/O

Hi,

On 2023-04-19 13:16:54 -0400, Robert Haas wrote:

On Wed, Apr 19, 2023 at 12:54 PM Andres Freund <andres@anarazel.de> wrote:

Interestingly, I haven't seen that as much in more recent benchmarks as I
used to. Partially I think because common s_b settings have gotten bigger, I
guess. But I wonder if we also accidentally improved something else, e.g. by
pin/unpin-ing the same buffer a bit less frequently.

I think the problem with the algorithm is pretty fundamental. The rate
of usage count increase is tied to how often we access buffers, and
the rate of usage count decrease is tied to buffer eviction. But a
given workload can have no eviction at all (in which case the usage
counts must converge to 5) or it can evict on every buffer access (in
which case the usage counts must mostly converge to 0, because we'll
be decreasing usage counts at least once per buffer and generally
more).

I don't think the "evict on every buffer access" works quite that way - unless
you have a completely even access pattern, buffer access frequency will
increase usage count much more frequently on some buffers than others. And if
you have a completely even access pattern, it's hard to come up with a good
cache replacement algorithm...

ISTM that the only way that you can end up with a good mix of
usage counts is if you have a workload that falls into some kind of a
sweet spot where the rate of usage count bumps and the rate of usage
count de-bumps are close enough together that things don't skew all
the way to one end or the other. Bigger s_b might make that more
likely to happen in practice, but it seems like bad algorithm design
on a theoretical level. We should be comparing the frequency of access
of buffers to the frequency of access of other buffers, not to the
frequency of buffer eviction. Or to put the same thing another way,
the average value of the usage count shouldn't suddenly change from 5
to 1 when the server evicts 1 buffer.

I agree that there are fundamental issues with the algorithm. But practically
I think the effect of the over-saturation of s_b isn't as severe as one might
think:

If your miss rate is very low, the occasional bad victim buffer selection
won't matter that much. If the miss rate is a bit higher, the likelihood of
the usagecount being increased again after being decreased is higher if a
buffer is accessed frequently.

This is also why I think that larger s_b makes the issues less likely - with
larger s_b, it is more likely that frequently accessed buffers are accessed
again after the first of the 5 clock sweeps necessary to reduce the usage
count. Clearly, with a small-ish s_b and a high replacement rate, that's not
going to happen for sufficiently many buffers. But once you have a few GB of
s_b, multiple complete sweeps take a while.

Most, if not all, buffer replacement algorithms I have seen, don't deal well
with "small SB with a huge replacement rate". Most of the fancier algorithms
track recency information for buffers that have recently been evicted - but
you obviously can't track that to an unlimited degree; IIRC most papers
propose that the shadow map be roughly equal to the buffer pool size.

You IMO pretty much need a policy decision on a higher level to improve upon
that (e.g. by just deciding that some buffers are sticky, perhaps because they
were used first) - but it doesn't matter much, because the miss rate is high
enough that the total amount of reads is barely affected by the buffer
replacement decisions.

Greetings,

Andres Freund

#102Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#92)
Re: Direct I/O

On Mon, Apr 17, 2023 at 12:06:23PM -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Sat, Apr 15, 2023 at 2:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I get the impression that we are going to need an actual runtime
test if we want to defend against this. Not entirely convinced
it's worth the trouble. Who, other than our deliberately rear-guard
buildfarm animals, is going to be building modern PG with such old
compilers? (And more especially to the point, on platforms new
enough to have working O_DIRECT?)

I don't think that I fully understand everything under discussion
here, but I would just like to throw in a vote for trying to make
failures as comprehensible as we reasonably can.

I'm not hugely concerned about this yet. I think the reason for
slipping this into v16 as developer-only code is exactly that we need
to get a feeling for where the portability dragons live.

Speaking of the developer-only status, I find the io_direct name more enticing
than force_parallel_mode, which PostgreSQL renamed due to overuse from people
expecting non-developer benefits. Should this have a name starting with
debug_?

#103Thomas Munro
thomas.munro@gmail.com
In reply to: Noah Misch (#102)
Re: Direct I/O

On Sun, Apr 30, 2023 at 4:11 PM Noah Misch <noah@leadboat.com> wrote:

Speaking of the developer-only status, I find the io_direct name more enticing
than force_parallel_mode, which PostgreSQL renamed due to overuse from people
expecting non-developer benefits. Should this have a name starting with
debug_?

Hmm, yeah I think people coming from other databases would be tempted
by it. But, unlike the
please-jam-a-gather-node-on-top-of-the-plan-so-I-can-debug-the-parallel-executor
switch, I think of this thing more like an experimental feature that
is just waiting for more features to make it useful. What about a
warning message about that at startup if it's on?

#104Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#103)
1 attachment(s)
Re: Direct I/O

On Sun, Apr 30, 2023 at 6:35 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sun, Apr 30, 2023 at 4:11 PM Noah Misch <noah@leadboat.com> wrote:

Speaking of the developer-only status, I find the io_direct name more enticing
than force_parallel_mode, which PostgreSQL renamed due to overuse from people
expecting non-developer benefits. Should this have a name starting with
debug_?

Hmm, yeah I think people coming from other databases would be tempted
by it. But, unlike the
please-jam-a-gather-node-on-top-of-the-plan-so-I-can-debug-the-parallel-executor
switch, I think of this thing more like an experimental feature that
is just waiting for more features to make it useful. What about a
warning message about that at startup if it's on?

Something like this? Better words welcome.

$ ~/install//bin/postgres -D pgdata -c io_direct=data
2023-05-01 09:44:37.460 NZST [99675] LOG: starting PostgreSQL 16devel
on x86_64-unknown-freebsd13.2, compiled by FreeBSD clang version
14.0.5 (https://github.com/llvm/llvm-project.git
llvmorg-14.0.5-0-gc12386ae247c), 64-bit
2023-05-01 09:44:37.460 NZST [99675] LOG: listening on IPv6 address
"::1", port 5432
2023-05-01 09:44:37.460 NZST [99675] LOG: listening on IPv4 address
"127.0.0.1", port 5432
2023-05-01 09:44:37.461 NZST [99675] LOG: listening on Unix socket
"/tmp/.s.PGSQL.5432"
2023-05-01 09:44:37.463 NZST [99675] WARNING: io_direct is an
experimental setting for developer testing only
2023-05-01 09:44:37.463 NZST [99675] HINT: File I/O may be
inefficient or not work on some file systems.
2023-05-01 09:44:37.465 NZST [99678] LOG: database system was shut
down at 2023-05-01 09:43:51 NZST
2023-05-01 09:44:37.468 NZST [99675] LOG: database system is ready to
accept connections

Attachments:

0001-Log-a-warning-about-io_direct-at-startup-time.patch (text/x-patch; charset=US-ASCII)
From a9005129c939a8298c6668645588b2a8ef5064b6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 1 May 2023 09:46:35 +1200
Subject: [PATCH] Log a warning about io_direct at startup time.

We've documented io_direct as an experimental developer-only feature,
but that documentation might be hard to find.  Let's also display a
warning about that in the server log.

Later proposals will provide the infrastructure to use it efficiently,
but by releasing this switch earlier we can learn about direct I/O
quirks in systems in the wild.

Discussion: https://postgr.es/m/20230430041106.GA2268796%40rfd.leadboat.com

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 4c49393fc5..8f5c03fa46 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1432,6 +1432,12 @@ PostmasterMain(int argc, char *argv[])
 				 errhint("Set the LC_ALL environment variable to a valid locale.")));
 #endif
 
+	/* Temporary warning about experimental status of direct I/O support. */
+	if (io_direct_flags != 0)
+		ereport(WARNING,
+				(errmsg("io_direct is an experimental setting for developer testing only"),
+				 errhint("File I/O may be inefficient or not work on some file systems.")));
+
 	/*
 	 * Remember postmaster startup time
 	 */
-- 
2.40.1

#105Justin Pryzby
pryzby@telsasoft.com
In reply to: Thomas Munro (#103)
Re: Direct I/O

On Sun, Apr 30, 2023 at 06:35:30PM +1200, Thomas Munro wrote:

On Sun, Apr 30, 2023 at 4:11 PM Noah Misch <noah@leadboat.com> wrote:

Speaking of the developer-only status, I find the io_direct name more enticing
than force_parallel_mode, which PostgreSQL renamed due to overuse from people
expecting non-developer benefits. Should this have a name starting with
debug_?

Hmm, yeah I think people coming from other databases would be tempted
by it. But, unlike the
please-jam-a-gather-node-on-top-of-the-plan-so-I-can-debug-the-parallel-executor
switch, I think of this thing more like an experimental feature that
is just waiting for more features to make it useful. What about a
warning message about that at startup if it's on?

Such a warning wouldn't be particularly likely to be seen by someone who
already didn't read/understand the docs for the not-feature that they
turned on.

Since this is -currently- a developer-only feature, it seems reasonable
to rename the GUC to debug_direct_io, and (at such time as it's
considered to be helpful to users) later rename it to direct_io.
That avoids the issue that random advice to enable direct_io=x under
v17+ is applied by people running v16. +0.8 to do so.

Maybe in the future, it should be added to GUC_EXPLAIN, too ?

--
Justin

#106Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#105)
Re: Direct I/O

Justin Pryzby <pryzby@telsasoft.com> writes:

On Sun, Apr 30, 2023 at 06:35:30PM +1200, Thomas Munro wrote:

What about a
warning message about that at startup if it's on?

Such a warning wouldn't be particularly likely to be seen by someone who
already didn't read/understand the docs for the not-feature that they
turned on.

Yeah, I doubt that that would be helpful at all.

Since this is -currently- a developer-only feature, it seems reasonable
to rename the GUC to debug_direct_io, and (at such time as it's
considered to be helpful to users) later rename it to direct_io.

+1

regards, tom lane

#107Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#106)
1 attachment(s)
Re: Direct I/O

On Mon, May 1, 2023 at 12:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Justin Pryzby <pryzby@telsasoft.com> writes:

On Sun, Apr 30, 2023 at 06:35:30PM +1200, Thomas Munro wrote:

What about a
warning message about that at startup if it's on?

Such a warning wouldn't be particularly likely to be seen by someone who
already didn't read/understand the docs for the not-feature that they
turned on.

Yeah, I doubt that that would be helpful at all.

Fair enough.

Since this is -currently- a developer-only feature, it seems reasonable
to rename the GUC to debug_direct_io, and (at such time as it's
considered to be helpful to users) later rename it to direct_io.

+1

Yeah, the future cross-version confusion thing is compelling. OK,
here's a rename patch. I left all the variable names and related
symbols as they were, so it's just the GUC that gains the prefix. I
moved the documentation hunk up to be sorted alphabetically like
nearby entries, because that seemed to look nicer, even though the
list isn't globally sorted.

Attachments:

0001-Rename-io_direct-to-debug_io_direct.patch (text/x-patch; charset=US-ASCII)
From f95fc08c47c7d429950e8f22ce93c35271adfb37 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 1 May 2023 14:03:30 +1200
Subject: [PATCH] Rename io_direct to debug_io_direct.

Give the new GUC introduced by d4e71df6 a name that is clearly not
intended for mainstream use quite yet.

Future proposals would drop the prefix only after adding infrastructure
to make it efficient.  Having the switch in the tree sooner is good
because it might lead to new discoveries about the hazards awaiting us
on a wide range of systems, but that name was too enticing and could
lead to cross-version confusion in future, per complaints from Noah and
Justin.

Suggested-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/20230430041106.GA2268796%40rfd.leadboat.com

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b56f073a91..4f4b5b0b74 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11159,6 +11159,38 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-debug-io-direct" xreflabel="debug_io_direct">
+      <term><varname>debug_io_direct</varname> (<type>string</type>)
+      <indexterm>
+        <primary><varname>debug_io_direct</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Ask the kernel to minimize caching effects for relation data and WAL
+        files using <literal>O_DIRECT</literal> (most Unix-like systems),
+        <literal>F_NOCACHE</literal> (macOS) or
+        <literal>FILE_FLAG_NO_BUFFERING</literal> (Windows).
+       </para>
+       <para>
+        May be set to an empty string (the default) to disable use of direct
+        I/O, or a comma-separated list of operations that should use direct I/O.
+        The valid options are <literal>data</literal> for
+        main data files, <literal>wal</literal> for WAL files, and
+        <literal>wal_init</literal> for WAL files when being initially
+        allocated.
+       </para>
+       <para>
+        Some operating systems and file systems do not support direct I/O, so
+        non-default settings may be rejected at startup or cause errors.
+       </para>
+       <para>
+        Currently this feature reduces performance, and is intended for
+        developer testing only.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-debug-parallel-query" xreflabel="debug_parallel_query">
       <term><varname>debug_parallel_query</varname> (<type>enum</type>)
       <indexterm>
@@ -11220,38 +11252,6 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-io-direct" xreflabel="io_direct">
-      <term><varname>io_direct</varname> (<type>string</type>)
-      <indexterm>
-        <primary><varname>io_direct</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Ask the kernel to minimize caching effects for relation data and WAL
-        files using <literal>O_DIRECT</literal> (most Unix-like systems),
-        <literal>F_NOCACHE</literal> (macOS) or
-        <literal>FILE_FLAG_NO_BUFFERING</literal> (Windows).
-       </para>
-       <para>
-        May be set to an empty string (the default) to disable use of direct
-        I/O, or a comma-separated list of operations that should use direct I/O.
-        The valid options are <literal>data</literal> for
-        main data files, <literal>wal</literal> for WAL files, and
-        <literal>wal_init</literal> for WAL files when being initially
-        allocated.
-       </para>
-       <para>
-        Some operating systems and file systems do not support direct I/O, so
-        non-default settings may be rejected at startup or cause errors.
-       </para>
-       <para>
-        Currently this feature reduces performance, and is intended for
-        developer testing only.
-       </para>
-      </listitem>
-     </varlistentry>
-
      <varlistentry id="guc-post-auth-delay" xreflabel="post_auth_delay">
       <term><varname>post_auth_delay</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 2f42cebaf6..9f9cf7ce66 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -4568,7 +4568,7 @@ struct config_string ConfigureNamesString[] =
 	},
 
 	{
-		{"io_direct", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+		{"debug_io_direct", PGC_POSTMASTER, DEVELOPER_OPTIONS,
 			gettext_noop("Use direct I/O for file access."),
 			NULL,
 			GUC_LIST_INPUT | GUC_NOT_IN_SAMPLE
diff --git a/src/test/modules/test_misc/t/004_io_direct.pl b/src/test/modules/test_misc/t/004_io_direct.pl
index b8814bb640..dddcfb1aa9 100644
--- a/src/test/modules/test_misc/t/004_io_direct.pl
+++ b/src/test/modules/test_misc/t/004_io_direct.pl
@@ -40,7 +40,7 @@ my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
 $node->append_conf(
 	'postgresql.conf', qq{
-io_direct = 'data,wal,wal_init'
+debug_io_direct = 'data,wal,wal_init'
 shared_buffers = '256kB' # tiny to force I/O
 wal_level = replica # minimal runs out of shared_buffers when set so tiny
 });
-- 
2.39.2
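
As a reading aid for the documentation hunk in the patch above, here is a
minimal C sketch of how a value such as 'data,wal,wal_init' could be turned
into a set of flag bits. It is a stand-in, not the server's check hook: the
flag names and bit values are hypothetical, and unlike the real GUC list
machinery it does not tolerate whitespace or quoting. An empty string maps to
no flags, which corresponds to the documented default of direct I/O being
disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits, named for this sketch only. */
#define SKETCH_DIRECT_DATA      0x01
#define SKETCH_DIRECT_WAL       0x02
#define SKETCH_DIRECT_WAL_INIT  0x04

/*
 * Parse a comma-separated list of "data", "wal" and "wal_init" into *flags.
 * Returns false, leaving *flags untouched, on an unknown or empty item.
 */
bool
sketch_parse_io_direct(const char *value, int *flags)
{
    assert(value != NULL && flags != NULL);

    int         result = 0;
    const char *p = value;

    while (*p != '\0')
    {
        size_t n = strcspn(p, ",");     /* length of the current item */

        if (n == 4 && strncmp(p, "data", n) == 0)
            result |= SKETCH_DIRECT_DATA;
        else if (n == 3 && strncmp(p, "wal", n) == 0)
            result |= SKETCH_DIRECT_WAL;
        else if (n == 8 && strncmp(p, "wal_init", n) == 0)
            result |= SKETCH_DIRECT_WAL_INIT;
        else
            return false;

        p += n;
        if (*p == ',')
        {
            p++;
            if (*p == '\0')
                return false;           /* trailing comma: empty last item */
        }
    }

    *flags = result;
    return true;
}
```

Comparing the item length before calling strncmp() keeps "wal" from matching
as a prefix of "wal_init" and rejects items such as "walrus".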

#108Thomas Munro
thomas.munro@gmail.com
In reply to: Greg Stark (#94)
Re: Direct I/O

On Wed, Apr 19, 2023 at 7:35 AM Greg Stark <stark@mit.edu> wrote:

On Mon, 17 Apr 2023 at 17:45, Thomas Munro <thomas.munro@gmail.com> wrote:

(2) without a page cache, you really need to size your shared_buffers
adequately and we can't do that automatically.

Well.... I'm more optimistic... That may not always be impossible.
We've already added the ability to add more shared memory after
startup. We could implement the ability to add or remove shared buffer
segments after startup.

Yeah, there are examples of systems going back decades with multiple
buffer pools. In some you can add more space later, and in some you
can also configure pools with different block sizes (imagine if you
could set your extremely OLTP tables to use 4KB blocks for reduced
write amplification and then perhaps even also promise that your
storage doesn't need FPIs for that size because you know it's
perfectly safe™, and imagine if you could set some big write-only
history tables to use 32KB blocks because some compression scheme
works better, etc), and you might also want different cache
replacement algorithms in different pools. Complex and advanced stuff
no doubt and I'm not suggesting that's anywhere near a reasonable
thing for us to think about now (as a matter of fact in another thread
you can find me arguing for fully unifying our existing segregated
SLRU buffer pools with the one true buffer pool), but since we're
talking pie-in-the-sky ideas around the water cooler...

#109Noah Misch
noah@leadboat.com
In reply to: Thomas Munro (#107)
Re: Direct I/O

On Mon, May 01, 2023 at 02:47:57PM +1200, Thomas Munro wrote:

On Mon, May 1, 2023 at 12:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Justin Pryzby <pryzby@telsasoft.com> writes:

Since this is -currently- a developer-only feature, it seems reasonable
to rename the GUC to debug_direct_io, and (at such time as it's
considered to be helpful to users) later rename it to direct_io.

+1

Yeah, the future cross-version confusion thing is compelling. OK,
here's a rename patch.

This looks reasonable.

#110Thomas Munro
thomas.munro@gmail.com
In reply to: Noah Misch (#109)
Re: Direct I/O

On Mon, May 15, 2023 at 9:09 AM Noah Misch <noah@leadboat.com> wrote:

This looks reasonable.

Pushed with a small change: a couple of GUC_check_errdetail strings
needed s/io_direct/debug_io_direct/. Thanks.

#111Peter Eisentraut
peter@eisentraut.org
In reply to: Thomas Munro (#107)
1 attachment(s)
Re: Direct I/O

On 01.05.23 04:47, Thomas Munro wrote:

On Mon, May 1, 2023 at 12:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Justin Pryzby <pryzby@telsasoft.com> writes:

On Sun, Apr 30, 2023 at 06:35:30PM +1200, Thomas Munro wrote:

What about a
warning message about that at startup if it's on?

Such a warning wouldn't be particularly likely to be seen by someone who
already didn't read/understand the docs for the not-feature that they
turned on.

Yeah, I doubt that that would be helpful at all.

Fair enough.

Since this is -currently- a developer-only feature, it seems reasonable
to rename the GUC to debug_direct_io, and (at such time as it's
considered to be helpful to users) later rename it to direct_io.

+1

Yeah, the future cross-version confusion thing is compelling. OK,
here's a rename patch. I left all the variable names and related
symbols as they were, so it's just the GUC that gains the prefix. I
moved the documentation hunk up to be sorted alphabetically like
nearby entries, because that seemed to look nicer, even though the
list isn't globally sorted.

I suggest also renaming the hook functions (check and assign), as in
the attached patch. Mainly because utils/guc_hooks.h says to order the
functions by GUC variable name, which was already wrong under the old
name, and it would be pretty confusing to sort the functions by a
GUC name that doesn't match the function names.

Attachments:

0001-Rename-hook-functions-for-debug_io_direct-to-match-v.patch (text/plain; charset=UTF-8)
From b549b5972410f28a375325e09f022f703afc12ab Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Tue, 22 Aug 2023 14:12:45 +0200
Subject: [PATCH] Rename hook functions for debug_io_direct to match variable
 name

---
 src/backend/storage/file/fd.c       | 6 +++---
 src/backend/utils/misc/guc_tables.c | 6 +++---
 src/include/utils/guc_hooks.h       | 4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 3c2a2fbef7..16b3e8f905 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3886,7 +3886,7 @@ data_sync_elevel(int elevel)
 }
 
 bool
-check_io_direct(char **newval, void **extra, GucSource source)
+check_debug_io_direct(char **newval, void **extra, GucSource source)
 {
 	bool		result = true;
 	int			flags;
@@ -3960,7 +3960,7 @@ check_io_direct(char **newval, void **extra, GucSource source)
 	if (!result)
 		return result;
 
-	/* Save the flags in *extra, for use by assign_io_direct */
+	/* Save the flags in *extra, for use by assign_debug_io_direct */
 	*extra = guc_malloc(ERROR, sizeof(int));
 	*((int *) *extra) = flags;
 
@@ -3968,7 +3968,7 @@ check_io_direct(char **newval, void **extra, GucSource source)
 }
 
 extern void
-assign_io_direct(const char *newval, void *extra)
+assign_debug_io_direct(const char *newval, void *extra)
 {
 	int		   *flags = (int *) extra;
 
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 15b51f2c5b..45b93fe0f9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -563,7 +563,7 @@ static char *datestyle_string;
 static char *server_encoding_string;
 static char *server_version_string;
 static int	server_version_num;
-static char *io_direct_string;
+static char *debug_io_direct_string;
 
 #ifdef HAVE_SYSLOG
 #define	DEFAULT_SYSLOG_FACILITY LOG_LOCAL0
@@ -4544,9 +4544,9 @@ struct config_string ConfigureNamesString[] =
 			NULL,
 			GUC_LIST_INPUT | GUC_NOT_IN_SAMPLE
 		},
-		&io_direct_string,
+		&debug_io_direct_string,
 		"",
-		check_io_direct, assign_io_direct, NULL
+		check_debug_io_direct, assign_debug_io_direct, NULL
 	},
 
 	/* End-of-list marker */
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 2ecb9fc086..952293a1c3 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -49,6 +49,8 @@ extern bool check_cluster_name(char **newval, void **extra, GucSource source);
 extern const char *show_data_directory_mode(void);
 extern bool check_datestyle(char **newval, void **extra, GucSource source);
 extern void assign_datestyle(const char *newval, void *extra);
+extern bool check_debug_io_direct(char **newval, void **extra, GucSource source);
+extern void assign_debug_io_direct(const char *newval, void *extra);
 extern bool check_default_table_access_method(char **newval, void **extra,
 											  GucSource source);
 extern bool check_default_tablespace(char **newval, void **extra,
@@ -157,7 +159,5 @@ extern bool check_wal_consistency_checking(char **newval, void **extra,
 										   GucSource source);
 extern void assign_wal_consistency_checking(const char *newval, void *extra);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
-extern bool check_io_direct(char **newval, void **extra, GucSource source);
-extern void assign_io_direct(const char *newval, void *extra);
 
 #endif							/* GUC_HOOKS_H */
-- 
2.41.0

#112Peter Geoghegan
In reply to: Andres Freund (#101)
Re: Direct I/O

On Wed, Apr 19, 2023 at 10:43 AM Andres Freund <andres@anarazel.de> wrote:

I don't think the "evict on every buffer access" works quite that way - unless
you have a completely even access pattern, buffer access frequency will
increase usage count much more frequently on some buffers than others. And if
you have a completely even access pattern, it's hard to come up with a good
cache replacement algorithm...

My guess is that the most immediate problem in this area is the
problem of "correlated references" (to use the term from the LRU-K
paper). I gave an example of that here:

/messages/by-id/CAH2-Wzk7T9K3d9_NY+jEXp2qQGMYoP=gZMoR8q1Cv57SxAw1OA@mail.gmail.com

In other words, I think that the most immediate problem may in fact be
the tendency of usage_count to get incremented multiple times in
response to what is (for all intents and purposes) the same logical
page access. Even if it's not as important as I imagine, it still
seems likely that verifying that our input information isn't garbage
is the logical place to begin work in this general area. It's
difficult to be sure about that because it's so hard to look at just
one problem in isolation. I suspect that you were right to point out
that a larger shared_buffers tends to look quite different from a
smaller one. That same factor is going to complicate any
analysis of the specific problem that I've highlighted (to say nothing
of the way that contention complicates the picture).
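
To make the "correlated references" concern concrete, here is a minimal standalone sketch, not PostgreSQL code. It contrasts a naive policy that bumps a usage count on every touch with one that ignores touches arriving within a short window of the previous bump, in the spirit of the correlated reference period from the LRU-K paper. Every name here (ToyBuffer, TOY_MAX_USAGE_COUNT, TOY_CORRELATED_WINDOW, the toy_touch_* functions) and the logical "tick" clock are assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_USAGE_COUNT   5		/* hypothetical cap on the counter */
#define TOY_CORRELATED_WINDOW 10	/* hypothetical: ticks treated as one logical access */

typedef struct ToyBuffer
{
	uint32_t	usage_count;	/* popularity estimate used by eviction */
	uint64_t	last_bump_tick;	/* logical time of the last counted touch */
	int			ever_bumped;	/* 0 until the first counted touch */
} ToyBuffer;

/* Naive policy: every touch increments the counter (up to the cap). */
static void
toy_touch_naive(ToyBuffer *buf)
{
	assert(buf != NULL);
	if (buf->usage_count < TOY_MAX_USAGE_COUNT)
		buf->usage_count++;
}

/*
 * Correlation-aware policy: touches that arrive within
 * TOY_CORRELATED_WINDOW ticks of the last counted touch are assumed to
 * belong to the same logical access and do not increment the counter.
 */
static void
toy_touch_correlated(ToyBuffer *buf, uint64_t now)
{
	assert(buf != NULL);
	if (buf->ever_bumped &&
		now - buf->last_bump_tick < TOY_CORRELATED_WINDOW)
		return;					/* correlated reference: ignore */

	if (buf->usage_count < TOY_MAX_USAGE_COUNT)
		buf->usage_count++;
	buf->last_bump_tick = now;
	buf->ever_bumped = 1;
}
```

With four touches at ticks 0 through 3, the naive counter reaches 4 while the correlation-aware counter stays at 1; a later touch at tick 20 raises the latter to 2. How wide the window should be, and what clock to use, is exactly the kind of question that is hard to study in isolation.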

There is an interesting paper that compared the hit rates seen for
TPC-C with those seen for TPC-E on relatively modern hardware:

https://www.cs.cmu.edu/~chensm/papers/TPCE-sigmod-record10.pdf

It concludes that the buffer misses for each workload look rather
similar, past a certain point (past a certain buffer pool size): both
workloads have cache misses that seem totally random. The access
patterns may be very different, but that doesn't necessarily have any
visible effect on buffer misses. At least provided that you make
certain modest assumptions about buffer pool size, relative to working
set size.

The most sophisticated cache management algorithms (like ARC) work by
maintaining metadata about recently evicted buffers, which is used to
decide whether to favor recency over frequency. If you work backwards
then it follows that having cache misses that look completely random
is desirable, and perhaps even something to work towards. What you
really don't want is a situation where the same small minority of
pages keep getting ping-ponged into and out of the buffer pool,
without ever settling, even though the buffer cache is large enough
that that's possible in principle. That pathological profile is the
furthest possible thing from random.
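
As a rough illustration of how metadata about recently evicted buffers can steer the recency/frequency balance, here is a sketch of just the adaptation step from ARC, written from the published algorithm and not from any PostgreSQL code. ARC keeps "ghost" lists B1 (pages evicted from the recency list) and B2 (pages evicted from the frequency list) and a target size p for the recency list. The struct and function names below are hypothetical, and the lists themselves are reduced to their lengths.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyArcState
{
	size_t		capacity;	/* c: total cache size, in pages */
	size_t		target_t1;	/* p: target size of the recency list */
	size_t		ghost_b1;	/* ghost entries evicted from the recency list */
	size_t		ghost_b2;	/* ghost entries evicted from the frequency list */
} ToyArcState;

static size_t
toy_max(size_t a, size_t b)
{
	return a > b ? a : b;
}

/* A miss that hits ghost list B1: recency was evicted too eagerly, so grow p. */
static void
toy_arc_hit_in_b1(ToyArcState *s)
{
	size_t		delta;

	assert(s != NULL);
	if (s->ghost_b1 == 0)
		return;					/* cannot hit an empty ghost list */
	delta = toy_max(1, s->ghost_b2 / s->ghost_b1);
	s->target_t1 = (s->target_t1 + delta > s->capacity)
		? s->capacity : s->target_t1 + delta;
}

/* A miss that hits ghost list B2: frequency was evicted too eagerly, so shrink p. */
static void
toy_arc_hit_in_b2(ToyArcState *s)
{
	size_t		delta;

	assert(s != NULL);
	if (s->ghost_b2 == 0)
		return;					/* cannot hit an empty ghost list */
	delta = toy_max(1, s->ghost_b1 / s->ghost_b2);
	s->target_t1 = (delta > s->target_t1) ? 0 : s->target_t1 - delta;
}
```

If misses were truly random, ghost hits would be rare and p would barely move; a steady stream of ghost hits is the signal that the same pages are being ping-ponged in and out.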

With smaller shared_buffers, it's perhaps inevitable that buffer cache
misses are random, and so I'd expect that managing the problem of
contention will tend to matter most. With larger shared_buffers it
isn't inevitable at all, so the quality of the cache eviction scheme
is likely to matter quite a bit more.

--
Peter Geoghegan

#113Thomas Munro
thomas.munro@gmail.com
In reply to: Peter Eisentraut (#111)
Re: Direct I/O

On Wed, Aug 23, 2023 at 12:15 AM Peter Eisentraut <peter@eisentraut.org> wrote:

I suggest also renaming the hook functions (check and assign), as in
the attached patch. Mainly because utils/guc_hooks.h says to order the
functions by GUC variable name, which was already wrong under the old
name, and it would be pretty confusing to sort the functions by a
GUC name that doesn't match the function names.

OK. I'll push this tomorrow unless you do it while I'm asleep. Thanks!
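
For readers unfamiliar with the check/assign hook pair being renamed in this thread, the pattern visible in the fd.c hunks (validate the string, stash the parsed flags in *extra, then have the assign hook copy them out) can be sketched in standalone C roughly as follows. This is a simplified illustration and not the PostgreSQL implementation: the toy_ names, the flag values, the use of plain malloc, and the hand-rolled comma splitting (no whitespace handling) are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag bits, one per accepted list element. */
#define TOY_IO_DIRECT_DATA     0x01
#define TOY_IO_DIRECT_WAL      0x02
#define TOY_IO_DIRECT_WAL_INIT 0x04

static int	toy_io_direct_flags = 0;	/* the "live" setting */

/* Check hook: validate a list such as "data,wal,wal_init"; save flags in *extra. */
static bool
toy_check_debug_io_direct(const char *newval, void **extra)
{
	int			flags = 0;
	const char *p = newval;
	int		   *saved;

	assert(newval != NULL && extra != NULL);

	while (*p != '\0')
	{
		size_t		len = strcspn(p, ",");	/* length of the current element */

		if (len == 4 && strncmp(p, "data", 4) == 0)
			flags |= TOY_IO_DIRECT_DATA;
		else if (len == 3 && strncmp(p, "wal", 3) == 0)
			flags |= TOY_IO_DIRECT_WAL;
		else if (len == 8 && strncmp(p, "wal_init", 8) == 0)
			flags |= TOY_IO_DIRECT_WAL_INIT;
		else
			return false;		/* unknown or empty element: reject */

		p += len;
		if (*p == ',')
			p++;				/* skip the separator */
	}

	/* Save the flags in *extra, for use by the assign hook. */
	saved = malloc(sizeof(int));
	if (saved == NULL)
		return false;
	*saved = flags;
	*extra = saved;
	return true;
}

/* Assign hook: cannot fail, it only installs what the check hook computed. */
static void
toy_assign_debug_io_direct(void *extra)
{
	toy_io_direct_flags = *(int *) extra;
}
```

The split matters because validation may reject a value, while assignment must not fail; doing the parsing once in the check hook and handing the result over through extra avoids parsing twice.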