Direct I/O
Hi,
Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the
AIO patch-set[1]. It adds three new settings, defaulting to off:
io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation
O_DIRECT asks the kernel to avoid caching file data as much as
possible. Here's a fun quote about it[2]:
"The exact semantics of Direct I/O (O_DIRECT) are not well specified.
It is not a part of POSIX, or SUS, or any other formal standards
specification. The exact meaning of O_DIRECT has historically been
negotiated in non-public discussions between powerful enterprise
database companies and proprietary Unix systems, and its behaviour has
generally been passed down as oral lore rather than as a formal set of
requirements and specifications."
It gives the kernel the opportunity to move data directly between
PostgreSQL's user space buffers and the storage hardware using DMA
hardware, that is, without CPU involvement or copying. Not all
storage stacks can do that, for various reasons, but even if not, the
caching policy should ideally still use temporary buffers and avoid
polluting the page cache.
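
To make that concrete, here is a minimal sketch (not code from the
patches) of reading one block with O_DIRECT on a POSIX-like system.
The names sketch_alloc_io_buffer, sketch_read_block and
SKETCH_IO_ALIGN are made up for illustration, and the fallback to
buffered I/O on EINVAL is only an assumption about how a caller might
cope with a filesystem that rejects O_DIRECT:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc exposes O_DIRECT only with this */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* no such flag here: plain buffered I/O */
#endif

#define SKETCH_IO_ALIGN 4096    /* assumed buffer address alignment */

/* Allocate a buffer whose address is suitable for direct I/O. */
static void *
sketch_alloc_io_buffer(size_t len)
{
    void *p = NULL;

    if (posix_memalign(&p, SKETCH_IO_ALIGN, len) != 0)
        return NULL;
    return p;
}

/*
 * Read len bytes at offset from path, asking for direct I/O if requested.
 * Falls back to buffered I/O if open() rejects O_DIRECT with EINVAL.
 * Returns the number of bytes read, or -1 on error.
 */
static ssize_t
sketch_read_block(const char *path, off_t offset, void *buf, size_t len,
                  int want_direct)
{
    int     flags = O_RDONLY;
    int     fd;
    ssize_t nread;

    if (want_direct)
        flags |= O_DIRECT;

    fd = open(path, flags);
    if (fd < 0 && want_direct && errno == EINVAL)
        fd = open(path, O_RDONLY);  /* filesystem without O_DIRECT */
    if (fd < 0)
        return -1;

    /* Direct I/O expects an aligned buffer address. */
    assert(((uintptr_t) buf % SKETCH_IO_ALIGN) == 0);

    nread = pread(fd, buf, len, offset);
    close(fd);
    return nread;
}
```

A caller would allocate with sketch_alloc_io_buffer(8192) and then call
sketch_read_block(path, 0, buf, 8192, 1); each such call is one full
round trip to storage when direct I/O is in effect.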
These settings currently destroy performance, and are not intended to
be used by end-users, yet! That's why we filed them under
DEVELOPER_OPTIONS. You don't get automatic read-ahead, concurrency,
clustering or (of course) buffering from the kernel. The idea is that
later parts of the AIO patch-set will introduce mechanisms to replace
what the kernel is doing for us today, and then go further: we ought
to be better than the kernel at predicting our own future I/O, so we
should finish up ahead. Even with all that, you wouldn't want to turn
it on by default because the default shared_buffers would be
insufficient for any real system, and there are portability problems.
Examples of slowness:
* every 8KB sequential read or write becomes a full round trip to the
storage, one at a time
* data that is written to WAL and then read back in by WAL sender
incurs a full I/O round trip (that's probably not really an AIO problem;
it's something we should probably address by using shared memory
instead of files, as noted in a TODO item in the source code)
Memory alignment patches:
Direct I/O generally needs to be done to/from VM page-aligned
addresses, but only "standard" 4KB pages, even when larger VM pages
are in use (if there is an exotic system where that isn't true, it
won't work). We need to deal with buffers on the stack, the heap and
in shmem. For the stack, see patch 0001. For the heap and shared
memory, see patch 0002, but David Rowley is going to propose that part
separately, as MemoryContext API adjustments are a specialised enough
topic to deserve another thread; here I include a copy as a
dependency. The main direct I/O patch is 0003.
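
As a rough illustration of the two cases (a sketch in plain C11, not
the code from patches 0001 and 0002), a stack buffer can be aligned by
the compiler, while a heap buffer can be over-allocated and rounded
up, keeping a pointer back to the original allocation so it can be
freed. All names prefixed with sketch_ or SKETCH_ are hypothetical;
patch 0002 uses a MemoryChunk header rather than a raw pointer for the
redirect:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_IO_ALIGN 4096    /* assumed I/O alignment */
#define SKETCH_BLCKSZ   8192    /* assumed block size */

/* Stack or static buffers: let the compiler align the union (C11). */
typedef union SketchAlignedBlock
{
    _Alignas(SKETCH_IO_ALIGN) char data[SKETCH_BLCKSZ];
    double      force_align_d;
    int64_t     force_align_i64;
} SketchAlignedBlock;

/* Round a pointer up to the next multiple of align (a power of two). */
static void *
sketch_align_up(void *p, size_t align)
{
    uintptr_t v = (uintptr_t) p;

    return (void *) ((v + align - 1) & ~((uintptr_t) align - 1));
}

/*
 * Heap buffers: over-allocate, round up, and store the original pointer
 * just below the aligned address so that it can be found again on free.
 * align must be a power of two and at least sizeof(void *).
 */
static void *
sketch_alloc_aligned(size_t size, size_t align)
{
    void   *raw = malloc(size + align + sizeof(void *));
    void  **aligned;

    if (raw == NULL)
        return NULL;
    aligned = sketch_align_up((char *) raw + sizeof(void *), align);
    aligned[-1] = raw;
    return aligned;
}

/* Free a pointer returned by sketch_alloc_aligned(). */
static void
sketch_free_aligned(void *p)
{
    if (p != NULL)
        free(((void **) p)[-1]);
}
```

The cost of the heap variant is up to align + sizeof(void *) bytes of
padding per allocation, which is why it only makes sense for buffers
that will actually be handed to the I/O layer.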
Assorted portability notes:
I expect this to "work" (that is, successfully destroy performance) on
typical developer systems running at least Linux, macOS, Windows and
FreeBSD. By work, I mean: not be rejected by PostgreSQL, not be
rejected by the kernel, and influence kernel cache behaviour on common
filesystems. It might be rejected with ENOTSUP, EINVAL etc. on some
more exotic filesystems and OSes. Of currently supported OSes, only
OpenBSD and Solaris don't have O_DIRECT at all, and we'll reject the
GUCs. For macOS and Windows we internally translate our own
PG_O_DIRECT flag to the correct flags/calls (committed a while
back[3]).
On Windows, scatter/gather is available only with direct I/O, so a
true pwritev would in theory be possible, but that has some more
complications and is left for later patches (probably using native
interfaces, not disguised as POSIX).
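
The kind of per-platform translation meant here can be sketched as
follows. This is an illustration only, not the committed PostgreSQL
code: sketch_open_direct is a hypothetical helper, it takes a separate
"direct" argument instead of a private flag bit, and it leaves out
Windows (where FILE_FLAG_NO_BUFFERING would be passed to the native
open call):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc exposes O_DIRECT only with this */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open a file, optionally asking for uncached I/O using whatever this
 * platform offers.  Returns a file descriptor, or -1 with errno set.
 */
static int
sketch_open_direct(const char *path, int flags, mode_t mode, int direct)
{
    int fd;

    assert(path != NULL);

#if defined(O_DIRECT)
    if (direct)
        flags |= O_DIRECT;      /* most Unix-like systems */
#elif !defined(F_NOCACHE)
    if (direct)
    {
        errno = ENOTSUP;        /* no known mechanism: refuse */
        return -1;
    }
#endif

    fd = open(path, flags, mode);
    if (fd < 0)
        return -1;

#if !defined(O_DIRECT) && defined(F_NOCACHE)
    /* macOS: caching is switched off per descriptor after open() */
    if (direct && fcntl(fd, F_NOCACHE, 1) < 0)
    {
        int save_errno = errno;

        close(fd);
        errno = save_errno;
        return -1;
    }
#endif

    return fd;
}
```

On a filesystem that rejects the request, the open() call itself
typically fails (for example with EINVAL), so callers see the problem
immediately rather than at the first read or write.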
There may be systems on which 8KB offset alignment will not work at
all or not work well, and that's expected. For example, BTRFS, ZFS,
JFS "big file", UFS etc allow larger-than-8KB blocks/records, and an
8KB write will have to trigger a read-before-write. Note that
offset/length alignment requirements (blocks) are independent of
buffer alignment requirements (memory pages, 4KB).
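
To spell out the two independent requirements, here is a small sketch
with hypothetical helper names. The first check is about the buffer's
memory address; the second is about whether a transfer covers whole
filesystem blocks, which decides whether a read-before-write is
needed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MEM_ALIGN 4096   /* assumed buffer alignment (memory pages) */

/* Buffer requirement: the address must be page-aligned. */
static bool
sketch_buffer_is_aligned(const void *buf)
{
    return ((uintptr_t) buf % SKETCH_MEM_ALIGN) == 0;
}

/*
 * File requirement: offset and length should both be multiples of the
 * filesystem's block/record size.  If not, a write only partly covers
 * some block, and the filesystem has to read that block first.
 */
static bool
sketch_range_covers_whole_blocks(uint64_t offset, size_t len,
                                 size_t fs_block_size)
{
    assert(fs_block_size > 0);
    return (offset % fs_block_size) == 0 && (len % fs_block_size) == 0;
}
```

For example, an 8KB write at offset 8192 covers whole blocks when the
filesystem block size is 8192, but not when the record size is 16384,
even though the buffer address is equally well aligned in both cases.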
The behaviour and cache coherency of files that have open descriptors
using both direct and non-direct flags may be complicated and vary
between systems. The patch currently lets you change the GUCs at
runtime so backends can disagree: that should probably not be allowed,
but is like that now for experimentation. More study is required.
If someone has a compiler for which we don't know how to implement
pg_attribute_aligned(), then we can't make correctly aligned stack
buffers, so in that case direct I/O is disabled, but I don't know of
such a system (maybe aCC, but we dropped it). That's why smgr code
can only assert that pointers are IO-aligned if PG_O_DIRECT != 0, and
why PG_O_DIRECT is forced to 0 if there is no pg_attribute_aligned()
macro, disabling the GUCs.
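
The compile-time logic described in this paragraph looks roughly like
the following sketch. The sketch_ and SKETCH_ names are stand-ins for
pg_attribute_aligned(), PG_O_DIRECT and the GUC check hook, and the
real definitions in the tree differ in detail (for example, macOS and
Windows get a private flag value rather than O_DIRECT itself):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc exposes O_DIRECT only with this */
#endif
#include <assert.h>
#include <fcntl.h>

/* Alignment attribute: only defined for compilers we know about. */
#if defined(__GNUC__) || defined(__clang__)
#define sketch_attribute_aligned(a) __attribute__((aligned(a)))
#elif defined(_MSC_VER)
#define sketch_attribute_aligned(a) __declspec(align(a))
#endif

/*
 * Direct I/O flag: forced to 0 if aligned stack buffers can't be
 * declared, or if the platform has no O_DIRECT.
 */
#if defined(sketch_attribute_aligned) && defined(O_DIRECT)
#define SKETCH_O_DIRECT O_DIRECT
#else
#define SKETCH_O_DIRECT 0
#endif

/*
 * Setting check: returns 1 if the new value is acceptable, 0 if not.
 * Turning direct I/O on is refused where SKETCH_O_DIRECT is 0.
 */
static int
sketch_check_direct_setting(int newval)
{
    assert(newval == 0 || newval == 1);
#if SKETCH_O_DIRECT == 0
    if (newval)
        return 0;
#endif
    return 1;
}
```

Because the flag is a compile-time constant, alignment assertions in
the I/O paths can be wrapped in "#if SKETCH_O_DIRECT != 0" and simply
vanish on builds where direct I/O can never be enabled.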
This seems to be an independent enough piece to get into the tree on
its own, with the proviso that it's not actually useful yet other than
for experimentation. Thoughts?
These patches have been hacked on at various times by Andres Freund,
David Rowley and me.
[1]: https://wiki.postgresql.org/wiki/AIO
[2]: https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics
[3]: /messages/by-id/CA+hUKG+ADiyyHe0cun2wfT+SVnFVqNYPxoO6J9zcZkVO7+NGig@mail.gmail.com
Attachments:
0001-Align-PGAlignedBlock-to-expected-page-size.patch (text/x-patch; charset=US-ASCII)
From 87a0c14600506d2a33a5a6bedc6e58d70ff7acc7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 24 Jun 2020 16:35:49 -0700
Subject: [PATCH 1/3] Align PGAlignedBlock to expected page size.
In order to be allowed to use O_DIRECT, we need to align buffers to the
page or sector size.
Author: Andres Freund <andres@anarazel.de>
Author: Thomas Munro <thomas.munro@gmail.com>
---
src/include/c.h | 20 ++++++++++++--------
src/include/pg_config_manual.h | 8 ++++++++
2 files changed, 20 insertions(+), 8 deletions(-)
diff --git a/src/include/c.h b/src/include/c.h
index d70ed84ac5..0deaca0414 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -1070,17 +1070,18 @@ extern void ExceptionalCondition(const char *conditionName,
/*
* Use this, not "char buf[BLCKSZ]", to declare a field or local variable
- * holding a page buffer, if that page might be accessed as a page and not
- * just a string of bytes. Otherwise the variable might be under-aligned,
- * causing problems on alignment-picky hardware. (In some places, we use
- * this to declare buffers even though we only pass them to read() and
- * write(), because copying to/from aligned buffers is usually faster than
- * using unaligned buffers.) We include both "double" and "int64" in the
- * union to ensure that the compiler knows the value must be MAXALIGN'ed
- * (cf. configure's computation of MAXIMUM_ALIGNOF).
+ * holding a page buffer, if that page might be accessed as a page or passed to
+ * an I/O function and not just a string of bytes. Otherwise the variable
+ * might be under-aligned, causing problems on alignment-picky hardware, or if
+ * PG_O_DIRECT is used. We include both "double" and "int64" in the union to
+ * ensure that the compiler knows the value must be MAXALIGN'ed (cf.
+ * configure's computation of MAXIMUM_ALIGNOF).
*/
typedef union PGAlignedBlock
{
+#ifdef pg_attribute_aligned
+ pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#endif
char data[BLCKSZ];
double force_align_d;
int64 force_align_i64;
@@ -1089,6 +1090,9 @@ typedef union PGAlignedBlock
/* Same, but for an XLOG_BLCKSZ-sized buffer */
typedef union PGAlignedXLogBlock
{
+#ifdef pg_attribute_aligned
+ pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#endif
char data[XLOG_BLCKSZ];
double force_align_d;
int64 force_align_i64;
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index f2a106f983..a2ad08a110 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -227,6 +227,14 @@
*/
#define PG_CACHE_LINE_SIZE 128
+/*
+ * Assumed memory alignment requirement for direct I/O. The real requirement
+ * may be based on sectors or pages. The default is the typical modern sector
+ * size and virtual memory page size, which is enough for currently known
+ * systems.
+ */
+#define PG_IO_ALIGN_SIZE 4096
+
/*
*------------------------------------------------------------------------
* The following symbols are for enabling debugging code, not for
--
2.35.1
0002-XXX-palloc_io_aligned-not-for-review-here.patch (text/x-patch; charset=US-ASCII)
From 7a1521dcafbc42b2482d16e8dd0781dfbd5ef2b4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 18 Oct 2022 09:47:45 -0700
Subject: [PATCH 2/3] XXX palloc_io_aligned() -- not for review here
This patch will be posted for review by David Rowley in its own thread,
but a copy is included here as a dependency.
---
contrib/bloom/blinsert.c | 2 +-
src/backend/access/gist/gistbuild.c | 8 +-
src/backend/access/gist/gistbuildbuffers.c | 5 +-
src/backend/access/heap/rewriteheap.c | 2 +-
src/backend/access/nbtree/nbtree.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 8 +-
src/backend/access/spgist/spginsert.c | 2 +-
src/backend/nodes/gen_node_support.pl | 2 +-
src/backend/storage/buffer/buf_init.c | 7 +-
src/backend/storage/buffer/localbuf.c | 4 +-
src/backend/storage/page/bufpage.c | 2 +-
src/backend/storage/smgr/md.c | 14 ++-
src/backend/utils/mmgr/mcxt.c | 99 ++++++++++++++++++++--
src/include/nodes/memnodes.h | 5 +-
src/include/utils/memutils_internal.h | 4 +-
src/include/utils/palloc.h | 5 ++
16 files changed, 141 insertions(+), 30 deletions(-)
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index dd26d6ac29..b0da3ac529 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -166,7 +166,7 @@ blbuildempty(Relation index)
Page metapage;
/* Construct metapage. */
- metapage = (Page) palloc(BLCKSZ);
+ metapage = (Page) palloc_io_aligned(BLCKSZ, 0);
BloomFillMetapage(index, metapage);
/*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index fb0f466708..2daa9b2e10 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -415,7 +415,7 @@ gist_indexsortbuild(GISTBuildState *state)
* Write an empty page as a placeholder for the root page. It will be
* replaced with the real root page at the end.
*/
- page = palloc0(BLCKSZ);
+ page = palloc_io_aligned(BLCKSZ, MCXT_ALLOC_ZERO);
smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
page, true);
state->pages_allocated++;
@@ -509,7 +509,7 @@ gist_indexsortbuild_levelstate_add(GISTBuildState *state,
levelstate->current_page++;
if (levelstate->pages[levelstate->current_page] == NULL)
- levelstate->pages[levelstate->current_page] = palloc(BLCKSZ);
+ levelstate->pages[levelstate->current_page] = palloc_io_aligned(BLCKSZ, 0);
newPage = levelstate->pages[levelstate->current_page];
gistinitpage(newPage, old_page_flags);
@@ -579,7 +579,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
/* Create page and copy data */
data = (char *) (dist->list);
- target = palloc0(BLCKSZ);
+ target = (Page) palloc_io_aligned(BLCKSZ, 0);
gistinitpage(target, isleaf ? F_LEAF : 0);
for (int i = 0; i < dist->block.num; i++)
{
@@ -630,7 +630,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
if (parent == NULL)
{
parent = palloc0(sizeof(GistSortedBuildLevelState));
- parent->pages[0] = (Page) palloc(BLCKSZ);
+ parent->pages[0] = (Page) palloc_io_aligned(BLCKSZ, 0);
parent->parent = NULL;
gistinitpage(parent->pages[0], 0);
diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 538e3880c9..9e188633ae 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -186,8 +186,9 @@ gistAllocateNewPageBuffer(GISTBuildBuffers *gfbb)
{
GISTNodeBufferPage *pageBuffer;
- pageBuffer = (GISTNodeBufferPage *) MemoryContextAllocZero(gfbb->context,
- BLCKSZ);
+ pageBuffer = (GISTNodeBufferPage *)
+ MemoryContextAllocIOAligned(gfbb->context,
+ BLCKSZ, MCXT_ALLOC_ZERO);
pageBuffer->prev = InvalidBlockNumber;
/* Set page free space */
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index b01b39b008..6fe7f1aed4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -257,7 +257,7 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
state->rs_old_rel = old_heap;
state->rs_new_rel = new_heap;
- state->rs_buffer = (Page) palloc(BLCKSZ);
+ state->rs_buffer = (Page) palloc_io_aligned(BLCKSZ, 0);
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b52eca8f38..924da953aa 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -153,7 +153,7 @@ btbuildempty(Relation index)
Page metapage;
/* Construct metapage. */
- metapage = (Page) palloc(BLCKSZ);
+ metapage = (Page) palloc_io_aligned(BLCKSZ, 0);
_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 501e011ce1..563e6cce1f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -619,7 +619,7 @@ _bt_blnewpage(uint32 level)
Page page;
BTPageOpaque opaque;
- page = (Page) palloc(BLCKSZ);
+ page = (Page) palloc_io_aligned(BLCKSZ, 0);
/* Zero the page and set up standard page header info */
_bt_pageinit(page, BLCKSZ);
@@ -660,7 +660,9 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
while (blkno > wstate->btws_pages_written)
{
if (!wstate->btws_zeropage)
- wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
+ wstate->btws_zeropage =
+ (Page) palloc_io_aligned(BLCKSZ, MCXT_ALLOC_ZERO);
+
/* don't set checksum for all-zero page */
smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
wstate->btws_pages_written++,
@@ -1170,7 +1172,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* set to point to "P_NONE"). This changes the index to the "valid" state
* by filling in a valid magic number in the metapage.
*/
- metapage = (Page) palloc(BLCKSZ);
+ metapage = (Page) palloc_io_aligned(BLCKSZ, 0);
_bt_initmetapage(metapage, rootblkno, rootlevel,
wstate->inskey->allequalimage);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index c6821b5952..d5b83710e4 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -158,7 +158,7 @@ spgbuildempty(Relation index)
Page page;
/* Construct metapage. */
- page = (Page) palloc(BLCKSZ);
+ page = (Page) palloc_io_aligned(BLCKSZ, 0);
SpGistInitMetapage(page);
/*
diff --git a/src/backend/nodes/gen_node_support.pl b/src/backend/nodes/gen_node_support.pl
index 81b8c184a9..9598056821 100644
--- a/src/backend/nodes/gen_node_support.pl
+++ b/src/backend/nodes/gen_node_support.pl
@@ -142,7 +142,7 @@ my @abstract_types = qw(Node);
# they otherwise don't participate in node support.
my @extra_tags = qw(
IntList OidList XidList
- AllocSetContext GenerationContext SlabContext
+ AllocSetContext GenerationContext SlabContext AlignedAllocRedirectContext
TIDBitmap
WindowObjectData
);
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 6b6264854e..edd9bd48c3 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -79,8 +79,9 @@ InitBufferPool(void)
&foundDescs);
BufferBlocks = (char *)
- ShmemInitStruct("Buffer Blocks",
- NBuffers * (Size) BLCKSZ, &foundBufs);
+ TYPEALIGN(BLCKSZ,
+ ShmemInitStruct("Buffer Blocks",
+ (NBuffers + 1) * (Size) BLCKSZ, &foundBufs));
/* Align condition variables to cacheline boundary. */
BufferIOCVArray = (ConditionVariableMinimallyPadded *)
@@ -164,6 +165,8 @@ BufferShmemSize(void)
size = add_size(size, PG_CACHE_LINE_SIZE);
/* size of data pages */
+ /* to allow aligning buffer blocks */
+ size = add_size(size, BLCKSZ);
size = add_size(size, mul_size(NBuffers, BLCKSZ));
/* size of stuff controlled by freelist.c */
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..f51d3527f6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -546,8 +546,8 @@ GetLocalBufferStorage(void)
/* And don't overflow MaxAllocSize, either */
num_bufs = Min(num_bufs, MaxAllocSize / BLCKSZ);
- cur_block = (char *) MemoryContextAlloc(LocalBufferContext,
- num_bufs * BLCKSZ);
+ cur_block = (char *) MemoryContextAllocIOAligned(LocalBufferContext,
+ num_bufs * BLCKSZ, 0);
next_buf_in_block = 0;
num_bufs_in_block = num_bufs;
}
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 8b617c7e79..42f6f1782a 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1522,7 +1522,7 @@ PageSetChecksumCopy(Page page, BlockNumber blkno)
* and second to avoid wasting space in processes that never call this.
*/
if (pageCopy == NULL)
- pageCopy = MemoryContextAlloc(TopMemoryContext, BLCKSZ);
+ pageCopy = MemoryContextAllocIOAligned(TopMemoryContext, BLCKSZ, 0);
memcpy(pageCopy, (char *) page, BLCKSZ);
((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index a515bb36ac..719721a894 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -439,6 +439,10 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
int nbytes;
MdfdVec *v;
+#if PG_O_DIRECT != 0
+ AssertPointerAlignment(buffer, PG_IO_ALIGN_SIZE);
+#endif
+
/* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
Assert(blocknum >= mdnblocks(reln, forknum));
@@ -661,6 +665,10 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
int nbytes;
MdfdVec *v;
+#if PG_O_DIRECT != 0
+ AssertPointerAlignment(buffer, PG_IO_ALIGN_SIZE);
+#endif
+
TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
reln->smgr_rlocator.locator.spcOid,
reln->smgr_rlocator.locator.dbOid,
@@ -726,6 +734,10 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
int nbytes;
MdfdVec *v;
+#if PG_O_DIRECT != 0
+ AssertPointerAlignment(buffer, PG_IO_ALIGN_SIZE);
+#endif
+
/* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
Assert(blocknum < mdnblocks(reln, forknum));
@@ -1280,7 +1292,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
*/
if (nblocks < ((BlockNumber) RELSEG_SIZE))
{
- char *zerobuf = palloc0(BLCKSZ);
+ char *zerobuf = palloc_io_aligned(BLCKSZ, MCXT_ALLOC_ZERO);
mdextend(reln, forknum,
nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index f526ca82c1..807c0f3af3 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -36,6 +36,9 @@ static void BogusFree(void *pointer);
static void *BogusRealloc(void *pointer, Size size);
static MemoryContext BogusGetChunkContext(void *pointer);
static Size BogusGetChunkSpace(void *pointer);
+static void AlignedAllocFree(void *pointer);
+static MemoryContext AlignedAllocGetChunkContext(void *pointer);
+
/*****************************************************************************
* GLOBAL MEMORY *
@@ -84,6 +87,10 @@ static const MemoryContextMethods mcxt_methods[] = {
[MCTX_SLAB_ID].check = SlabCheck,
#endif
+ /* in here */
+ [MCTX_ALIGNED_REDIRECT_ID].get_chunk_context = AlignedAllocGetChunkContext,
+ [MCTX_ALIGNED_REDIRECT_ID].free_p = AlignedAllocFree,
+
/*
* Unused (as yet) IDs should have dummy entries here. This allows us to
* fail cleanly if a bogus pointer is passed to pfree or the like. It
@@ -110,11 +117,6 @@ static const MemoryContextMethods mcxt_methods[] = {
[MCTX_UNUSED4_ID].realloc = BogusRealloc,
[MCTX_UNUSED4_ID].get_chunk_context = BogusGetChunkContext,
[MCTX_UNUSED4_ID].get_chunk_space = BogusGetChunkSpace,
-
- [MCTX_UNUSED5_ID].free_p = BogusFree,
- [MCTX_UNUSED5_ID].realloc = BogusRealloc,
- [MCTX_UNUSED5_ID].get_chunk_context = BogusGetChunkContext,
- [MCTX_UNUSED5_ID].get_chunk_space = BogusGetChunkSpace,
};
/*
@@ -1306,11 +1308,16 @@ void
pfree(void *pointer)
{
#ifdef USE_VALGRIND
+ MemoryContextMethodID method = GetMemoryChunkMethodID(pointer);
MemoryContext context = GetMemoryChunkContext(pointer);
#endif
MCXT_METHOD(pointer, free_p) (pointer);
- VALGRIND_MEMPOOL_FREE(context, pointer);
+
+#ifdef USE_VALGRIND
+ if (method != MCTX_ALIGNED_REDIRECT_ID)
+ VALGRIND_MEMPOOL_FREE(context, pointer);
+#endif
}
/*
@@ -1497,3 +1504,83 @@ pchomp(const char *in)
n--;
return pnstrdup(in, n);
}
+
+/*
+ * pointer to fake memory context + pointer to actual allocation
+ */
+#define ALIGNED_ALLOC_CHUNK_SIZE (sizeof(uintptr_t) + sizeof(uintptr_t))
+
+#include "utils/memutils_memorychunk.h"
+
+static void
+AlignedAllocFree(void *pointer)
+{
+ MemoryChunk *chunk = PointerGetMemoryChunk(pointer);
+ void *unaligned;
+
+ Assert(!MemoryChunkIsExternal(chunk));
+
+ unaligned = MemoryChunkGetBlock(chunk);
+
+ pfree(unaligned);
+}
+
+MemoryContext
+AlignedAllocGetChunkContext(void *pointer)
+{
+ MemoryChunk *chunk = PointerGetMemoryChunk(pointer);
+
+ Assert(!MemoryChunkIsExternal(chunk));
+
+ return GetMemoryChunkContext(MemoryChunkGetBlock(chunk));
+}
+
+void *
+MemoryContextAllocAligned(MemoryContext context,
+ Size size, Size alignto, int flags)
+{
+ Size alloc_size;
+ void *unaligned;
+ void *aligned;
+
+ /* wouldn't make much sense to waste that much space */
+ Assert(alignto < (128 * 1024 * 1024));
+
+ if (alignto < MAXIMUM_ALIGNOF)
+ return MemoryContextAllocExtended(context, size, flags);
+
+ /* allocate enough space for alignment padding */
+ alloc_size = size + alignto + sizeof(MemoryChunk);
+
+ unaligned = MemoryContextAllocExtended(context, alloc_size, flags);
+
+ aligned = (char *) unaligned + sizeof(MemoryChunk);
+ aligned = (void *) (TYPEALIGN(alignto, aligned) - sizeof(MemoryChunk));
+
+ MemoryChunkSetHdrMask(aligned, unaligned, 0, MCTX_ALIGNED_REDIRECT_ID);
+
+ /* XXX: should we adjust valgrind state here? */
+
+ Assert((char *) TYPEALIGN(alignto, MemoryChunkGetPointer(aligned)) == MemoryChunkGetPointer(aligned));
+
+ return MemoryChunkGetPointer(aligned);
+}
+
+void *
+MemoryContextAllocIOAligned(MemoryContext context, Size size, int flags)
+{
+ // FIXME: don't hardcode page size
+ return MemoryContextAllocAligned(context, size, 4096, flags);
+}
+
+void *
+palloc_aligned(Size size, Size alignto, int flags)
+{
+ return MemoryContextAllocAligned(CurrentMemoryContext, size, alignto, flags);
+}
+
+void *
+palloc_io_aligned(Size size, int flags)
+{
+ return MemoryContextAllocIOAligned(CurrentMemoryContext, size, flags);
+}
diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index 63d07358cd..dcfe41806a 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -104,10 +104,11 @@ typedef struct MemoryContextData
*
* Add new context types to the set accepted by this macro.
*/
-#define MemoryContextIsValid(context) \
+#define MemoryContextIsValid(context) \
((context) != NULL && \
(IsA((context), AllocSetContext) || \
IsA((context), SlabContext) || \
- IsA((context), GenerationContext)))
+ IsA((context), GenerationContext) || \
+ IsA((context), AlignedAllocRedirectContext)))
#endif /* MEMNODES_H */
diff --git a/src/include/utils/memutils_internal.h b/src/include/utils/memutils_internal.h
index bc2cbdd506..9611a192a2 100644
--- a/src/include/utils/memutils_internal.h
+++ b/src/include/utils/memutils_internal.h
@@ -92,8 +92,8 @@ typedef enum MemoryContextMethodID
MCTX_ASET_ID,
MCTX_GENERATION_ID,
MCTX_SLAB_ID,
- MCTX_UNUSED4_ID, /* available */
- MCTX_UNUSED5_ID /* 111 occurs in wipe_mem'd memory */
+ MCTX_ALIGNED_REDIRECT_ID,
+ MCTX_UNUSED4_ID /* 111 occurs in wipe_mem'd memory */
} MemoryContextMethodID;
/*
diff --git a/src/include/utils/palloc.h b/src/include/utils/palloc.h
index 8eee0e2938..0b0ba2a953 100644
--- a/src/include/utils/palloc.h
+++ b/src/include/utils/palloc.h
@@ -73,10 +73,15 @@ extern void *MemoryContextAllocZero(MemoryContext context, Size size);
extern void *MemoryContextAllocZeroAligned(MemoryContext context, Size size);
extern void *MemoryContextAllocExtended(MemoryContext context,
Size size, int flags);
+extern void *MemoryContextAllocAligned(MemoryContext context,
+ Size size, Size alignto, int flags);
+extern void *MemoryContextAllocIOAligned(MemoryContext context, Size size, int flags);
extern void *palloc(Size size);
extern void *palloc0(Size size);
extern void *palloc_extended(Size size, int flags);
+extern void *palloc_aligned(Size size, Size alignto, int flags);
+extern void *palloc_io_aligned(Size size, int flags);
extern pg_nodiscard void *repalloc(void *pointer, Size size);
extern pg_nodiscard void *repalloc_extended(void *pointer,
Size size, int flags);
--
2.35.1
0003-Add-direct-I-O-settings-developer-only.patch (text/x-patch; charset=US-ASCII)
From 819a406f029b04ab6a500f63fe9c154332b65d8e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 3 Oct 2022 21:58:22 -0700
Subject: [PATCH 3/3] Add direct I/O settings (developer-only).
Provide a way to ask the kernel to use O_DIRECT (or local equivalent)
for data and WAL files. This hurts performance currently and is not
intended for end-users yet. Later proposed work will introduce our own
I/O clustering, read-ahead, etc to replace the kernel features that are
disabled with this option.
This replaces the previous logic that would use O_DIRECT for the WAL in
limited and obscure cases, now that there is an explicit setting.
Discussion: https://postgr.es/m/
Author: Andres Freund <andres@anarazel.de>
Author: Thomas Munro <thomas.munro@gmail.com>
---
doc/src/sgml/config.sgml | 51 ++++++++++++++++++++
src/backend/access/transam/xlog.c | 53 +++++++++++++--------
src/backend/access/transam/xlogprefetcher.c | 2 +-
src/backend/storage/buffer/bufmgr.c | 13 +++--
src/backend/storage/buffer/localbuf.c | 4 +-
src/backend/storage/file/fd.c | 5 ++
src/backend/storage/smgr/md.c | 29 +++++++++--
src/backend/storage/smgr/smgr.c | 20 ++++++++
src/backend/utils/misc/guc_tables.c | 33 +++++++++++++
src/include/access/xlog.h | 2 +
src/include/storage/fd.h | 6 ++-
src/include/storage/smgr.h | 5 ++
src/include/utils/guc_hooks.h | 2 +
13 files changed, 190 insertions(+), 35 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 559eb898a9..2d860dd900 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11011,6 +11011,57 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-io-data-direct" xreflabel="io_data_direct">
+ <term><varname>io_data_direct</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>io_data_direct</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Ask the kernel to minimize caching effects for relation data files
+ using <literal>O_DIRECT</literal> (most Unix-like systems),
+ <literal>F_NOCACHE</literal> (macOS) or
+ <literal>FILE_FLAG_NO_BUFFERING</literal> (Windows). Currently this
+ hurts performance, and is intended for developer testing only.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-io-wal-direct" xreflabel="io_wal_direct">
+ <term><varname>io_wal_direct</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>io_wal_direct</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Ask the kernel to minimize caching effects while writing WAL files
+ using <literal>O_DIRECT</literal> (most Unix-like systems),
+ <literal>F_NOCACHE</literal> (macOS) or
+ <literal>FILE_FLAG_NO_BUFFERING</literal> (Windows). Currently this
+ hurts performance, and is intended for developer testing only.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-io-wal-init-direct" xreflabel="io_wal_init_direct">
+ <term><varname>io_wal_init_direct</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>io_wal_init_direct</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Ask the kernel to minimize caching effects while initializing WAL files
+ using <literal>O_DIRECT</literal> (most Unix-like systems),
+ <literal>F_NOCACHE</literal> (macOS) or
+ <literal>FILE_FLAG_NO_BUFFERING</literal> (Windows). Currently this
+ hurts performance, and is intended for developer testing only.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-post-auth-delay" xreflabel="post_auth_delay">
<term><varname>post_auth_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f10effe3a..5663bdf856 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -138,6 +138,8 @@ int wal_retrieve_retry_interval = 5000;
int max_slot_wal_keep_size_mb = -1;
int wal_decode_buffer_size = 512 * 1024;
bool track_wal_io_timing = false;
+bool io_wal_direct = false;
+bool io_wal_init_direct = false;
#ifdef WAL_DEBUG
bool XLOG_DEBUG = false;
@@ -2926,6 +2928,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
XLogSegNo max_segno;
int fd;
int save_errno;
+ int open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
Assert(logtli != 0);
@@ -2958,8 +2961,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
unlink(tmppath);
+ if (io_wal_init_direct)
+ open_flags |= PG_O_DIRECT;
+
/* do not use get_sync_bit() here --- want to fsync only at end of fill */
- fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+ fd = BasicOpenFile(tmppath, open_flags);
if (fd < 0)
ereport(ERROR,
(errcode_for_file_access(),
@@ -3373,7 +3379,7 @@ XLogFileClose(void)
* use the cache to read the WAL segment.
*/
#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
- if (!XLogIsNeeded())
+ if (!XLogIsNeeded() && !io_wal_direct)
(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
#endif
@@ -4473,6 +4479,21 @@ show_in_hot_standby(void)
return RecoveryInProgress() ? "on" : "off";
}
+/*
+ * GUC check for direct I/O support.
+ */
+bool
+check_io_wal_direct(bool *newval, void **extra, GucSource source)
+{
+#if PG_O_DIRECT == 0
+ if (*newval)
+ {
+ GUC_check_errdetail("io_wal_direct and io_wal_init_direct are not supported on this platform.");
+ return false;
+ }
+#endif
+ return true;
+}
/*
* Read the control file, set respective GUCs.
@@ -8056,35 +8077,27 @@ xlog_redo(XLogReaderState *record)
}
/*
- * Return the (possible) sync flag used for opening a file, depending on the
- * value of the GUC wal_sync_method.
+ * Return the extra open flags used for opening a file, depending on the
+ * value of the GUCs wal_sync_method, fsync and io_wal_direct.
*/
static int
get_sync_bit(int method)
{
int o_direct_flag = 0;
- /* If fsync is disabled, never open in sync mode */
- if (!enableFsync)
- return 0;
-
/*
- * Optimize writes by bypassing kernel cache with O_DIRECT when using
- * O_SYNC and O_DSYNC. But only if archiving and streaming are disabled,
- * otherwise the archive command or walsender process will read the WAL
- * soon after writing it, which is guaranteed to cause a physical read if
- * we bypassed the kernel cache. We also skip the
- * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
- * reason.
- *
- * Never use O_DIRECT in walreceiver process for similar reasons; the WAL
+ * Use O_DIRECT if requested, except in walreceiver process. The WAL
* written by walreceiver is normally read by the startup process soon
- * after it's written. Also, walreceiver performs unaligned writes, which
+ * after it's written. Also, walreceiver performs unaligned writes, which
* don't work with O_DIRECT, so it is required for correctness too.
*/
- if (!XLogIsNeeded() && !AmWalReceiverProcess())
+ if (io_wal_direct && !AmWalReceiverProcess())
o_direct_flag = PG_O_DIRECT;
+ /* If fsync is disabled, never open in sync mode */
+ if (!enableFsync)
+ return o_direct_flag;
+
switch (method)
{
/*
@@ -8096,7 +8109,7 @@ get_sync_bit(int method)
case SYNC_METHOD_FSYNC:
case SYNC_METHOD_FSYNC_WRITETHROUGH:
case SYNC_METHOD_FDATASYNC:
- return 0;
+ return o_direct_flag;
#ifdef O_SYNC
case SYNC_METHOD_OPEN:
return O_SYNC | o_direct_flag;
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 0cf03945ee..d840078afc 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -785,7 +785,7 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
block->prefetch_buffer = InvalidBuffer;
return LRQ_NEXT_IO;
}
- else
+ else if (!io_data_direct)
{
/*
* This shouldn't be possible, because we already determined
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6b95381481..9918855f37 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -535,7 +535,7 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
* Try to initiate an asynchronous read. This returns false in
* recovery if the relation file doesn't exist.
*/
- if (smgrprefetch(smgr_reln, forkNum, blockNum))
+ if (!io_data_direct && smgrprefetch(smgr_reln, forkNum, blockNum))
result.initiated_io = true;
#endif /* USE_PREFETCH */
}
@@ -582,11 +582,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
* the kernel and therefore didn't really initiate I/O, and no way to know when
* the I/O completes other than using synchronous ReadBuffer().
*
- * 3. Otherwise, the buffer wasn't already cached by PostgreSQL, and either
+ * 3. Otherwise, the buffer wasn't already cached by PostgreSQL, and
* USE_PREFETCH is not defined (this build doesn't support prefetching due to
- * lack of a kernel facility), or the underlying relation file wasn't found and
- * we are in recovery. (If the relation file wasn't found and we are not in
- * recovery, an error is raised).
+ * lack of a kernel facility), io_data_direct is enabled, or the underlying
+ * relation file wasn't found and we are in recovery. (If the relation file
+ * wasn't found and we are not in recovery, an error is raised).
*/
PrefetchBufferResult
PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
@@ -4908,6 +4908,9 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
{
PendingWriteback *pending;
+ if (io_data_direct)
+ return;
+
/*
* Add buffer to the pending writeback array, unless writeback control is
* disabled.
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index f51d3527f6..f9c82a789e 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -87,8 +87,8 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
{
#ifdef USE_PREFETCH
/* Not in buffers, so initiate prefetch */
- smgrprefetch(smgr, forkNum, blockNum);
- result.initiated_io = true;
+ if (!io_data_direct && smgrprefetch(smgr, forkNum, blockNum))
+ result.initiated_io = true;
#endif /* USE_PREFETCH */
}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 4151cafec5..aa720952f8 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2021,6 +2021,11 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
if (nbytes <= 0)
return;
+#ifdef PG_O_DIRECT
+ if (VfdCache[file].fileFlags & PG_O_DIRECT)
+ return;
+#endif
+
returnCode = FileAccess(file);
if (returnCode < 0)
return;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 719721a894..20ec37c310 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -142,6 +142,21 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
+static inline int
+_mdfd_open_flags(ForkNumber forkNum)
+{
+ int flags = O_RDWR | PG_BINARY;
+
+ /*
+ * XXX: not clear if direct IO ever is interesting for other forks? The
+ * FSM fork currently often ends up very fragmented when using direct IO,
+ * for example.
+ */
+ if (io_data_direct /* && forkNum == MAIN_FORKNUM */)
+ flags |= PG_O_DIRECT;
+
+ return flags;
+}
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -205,14 +220,14 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
path = relpath(reln->smgr_rlocator, forknum);
- fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+ fd = PathNameOpenFile(path, _mdfd_open_flags(forknum) | O_CREAT | O_EXCL);
if (fd < 0)
{
int save_errno = errno;
if (isRedo)
- fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+ fd = PathNameOpenFile(path, _mdfd_open_flags(forknum));
if (fd < 0)
{
/* be sure to report the error reported by create, not open */
@@ -513,7 +528,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
path = relpath(reln->smgr_rlocator, forknum);
- fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+ fd = PathNameOpenFile(path, _mdfd_open_flags(forknum));
if (fd < 0)
{
@@ -584,6 +599,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
off_t seekpos;
MdfdVec *v;
+ Assert(!io_data_direct);
+
v = _mdfd_getseg(reln, forknum, blocknum, false,
InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
if (v == NULL)
@@ -609,6 +626,8 @@ void
mdwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks)
{
+ Assert(!io_data_direct);
+
/*
* Issue flush requests in as few requests as possible; have to split at
* segment boundaries though, since those are actually separate files.
@@ -1186,7 +1205,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
fullpath = _mdfd_segpath(reln, forknum, segno);
/* open the file */
- fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | oflags);
+ fd = PathNameOpenFile(fullpath, _mdfd_open_flags(forknum) | oflags);
pfree(fullpath);
@@ -1395,7 +1414,7 @@ mdsyncfiletag(const FileTag *ftag, char *path)
strlcpy(path, p, MAXPGPATH);
pfree(p);
- file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+ file = PathNameOpenFile(path, _mdfd_open_flags(ftag->forknum));
if (file < 0)
return -1;
need_to_close = true;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index c1a5febcbf..706a52b9f1 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -20,6 +20,7 @@
#include "access/xlogutils.h"
#include "lib/ilist.h"
#include "storage/bufmgr.h"
+#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/md.h"
#include "storage/smgr.h"
@@ -27,6 +28,9 @@
#include "utils/inval.h"
+/* GUCs */
+bool io_data_direct = false;
+
/*
* This struct of function pointers defines the API between smgr.c and
* any individual storage manager module. Note that smgr subfunctions are
@@ -735,3 +739,19 @@ ProcessBarrierSmgrRelease(void)
smgrreleaseall();
return true;
}
+
+/*
+ * Check if this build allows smgr implementations to enable direct I/O.
+ */
+bool
+check_io_data_direct(bool *newval, void **extra, GucSource source)
+{
+#if PG_O_DIRECT == 0
+ if (*newval)
+ {
+ GUC_check_errdetail("io_data_direct is not supported on this platform.");
+ return false;
+ }
+#endif
+ return true;
+}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 05ab087934..e324378ad4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1925,6 +1925,39 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"io_data_direct", PGC_SUSET, DEVELOPER_OPTIONS,
+ gettext_noop("Access data files with direct I/O."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &io_data_direct,
+ false,
+ check_io_data_direct, NULL, NULL
+ },
+
+ {
+ {"io_wal_direct", PGC_SUSET, DEVELOPER_OPTIONS,
+ gettext_noop("Write WAL files with direct I/O."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &io_wal_direct,
+ false,
+ check_io_wal_direct, NULL, NULL
+ },
+
+ {
+ {"io_wal_init_direct", PGC_SUSET, DEVELOPER_OPTIONS,
+ gettext_noop("Initialize WAL files with direct I/O."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &io_wal_init_direct,
+ false,
+ check_io_wal_direct, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1fbd48fbda..6220370036 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -51,6 +51,8 @@ extern PGDLLIMPORT char *wal_consistency_checking_string;
extern PGDLLIMPORT bool log_checkpoints;
extern PGDLLIMPORT bool track_wal_io_timing;
extern PGDLLIMPORT int wal_decode_buffer_size;
+extern PGDLLIMPORT bool io_wal_direct;
+extern PGDLLIMPORT bool io_wal_init_direct;
extern PGDLLIMPORT int CheckPointSegments;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index c0a212487d..283ff21e31 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -44,6 +44,7 @@
#define FD_H
#include <dirent.h>
+#include <fcntl.h>
typedef enum RecoveryInitSyncMethod
{
@@ -82,9 +83,10 @@ extern PGDLLIMPORT int max_safe_fds;
* to the appropriate Windows flag in src/port/open.c. We simulate it with
* fcntl(F_NOCACHE) on macOS inside fd.c's open() wrapper. We use the name
* PG_O_DIRECT rather than defining O_DIRECT in that case (probably not a good
- * idea on a Unix).
+ * idea on a Unix). We can only use it if the compiler will correctly align
+ * PGAlignedBlock for us, though.
*/
-#if defined(O_DIRECT)
+#if defined(O_DIRECT) && defined(pg_attribute_aligned)
#define PG_O_DIRECT O_DIRECT
#elif defined(F_NOCACHE)
#define PG_O_DIRECT 0x80000000
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a07715356b..ef75934a16 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -17,6 +17,10 @@
#include "lib/ilist.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "utils/guc.h"
+
+/* GUCs */
+extern PGDLLIMPORT bool io_data_direct;
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
@@ -107,5 +111,6 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void AtEOXact_SMgr(void);
extern bool ProcessBarrierSmgrRelease(void);
+extern bool check_io_data_direct(bool *newval, void **extra, GucSource source);
#endif /* SMGR_H */
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index f1a9a183b4..a9748f6b34 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -59,6 +59,8 @@ extern bool check_effective_io_concurrency(int *newval, void **extra,
GucSource source);
extern bool check_huge_page_size(int *newval, void **extra, GucSource source);
extern const char *show_in_hot_standby(void);
+extern bool check_io_data_direct(bool *newval, void **extra, GucSource source);
+extern bool check_io_wal_direct(bool *newval, void **extra, GucSource source);
extern bool check_locale_messages(char **newval, void **extra, GucSource source);
extern void assign_locale_messages(const char *newval, void *extra);
extern bool check_locale_monetary(char **newval, void **extra, GucSource source);
--
2.35.1
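The reordering inside get_sync_bit() is easy to miss in the diff: the O_DIRECT decision now happens before the fsync check, so direct I/O can be requested even when fsync is off. A minimal standalone model of that control flow might look like the following. It is only a sketch: every sketch_ name and flag value is invented for illustration, and the OPEN_DSYNC branch is assumed to mirror the O_SYNC branch shown in the hunk.

```c
#include <stdbool.h>

/* Stand-in flag values; the patch itself uses O_SYNC and PG_O_DIRECT. */
#define SKETCH_O_SYNC   0x1
#define SKETCH_O_DSYNC  0x2
#define SKETCH_O_DIRECT 0x4

/* Hypothetical stand-in for the wal_sync_method values. */
typedef enum
{
    SKETCH_SYNC_FSYNC,
    SKETCH_SYNC_FDATASYNC,
    SKETCH_SYNC_OPEN,
    SKETCH_SYNC_OPEN_DSYNC
} SketchSyncMethod;

/*
 * Model of the patched logic: decide on direct I/O first, then add a sync
 * flag only if fsync is enabled and the method is one of the open-time ones.
 */
int
sketch_get_sync_bit(SketchSyncMethod method, bool enable_fsync,
                    bool wal_direct, bool am_walreceiver)
{
    int direct = 0;

    /* walreceiver never gets O_DIRECT (unaligned writes, reads follow soon) */
    if (wal_direct && !am_walreceiver)
        direct = SKETCH_O_DIRECT;

    /* with fsync off there is no sync flag, but direct I/O still applies */
    if (!enable_fsync)
        return direct;

    switch (method)
    {
        case SKETCH_SYNC_FSYNC:
        case SKETCH_SYNC_FDATASYNC:
            return direct;
        case SKETCH_SYNC_OPEN:
            return SKETCH_O_SYNC | direct;
        case SKETCH_SYNC_OPEN_DSYNC:    /* assumed to mirror the O_SYNC case */
            return SKETCH_O_DSYNC | direct;
    }
    return direct;
}
```

In this model the old XLogIsNeeded() test no longer plays any part in choosing O_DIRECT; only the setting and the process type do.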
On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
Hi,
Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the
AIO patch-set[1]. It adds three new settings, defaulting to off:
io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation
You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings. (By "better", I mean
reducing the number of new GUCs - which is less important for developer
GUCs anyway.)
DIO is slower, but not so much that it can't run under CI. I suggest to
add an 099 commit to enable the feature during development.
Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment
--
Justin
On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation
You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings. (By "better", I mean
reducing the number of new GUCs - which is less important for developer
GUCs anyway.)
Interesting idea. So "direct_io = data, wal, wal_init", or maybe that
should be spelled io_direct. ("Direct I/O" is a common term of art,
but we also have some more io_XXX GUCs in later patches, so it's hard
to choose...)
DIO is slower, but not so much that it can't run under CI. I suggest to
add an 099 commit to enable the feature during development.
Good idea, will do.
Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment
Oh, so BufFile is palloc'd and contains one of these. BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd. I will think about what to do about that...
Hi,
On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:
On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation
You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings.
In the past more complicated GUCs have not been well received, but it does
seem like a nice way to reduce the amount of redundant stuff.
Perhaps we could use the guc assignment hook to transform the input value into
a bitmask?
(By "better", I mean reducing the number of new GUCs - which is less
important for developer GUCs anyway.)
FWIW, if / once we get to actual AIO, at least some of these would stop being
developer-only GUCs. There's substantial performance benefits in using DIO
with AIO. Buffered IO requires the CPU to copy the data from the userspace
into the kernelspace. But DIO can use DMA for that, freeing the CPU to do more
useful work. Buffered IO tops out much much earlier than AIO + DIO, and
unfortunately tops out at much lower speeds on server CPUs.
DIO is slower, but not so much that it can't run under CI. I suggest to
add an 099 commit to enable the feature during development.
Good idea, will do.
Might be worth to additionally have a short tap test that does some basic
stuff with DIO and leave that enabled? I think it'd be good to have
check-world exercise DIO on dev machines, to reduce the likelihood of finding
problems only in CI, which is somewhat painful.
Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment
Oh, so BufFile is palloc'd and contains one of these. BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd. I will think about what to do about that...
It might be worth having two different versions of the struct, so we don't
impose unnecessarily high alignment everywhere?
Greetings,
Andres Freund
Hi,
On 2022-11-01 15:54:02 -0700, Andres Freund wrote:
On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:
Oh, so BufFile is palloc'd and contains one of these. BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd. I will think about what to do about that...
It might be worth having two different versions of the struct, so we don't
impose unnecessarily high alignment everywhere?
Although it might actually be worth aligning fully everywhere - there's a
noticeable performance difference for buffered read IO.
I benchmarked this on my workstation and laptop.
I mmap'ed a buffer with 2 MiB alignment, MAP_ANONYMOUS | MAP_HUGETLB, and then
measured performance of reading 8192 bytes into the buffer at different
offsets. Each time I copied 16GiB in total. Within a program invocation I
benchmarked each offset 4 times, threw away the worst measurement, and
averaged the rest. Then used the best of three program invocations.
workstation with dual Xeon Gold 5215:
          turbo on   turbo off
offset    GiB/s      GiB/s
0         18.358     13.528
8         15.361     11.472
9         15.330     11.418
32        17.583     13.097
512       17.707     13.229
513       15.890     11.852
4096      18.176     13.568
8192      18.088     13.566
2MiB      18.658     13.496
laptop with i9-9880H:
          turbo on   turbo off
offset    GiB/s      GiB/s
0         33.589     17.160
8         28.045     14.301
9         27.582     14.318
32        31.797     16.711
512       32.215     16.810
513       28.864     14.932
4096      32.503     17.266
8192      32.871     17.277
2MiB      32.657     17.262
Seems pretty clear that using 4096 byte alignment is worth it.
Greetings,
Andres Freund
On 11/1/22 2:36 AM, Thomas Munro wrote:
Hi,
Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the
This is exciting to see! There's two other items to add to the TODO list
before this would be ready for production:
1) work_mem. This is a significant impediment to scaling shared buffers
the way you'd want to.
2) Clock sweep. Specifically, currently the only thing that drives
usage_count is individual backends running the clock hand. On large
systems with 75% of memory going to shared_buffers, that becomes a very
significant problem, especially when the backend running the clock sweep
is doing so in order to perform an operation like a b-tree page split. I
suspect it shouldn't be too hard to deal with this issue by just having
bgwriter or another bgworker proactively ensuring some reasonable number
of buffers with usage_count=0 exist.
One other thing to be aware of: overflowing an SLRU becomes a massive
problem if there isn't a filesystem backing the SLRU. Obviously only an
issue if you try and apply DIO to SLRU files.
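For readers less familiar with the clock sweep mentioned in point 2, a toy version of the victim search is sketched below. It is an illustration only, not PostgreSQL's buffer manager code: the SketchBuf struct, the function name and the maximum usage count of 5 are assumptions made for the example.

```c
#include <stddef.h>

#define SKETCH_MAX_USAGE 5      /* assumed cap on usage_count */

typedef struct
{
    int usage_count;            /* bumped on access, decayed by the sweep */
    int pinned;                 /* nonzero while some backend uses the buffer */
} SketchBuf;

/*
 * Advance the clock hand until an unpinned buffer with usage_count == 0 is
 * found, decrementing the usage_count of every unpinned buffer passed on the
 * way. Returns the victim's index, or -1 if no victim turned up within a
 * bounded number of steps (for example, because everything is pinned).
 */
int
sketch_clock_sweep(SketchBuf *bufs, int nbufs, int *hand)
{
    int tries_left = nbufs * (SKETCH_MAX_USAGE + 1);

    while (tries_left-- > 0)
    {
        int i = *hand;

        *hand = (*hand + 1) % nbufs;
        if (bufs[i].pinned)
            continue;
        if (bufs[i].usage_count == 0)
            return i;
        bufs[i].usage_count--;
    }
    return -1;
}
```

The cost Jim describes comes from the decrement loop: when most buffers carry a high usage_count, whichever backend needs a victim has to walk the hand around several times before anything reaches zero.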
Hi,
On 2022-11-04 14:47:31 -0500, Jim Nasby wrote:
On 11/1/22 2:36 AM, Thomas Munro wrote:
Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the
This is exciting to see! There's two other items to add to the TODO list
before this would be ready for production:
1) work_mem. This is a significant impediment to scaling shared buffers the
way you'd want to.
I don't really think that's closely enough related to tackle together. Yes,
it'd be easier to set a large s_b if we had better work_mem management, but
it's a completely distinct problem, and in a lot of cases you could use DIO
without tackling the work_mem issue.
2) Clock sweep. Specifically, currently the only thing that drives
usage_count is individual backends running the clock hand. On large systems
with 75% of memory going to shared_buffers, that becomes a very significant
problem, especially when the backend running the clock sweep is doing so in
order to perform an operation like a b-tree page split. I suspect it
shouldn't be too hard to deal with this issue by just having bgwriter or
another bgworker proactively ensuring some reasonable number of buffers with
usage_count=0 exist.
I agree this isn't great, but I don't think the replacement efficiency is that
big a problem. Replacing the wrong buffers is a bigger issue.
I've run tests with s_b=768GB (IIRC) without it showing up as a major
issue. If you have an extreme replacement rate at such a large s_b you have a
lot of other problems.
I don't want to discourage anybody from tackling the clock replacement issues,
quite the contrary, but AIO+DIO can show significant wins without those
changes. It's already a humongous project...
One other thing to be aware of: overflowing an SLRU becomes a massive
problem if there isn't a filesystem backing the SLRU. Obviously only an
issue if you try and apply DIO to SLRU files.
Which would be a very bad idea for now.... Thomas does have a patch for moving
them into the main buffer pool.
Greetings,
Andres Freund
On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Memory alignment patches:
Direct I/O generally needs to be done to/from VM page-aligned
addresses, but only "standard" 4KB pages, even when larger VM pages
are in use (if there is an exotic system where that isn't true, it
won't work). We need to deal with buffers on the stack, the heap and
in shmem. For the stack, see patch 0001. For the heap and shared
memory, see patch 0002, but David Rowley is going to propose that part
separately, as MemoryContext API adjustments are a specialised enough
topic to deserve another thread; here I include a copy as a
dependency. The main direct I/O patch is 0003.
One thing to note: Currently, a request to aset above 8kB must go into a
dedicated block. Not sure if it's a coincidence that that matches the
default PG page size, but if allocating pages on the heap is hot enough,
maybe we should consider raising that limit. Although then, aligned-to-4kB
requests would result in 16kB chunks requested unless a different allocator
was used.
--
John Naylor
EDB: http://www.enterprisedb.com
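The 16kB figure above follows from simple arithmetic, sketched here under two assumptions that are not spelled out in the thread: an aligned request over-allocates by the alignment so that an aligned address is guaranteed to exist inside the chunk, and aset rounds small chunk sizes up to the next power of two. Chunk header overhead is ignored, and the function names are invented.

```c
#include <stddef.h>

/* Smallest power of two that is >= n (n > 0). */
size_t
sketch_next_pow2(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/*
 * Assumed model: request size + alignto bytes from the underlying allocator,
 * which then rounds the request up to a power of two.
 */
size_t
sketch_aligned_chunk_size(size_t size, size_t alignto)
{
    return sketch_next_pow2(size + alignto);
}
```

Under this model an 8192-byte page aligned to 4096 bytes asks for 12288 bytes, which rounds up to 16384, so half of each such chunk would be slack if the dedicated-block limit were raised.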
Hi,
On 2022-11-10 14:26:20 +0700, John Naylor wrote:
On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Memory alignment patches:
Direct I/O generally needs to be done to/from VM page-aligned
addresses, but only "standard" 4KB pages, even when larger VM pages
are in use (if there is an exotic system where that isn't true, it
won't work). We need to deal with buffers on the stack, the heap and
in shmem. For the stack, see patch 0001. For the heap and shared
memory, see patch 0002, but David Rowley is going to propose that part
separately, as MemoryContext API adjustments are a specialised enough
topic to deserve another thread; here I include a copy as a
dependency. The main direct I/O patch is 0003.
One thing to note: Currently, a request to aset above 8kB must go into a
dedicated block. Not sure if it's a coincidence that that matches the
default PG page size, but if allocating pages on the heap is hot enough,
maybe we should consider raising that limit. Although then, aligned-to-4kB
requests would result in 16kB chunks requested unless a different allocator
was used.
With one exception, there's only a small number of places that allocate pages
dynamically and we only do it for a small number of buffers. So I don't think
we should worry too much about this for now.
The one exception to this: GetLocalBufferStorage(). But it already batches
memory allocations by increasing sizes, so I think we're good as well.
Greetings,
Andres Freund
On Wed, Nov 2, 2022 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:
On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation
You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings.
Done as io_direct=data,wal,wal_init. Thanks Justin, this is better.
I resisted the urge to invent a meaning for 'on' and 'off', mainly
because it's not clear what values 'on' should enable and it'd be
strange to have off without on, so for now an empty string means off.
I suppose the meaning of this string could evolve over time: the names
of forks, etc.
Perhaps we could use the guc assignment hook to transform the input value into
a bitmask?
Makes sense. The only tricky question was where to store the GUC. I
went for fd.c for now, but it doesn't seem quite right...
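As an illustration of turning a value such as "data,wal,wal_init" into a bitmask, here is a small standalone parser. It is a sketch under stated assumptions, not the code in the v2 patch: the SKETCH_ flag names and the function are hypothetical, and a real GUC hook would go through PostgreSQL's own list-splitting and error-reporting machinery instead of returning -1.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical bit assignments for the three options. */
#define SKETCH_IO_DIRECT_DATA     0x01
#define SKETCH_IO_DIRECT_WAL      0x02
#define SKETCH_IO_DIRECT_WAL_INIT 0x04

/*
 * Parse a comma-separated list into a bitmask. An empty string means "off"
 * and yields 0; an unrecognised item yields -1.
 */
int
sketch_parse_io_direct(const char *value)
{
    int flags = 0;
    const char *p = value;

    while (*p != '\0')
    {
        size_t len;

        p += strspn(p, ", ");       /* skip separators and spaces */
        if (*p == '\0')
            break;
        len = strcspn(p, ", ");     /* length of the next item */

        if (len == 4 && strncmp(p, "data", 4) == 0)
            flags |= SKETCH_IO_DIRECT_DATA;
        else if (len == 3 && strncmp(p, "wal", 3) == 0)
            flags |= SKETCH_IO_DIRECT_WAL;
        else if (len == 8 && strncmp(p, "wal_init", 8) == 0)
            flags |= SKETCH_IO_DIRECT_WAL_INIT;
        else
            return -1;
        p += len;
    }
    return flags;
}
```

Note that "on" and "off" are rejected here, matching the decision above not to invent a meaning for them.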
DIO is slower, but not so much that it can't run under CI. I suggest to
add an 099 commit to enable the feature during development.
Good idea, will do.
Done. The tests take 2-3x as long depending on the OS.
Might be worth to additionally have a short tap test that does some basic
stuff with DIO and leave that enabled? I think it'd be good to have
check-world exercise DIO on dev machines, to reduce the likelihood of finding
problems only in CI, which is somewhat painful.
Done.
Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment
Oh, so BufFile is palloc'd and contains one of these. BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd. I will think about what to do about that...
It might be worth having two different versions of the struct, so we don't
impose unnecessarily high alignment everywhere?
Done. I now have PGAlignedBlock (unchanged) and PGIOAlignedBlock.
You have to use the latter for SMgr, because I added alignment
assertions there. We might as well use it for any other I/O such as
frontend code too for a chance of a small performance boost as you
showed. For now I have not used PGIOAlignedBlock for BufFile, even
though it would be a great candidate for a potential speedup, only
because I am afraid of adding padding to every BufFile in scenarios
where we allocate many (could be avoided, a subject for separate
research).
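The trade-off between the two block types, including the padding concern for structs like BufFile, can be shown with a small C11 sketch. The type names below are stand-ins and not the definitions from the patch; the 4096-byte alignment and 8192-byte block size are taken from the discussion, and the use of C11 alignas is an assumption for portability of the example.

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ   8192
#define SKETCH_IO_ALIGN 4096

/* Ordinary block: aligned only as strictly as its widest scalar member. */
typedef union SketchAlignedBlock
{
    char    data[SKETCH_BLCKSZ];
    double  force_align_d;
    int64_t force_align_i64;
} SketchAlignedBlock;

/* I/O block: additionally forced to a 4096-byte boundary for direct I/O. */
typedef union SketchIOAlignedBlock
{
    alignas(SKETCH_IO_ALIGN) char data[SKETCH_BLCKSZ];
    double  force_align_d;
    int64_t force_align_i64;
} SketchIOAlignedBlock;

/* The kind of check an alignment assertion in the I/O path could make. */
bool
sketch_is_io_aligned(const void *p)
{
    return ((uintptr_t) p % SKETCH_IO_ALIGN) == 0;
}

/* Embedding an I/O-aligned block pads the enclosing struct heavily. */
struct SketchWithIOBlock
{
    int                  nfiles;
    SketchIOAlignedBlock buffer;
};
```

In struct SketchWithIOBlock the buffer member lands at offset 4096, so nearly a full page of padding follows the leading int, and any heap allocation of the struct must itself be 4096-byte aligned, which is what the sanitizer report about BufFile points at.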
V2 comprises:
0001 -- David's palloc_aligned() patch
https://commitfest.postgresql.org/41/3999/
0002 -- I/O-align almost all buffers used for I/O
0003 -- Add the GUCs
0004 -- Throwaway hack to make cfbot turn the GUCs on
Attachments:
v2-0001-Add-allocator-support-for-larger-allocation-align.patch (text/x-patch; charset=US-ASCII)
From 9af1dcc3ce36ce18e011183d5f2a97cdc07fe396 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 18 Oct 2022 09:47:45 -0700
Subject: [PATCH v2 1/4] Add allocator support for larger allocation alignment
& use for IO
---
src/backend/utils/cache/catcache.c | 5 +-
src/backend/utils/mmgr/Makefile | 1 +
src/backend/utils/mmgr/alignedalloc.c | 110 ++++++++++++++++++
src/backend/utils/mmgr/mcxt.c | 141 +++++++++++++++++++++--
src/backend/utils/mmgr/meson.build | 1 +
src/include/utils/memutils_internal.h | 13 ++-
src/include/utils/memutils_memorychunk.h | 2 +-
src/include/utils/palloc.h | 3 +
8 files changed, 263 insertions(+), 13 deletions(-)
create mode 100644 src/backend/utils/mmgr/alignedalloc.c
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 30ef0ba39c..9e635177c8 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -763,7 +763,6 @@ InitCatCache(int id,
{
CatCache *cp;
MemoryContext oldcxt;
- size_t sz;
int i;
/*
@@ -807,8 +806,8 @@ InitCatCache(int id,
*
* Note: we rely on zeroing to initialize all the dlist headers correctly
*/
- sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE;
- cp = (CatCache *) CACHELINEALIGN(palloc0(sz));
+ cp = (CatCache *) palloc_aligned(sizeof(CatCache), PG_CACHE_LINE_SIZE,
+ MCXT_ALLOC_ZERO);
cp->cc_bucket = palloc0(nbuckets * sizeof(dlist_head));
/*
diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index 3b4cfdbd52..dae3432c98 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -13,6 +13,7 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = \
+ alignedalloc.o \
aset.o \
dsa.o \
freepage.o \
diff --git a/src/backend/utils/mmgr/alignedalloc.c b/src/backend/utils/mmgr/alignedalloc.c
new file mode 100644
index 0000000000..97cb1d2b0d
--- /dev/null
+++ b/src/backend/utils/mmgr/alignedalloc.c
@@ -0,0 +1,110 @@
+/*-------------------------------------------------------------------------
+ *
+ * alignedalloc.c
+ * Allocator functions to implement palloc_aligned
+ *
+ * This is not a fully fledged MemoryContext type as there is no means to
+ * create a MemoryContext of this type. The code here only serves to allow
+ * operations such as pfree() and repalloc() to work correctly on a memory
+ * chunk that was allocated by palloc_aligned().
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/mmgr/alignedalloc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/memdebug.h"
+#include "utils/memutils_memorychunk.h"
+
+void
+AlignedAllocFree(void *pointer)
+{
+ MemoryChunk *chunk = PointerGetMemoryChunk(pointer);
+ void *unaligned;
+
+#ifdef MEMORY_CONTEXT_CHECKING
+ /*
+ * Test for someone scribbling on unused space in chunk. We don't have
+ * the ability to include the context name here, so just mention that it's
+ * an aligned chunk.
+ */
+ if (!sentinel_ok(pointer, chunk->requested_size))
+ elog(WARNING, "detected write past %zu-byte aligned chunk end at %p",
+ MemoryChunkGetValue(chunk), chunk);
+#endif
+
+ Assert(!MemoryChunkIsExternal(chunk));
+
+ /* obtain the original (unaligned) allocated pointer */
+ unaligned = MemoryChunkGetBlock(chunk);
+
+ pfree(unaligned);
+}
+
+void *
+AlignedAllocRealloc(void *pointer, Size size)
+{
+ MemoryChunk *redirchunk = PointerGetMemoryChunk(pointer);
+ Size alignto = MemoryChunkGetValue(redirchunk);
+ void *unaligned = MemoryChunkGetBlock(redirchunk);
+ MemoryContext ctx;
+ Size old_size;
+ void *newptr;
+
+ /* sanity check this is a power of 2 value */
+ Assert((alignto & (alignto - 1)) == 0);
+
+ /*
+ * Determine the size of the original allocation. We can't determine this
+ * exactly as GetMemoryChunkSpace() returns the total space used for the
+ * allocation, which for contexts like aset includes rounding up to the
+ * next power of 2. However, this value is just used to memcpy() the old
+ * data into the new allocation, so we only need to concern ourselves with
+ * not reading beyond the end of the original allocation's memory. The
+ * drawback here is that we may copy more bytes than we need to, which
+ * amounts only to wasted effort.
+ */
+#ifndef MEMORY_CONTEXT_CHECKING
+ old_size = GetMemoryChunkSpace(unaligned) -
+ ((char *) pointer - (char *) PointerGetMemoryChunk(unaligned));
+#else
+ old_size = redirchunk->requested_size;
+#endif
+
+ ctx = GetMemoryChunkContext(unaligned);
+ newptr = MemoryContextAllocAligned(ctx, size, alignto, 0);
+
+ /*
+ * We may memcpy beyond the end of the original allocation request size, so
+ * we must mark the entire allocation as defined.
+ */
+ VALGRIND_MAKE_MEM_DEFINED(pointer, old_size);
+ memcpy(newptr, pointer, Min(size, old_size));
+ pfree(unaligned);
+
+ return newptr;
+}
+
+MemoryContext
+AlignedAllocGetChunkContext(void *pointer)
+{
+ MemoryChunk *chunk = PointerGetMemoryChunk(pointer);
+
+ Assert(!MemoryChunkIsExternal(chunk));
+
+ return GetMemoryChunkContext(MemoryChunkGetBlock(chunk));
+}
+
+Size
+AlignedGetChunkSpace(void *pointer)
+{
+ MemoryChunk *redirchunk = PointerGetMemoryChunk(pointer);
+ void *unaligned = MemoryChunkGetBlock(redirchunk);
+
+ return GetMemoryChunkSpace(unaligned);
+}
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 57bd6690ca..c1e3e88b49 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -30,6 +30,7 @@
#include "utils/memdebug.h"
#include "utils/memutils.h"
#include "utils/memutils_internal.h"
+#include "utils/memutils_memorychunk.h"
static void BogusFree(void *pointer);
@@ -84,6 +85,21 @@ static const MemoryContextMethods mcxt_methods[] = {
[MCTX_SLAB_ID].check = SlabCheck,
#endif
+ /* alignedalloc.c */
+ [MCTX_ALIGNED_REDIRECT_ID].alloc = NULL, /* not required */
+ [MCTX_ALIGNED_REDIRECT_ID].free_p = AlignedAllocFree,
+ [MCTX_ALIGNED_REDIRECT_ID].realloc = AlignedAllocRealloc,
+ [MCTX_ALIGNED_REDIRECT_ID].reset = NULL, /* not required */
+ [MCTX_ALIGNED_REDIRECT_ID].delete_context = NULL, /* not required */
+ [MCTX_ALIGNED_REDIRECT_ID].get_chunk_context = AlignedAllocGetChunkContext,
+ [MCTX_ALIGNED_REDIRECT_ID].get_chunk_space = AlignedGetChunkSpace,
+ [MCTX_ALIGNED_REDIRECT_ID].is_empty = NULL, /* not required */
+ [MCTX_ALIGNED_REDIRECT_ID].stats = NULL, /* not required */
+#ifdef MEMORY_CONTEXT_CHECKING
+ [MCTX_ALIGNED_REDIRECT_ID].check = NULL, /* not required */
+#endif
+
+
/*
* Unused (as yet) IDs should have dummy entries here. This allows us to
* fail cleanly if a bogus pointer is passed to pfree or the like. It
@@ -110,11 +126,6 @@ static const MemoryContextMethods mcxt_methods[] = {
[MCTX_UNUSED4_ID].realloc = BogusRealloc,
[MCTX_UNUSED4_ID].get_chunk_context = BogusGetChunkContext,
[MCTX_UNUSED4_ID].get_chunk_space = BogusGetChunkSpace,
-
- [MCTX_UNUSED5_ID].free_p = BogusFree,
- [MCTX_UNUSED5_ID].realloc = BogusRealloc,
- [MCTX_UNUSED5_ID].get_chunk_context = BogusGetChunkContext,
- [MCTX_UNUSED5_ID].get_chunk_space = BogusGetChunkSpace,
};
/*
@@ -1298,6 +1309,111 @@ palloc_extended(Size size, int flags)
return ret;
}
+/*
+ * MemoryContextAllocAligned
+ * Allocate 'size' bytes of memory in 'context' aligned to 'alignto'
+ * bytes.
+ *
+ * 'alignto' must be a power of 2.
+ * 'flags' may be 0 or set the same as MemoryContextAllocExtended().
+ */
+void *
+MemoryContextAllocAligned(MemoryContext context,
+ Size size, Size alignto, int flags)
+{
+ MemoryChunk *alignedchunk;
+ Size alloc_size;
+ void *unaligned;
+ void *aligned;
+
+ /* wouldn't make much sense to waste that much space */
+ Assert(alignto < (128 * 1024 * 1024));
+
+ /* ensure alignto is a power of 2 */
+ Assert((alignto & (alignto - 1)) == 0);
+
+ /*
+ * If the alignment requirements are less than what we already guarantee
+ * then just use the standard allocation function.
+ */
+ if (unlikely(alignto <= MAXIMUM_ALIGNOF))
+ return MemoryContextAllocExtended(context, size, flags);
+
+ /*
+ * We implement aligned pointers by simply allocating enough memory for
+ * the requested size plus the alignment and an additional "redirection"
+ * MemoryChunk. This additional MemoryChunk is required for operations
+ * such as pfree when used on the pointer returned by this function. We
+ * use this redirection MemoryChunk in order to find the pointer to the
+ * memory that was returned by the MemoryContextAllocExtended call below.
+ * We do that by "borrowing" the block offset field and instead of using
+ * that to find the offset into the owning block, we use it to find the
+ * original allocated address.
+ *
+ * Here we must allocate enough extra memory so that we can still align
+ * the pointer returned by MemoryContextAllocExtended and also have enough
+ * space for the redirection MemoryChunk. Since allocations will already
+ * be at least aligned by MAXIMUM_ALIGNOF, we can subtract that amount
+ * from the allocation size to save a little memory.
+ */
+ alloc_size = size + alignto + sizeof(MemoryChunk) - MAXIMUM_ALIGNOF;
+
+#ifdef MEMORY_CONTEXT_CHECKING
+ /* ensure there's space for a sentinel byte */
+ alloc_size += 1;
+#endif
+
+ /* perform the actual allocation */
+ unaligned = MemoryContextAllocExtended(context, alloc_size, flags);
+
+ /* set the aligned pointer */
+ aligned = (void *) TYPEALIGN(alignto, (char *) unaligned +
+ sizeof(MemoryChunk));
+
+ alignedchunk = PointerGetMemoryChunk(aligned);
+
+ /*
+ * We set the redirect MemoryChunk so that the block offset calculation is
+ * used to point back to the 'unaligned' allocated chunk. This allows us
+ * to use MemoryChunkGetBlock() to find the unaligned chunk when we need
+ * to perform operations such as pfree() and repalloc().
+ *
+ * We store 'alignto' in the MemoryChunk's 'value' so that we know what
+ * the alignment was set to should we ever be asked to realloc this
+ * pointer.
+ */
+ MemoryChunkSetHdrMask(alignedchunk, unaligned, alignto,
+ MCTX_ALIGNED_REDIRECT_ID);
+
+ /* double check we produced a correctly aligned pointer */
+ Assert((char *) TYPEALIGN(alignto, aligned) == aligned);
+
+#ifdef MEMORY_CONTEXT_CHECKING
+ alignedchunk->requested_size = size;
+ /* set mark to catch clobber of "unused" space */
+ set_sentinel(aligned, size);
+#endif
+
+ /* Mark the bytes before the redirection header as noaccess */
+ VALGRIND_MAKE_MEM_NOACCESS(unaligned,
+ (char *) alignedchunk - (char *) unaligned);
+ return aligned;
+}
+
+/*
+ * palloc_aligned
+ * Allocate 'size' bytes returning a pointer that's aligned to the
+ * 'alignto' boundary.
+ *
+ * 'alignto' must be a power of 2.
+ * 'flags' may be 0 or set the same as MemoryContextAllocExtended().
+ */
+void *
+palloc_aligned(Size size, Size alignto, int flags)
+{
+ return MemoryContextAllocAligned(CurrentMemoryContext, size, alignto, flags);
+}
+
/*
* pfree
* Release an allocated chunk.
@@ -1306,11 +1422,16 @@ void
pfree(void *pointer)
{
#ifdef USE_VALGRIND
+ MemoryContextMethodID method = GetMemoryChunkMethodID(pointer);
MemoryContext context = GetMemoryChunkContext(pointer);
#endif
MCXT_METHOD(pointer, free_p) (pointer);
- VALGRIND_MEMPOOL_FREE(context, pointer);
+
+#ifdef USE_VALGRIND
+ if (method != MCTX_ALIGNED_REDIRECT_ID)
+ VALGRIND_MEMPOOL_FREE(context, pointer);
+#endif
}
/*
@@ -1320,6 +1441,9 @@ pfree(void *pointer)
void *
repalloc(void *pointer, Size size)
{
+#ifdef USE_VALGRIND
+ MemoryContextMethodID method = GetMemoryChunkMethodID(pointer);
+#endif
#if defined(USE_ASSERT_CHECKING) || defined(USE_VALGRIND)
MemoryContext context = GetMemoryChunkContext(pointer);
#endif
@@ -1346,7 +1470,10 @@ repalloc(void *pointer, Size size)
size, cxt->name)));
}
- VALGRIND_MEMPOOL_CHANGE(context, pointer, ret, size);
+#ifdef USE_VALGRIND
+ if (method != MCTX_ALIGNED_REDIRECT_ID)
+ VALGRIND_MEMPOOL_CHANGE(context, pointer, ret, size);
+#endif
return ret;
}
diff --git a/src/backend/utils/mmgr/meson.build b/src/backend/utils/mmgr/meson.build
index 641bb181ba..7cf4d6dcc8 100644
--- a/src/backend/utils/mmgr/meson.build
+++ b/src/backend/utils/mmgr/meson.build
@@ -1,4 +1,5 @@
backend_sources += files(
+ 'alignedalloc.c',
'aset.c',
'dsa.c',
'freepage.c',
diff --git a/src/include/utils/memutils_internal.h b/src/include/utils/memutils_internal.h
index bc2cbdd506..450bcba3ed 100644
--- a/src/include/utils/memutils_internal.h
+++ b/src/include/utils/memutils_internal.h
@@ -70,6 +70,15 @@ extern void SlabStats(MemoryContext context,
extern void SlabCheck(MemoryContext context);
#endif
+/*
+ * These functions support the implementation of palloc_aligned() and are not
+ * part of a fully-fledged MemoryContext type.
+ */
+extern void AlignedAllocFree(void *pointer);
+extern void *AlignedAllocRealloc(void *pointer, Size size);
+extern MemoryContext AlignedAllocGetChunkContext(void *pointer);
+extern Size AlignedGetChunkSpace(void *pointer);
+
/*
* MemoryContextMethodID
* A unique identifier for each MemoryContext implementation which
@@ -92,8 +101,8 @@ typedef enum MemoryContextMethodID
MCTX_ASET_ID,
MCTX_GENERATION_ID,
MCTX_SLAB_ID,
- MCTX_UNUSED4_ID, /* available */
- MCTX_UNUSED5_ID /* 111 occurs in wipe_mem'd memory */
+ MCTX_ALIGNED_REDIRECT_ID,
+ MCTX_UNUSED4_ID /* 111 occurs in wipe_mem'd memory */
} MemoryContextMethodID;
/*
diff --git a/src/include/utils/memutils_memorychunk.h b/src/include/utils/memutils_memorychunk.h
index 2eefc138e3..38702efc58 100644
--- a/src/include/utils/memutils_memorychunk.h
+++ b/src/include/utils/memutils_memorychunk.h
@@ -156,7 +156,7 @@ MemoryChunkSetHdrMask(MemoryChunk *chunk, void *block,
{
Size blockoffset = (char *) chunk - (char *) block;
- Assert((char *) chunk > (char *) block);
+ Assert((char *) chunk >= (char *) block);
Assert(blockoffset <= MEMORYCHUNK_MAX_BLOCKOFFSET);
Assert(value <= MEMORYCHUNK_MAX_VALUE);
Assert((int) methodid <= MEMORY_CONTEXT_METHODID_MASK);
diff --git a/src/include/utils/palloc.h b/src/include/utils/palloc.h
index 72d4e70dc6..b1ac63b2ee 100644
--- a/src/include/utils/palloc.h
+++ b/src/include/utils/palloc.h
@@ -73,10 +73,13 @@ extern void *MemoryContextAllocZero(MemoryContext context, Size size);
extern void *MemoryContextAllocZeroAligned(MemoryContext context, Size size);
extern void *MemoryContextAllocExtended(MemoryContext context,
Size size, int flags);
+extern void *MemoryContextAllocAligned(MemoryContext context,
+ Size size, Size alignto, int flags);
extern void *palloc(Size size);
extern void *palloc0(Size size);
extern void *palloc_extended(Size size, int flags);
+extern void *palloc_aligned(Size size, Size alignto, int flags);
extern pg_nodiscard void *repalloc(void *pointer, Size size);
extern pg_nodiscard void *repalloc_extended(void *pointer,
Size size, int flags);
--
2.35.1
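
For readers skimming the patch above, here is a minimal standalone sketch of the over-allocate-and-redirect idea behind MemoryContextAllocAligned(). It is not the patch's code: it uses plain malloc() instead of a MemoryContext, and RedirectHeader, sketch_alloc_aligned(), sketch_alignment_of() and sketch_free_aligned() are hypothetical names invented for illustration. Unlike the patch, it does not subtract MAXIMUM_ALIGNOF from the allocation size, so it wastes a few more bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for the redirection MemoryChunk: it remembers where
 * the underlying allocation starts and which alignment was requested.
 */
typedef struct RedirectHeader
{
	void	   *unaligned;
	size_t		alignto;
} RedirectHeader;

/* Round ptr up to the next multiple of align (align must be a power of 2). */
#define ALIGN_UP(align, ptr) \
	(((uintptr_t) (ptr) + ((align) - 1)) & ~((uintptr_t) (align) - 1))

static void *
sketch_alloc_aligned(size_t size, size_t alignto)
{
	char	   *unaligned;
	char	   *aligned;
	RedirectHeader hdr;

	/* alignto must be a non-zero power of 2 */
	assert(alignto != 0 && (alignto & (alignto - 1)) == 0);

	/* room for the payload, worst-case padding, and the header */
	unaligned = malloc(size + alignto + sizeof(RedirectHeader));
	if (unaligned == NULL)
		return NULL;

	/* leave space for the header, then round up to the alignment */
	aligned = (char *) ALIGN_UP(alignto, unaligned + sizeof(RedirectHeader));

	/* store the header immediately before the aligned pointer */
	hdr.unaligned = unaligned;
	hdr.alignto = alignto;
	memcpy(aligned - sizeof(RedirectHeader), &hdr, sizeof(hdr));

	return aligned;
}

/* Recover the alignment that was requested, as realloc would need to. */
static size_t
sketch_alignment_of(const void *aligned)
{
	RedirectHeader hdr;

	memcpy(&hdr, (const char *) aligned - sizeof(hdr), sizeof(hdr));
	return hdr.alignto;
}

/* Free via the header, since 'aligned' is not what malloc() returned. */
static void
sketch_free_aligned(void *aligned)
{
	RedirectHeader hdr;

	memcpy(&hdr, (char *) aligned - sizeof(hdr), sizeof(hdr));
	free(hdr.unaligned);
}
```

The point of the header is the same as in the patch: the caller only ever sees the aligned pointer, so freeing or reallocating has to find its way back to the address the underlying allocator handed out.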
Attachment: v2-0002-Introduce-PG_IO_ALIGN_SIZE-and-align-all-I-O-buff.patch (text/x-patch; charset=US-ASCII)
From caa6cbeb3b3f86c48c90513ee184aca500b1f703 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:25:59 +1300
Subject: [PATCH v2 2/4] Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.
In order to be allowed to use O_DIRECT in a later commit, we need to
align buffers to the virtual memory page size. O_DIRECT would either
fail to work or fail to work efficiently without that on various
platforms. Even without O_DIRECT, aligning on memory pages improves
traditional buffered I/O performance.
The alignment size is set to 4096, which is enough for currently known
systems. There is no standard governing the requirements for O_DIRECT so
it's possible we might have to reconsider this approach or fail to work
on some exotic system, but for now this simplistic approach works and
it can be changed at compile time.
Adjust all call sites that allocate heap memory for file I/O to use the
new palloc_aligned() or MemoryContextAllocAligned() functions. For
stack-allocated buffers, introduce PGIOAlignedBlock to respect
PG_IO_ALIGN_SIZE, if possible with this compiler. Also align the main
buffer pool in shared memory.
If arbitrary alignment of stack objects is not possible with this
compiler, then completely disable the use of O_DIRECT by setting
PG_O_DIRECT to 0. (This avoids the need to consider systems that have
O_DIRECT but don't have a compiler with an extension that can align
stack objects the way we want; that could be done but we don't currently
know of any such system, so it's easier to pretend there is no O_DIRECT
support instead: that's an existing and tested class of system.)
Add assertions that all buffers passed into smgrread(), smgrwrite(),
smgrextend() are correctly aligned, if PG_O_DIRECT isn't 0.
Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
---
contrib/bloom/blinsert.c | 2 +-
contrib/pg_prewarm/pg_prewarm.c | 2 +-
src/backend/access/gist/gistbuild.c | 9 +++---
src/backend/access/hash/hashpage.c | 2 +-
src/backend/access/heap/rewriteheap.c | 2 +-
src/backend/access/heap/visibilitymap.c | 2 +-
src/backend/access/nbtree/nbtree.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 8 ++++--
src/backend/access/spgist/spginsert.c | 2 +-
src/backend/access/transam/generic_xlog.c | 13 ++++++---
src/backend/access/transam/xlog.c | 9 +++---
src/backend/catalog/storage.c | 2 +-
src/backend/storage/buffer/buf_init.c | 10 +++++--
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/buffer/localbuf.c | 7 +++--
src/backend/storage/file/buffile.c | 6 ++++
src/backend/storage/freespace/freespace.c | 2 +-
src/backend/storage/page/bufpage.c | 5 +++-
src/backend/storage/smgr/md.c | 15 +++++++++-
src/backend/utils/sort/logtape.c | 2 +-
src/bin/pg_checksums/pg_checksums.c | 2 +-
src/bin/pg_rewind/local_source.c | 4 +--
src/bin/pg_upgrade/file.c | 4 +--
src/common/file_utils.c | 2 +-
src/include/c.h | 34 +++++++++++++++++------
src/include/pg_config_manual.h | 7 +++++
src/include/storage/fd.h | 5 ++--
src/tools/pgindent/typedefs.list | 1 +
28 files changed, 114 insertions(+), 49 deletions(-)
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index dd26d6ac29..53cc617a66 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -166,7 +166,7 @@ blbuildempty(Relation index)
Page metapage;
/* Construct metapage. */
- metapage = (Page) palloc(BLCKSZ);
+ metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
BloomFillMetapage(index, metapage);
/*
diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index caff5c4a80..f50aa69eb2 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -36,7 +36,7 @@ typedef enum
PREWARM_BUFFER
} PrewarmType;
-static PGAlignedBlock blockbuffer;
+static PGIOAlignedBlock blockbuffer;
/*
* pg_prewarm(regclass, mode text, fork text,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index fb0f466708..d3d7d836e9 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -415,7 +415,7 @@ gist_indexsortbuild(GISTBuildState *state)
* Write an empty page as a placeholder for the root page. It will be
* replaced with the real root page at the end.
*/
- page = palloc0(BLCKSZ);
+ page = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
page, true);
state->pages_allocated++;
@@ -509,7 +509,8 @@ gist_indexsortbuild_levelstate_add(GISTBuildState *state,
levelstate->current_page++;
if (levelstate->pages[levelstate->current_page] == NULL)
- levelstate->pages[levelstate->current_page] = palloc(BLCKSZ);
+ levelstate->pages[levelstate->current_page] =
+ palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
newPage = levelstate->pages[levelstate->current_page];
gistinitpage(newPage, old_page_flags);
@@ -579,7 +580,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
/* Create page and copy data */
data = (char *) (dist->list);
- target = palloc0(BLCKSZ);
+ target = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
gistinitpage(target, isleaf ? F_LEAF : 0);
for (int i = 0; i < dist->block.num; i++)
{
@@ -630,7 +631,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
if (parent == NULL)
{
parent = palloc0(sizeof(GistSortedBuildLevelState));
- parent->pages[0] = (Page) palloc(BLCKSZ);
+ parent->pages[0] = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
parent->parent = NULL;
gistinitpage(parent->pages[0], 0);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 55b2929ad5..147af95e92 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -992,7 +992,7 @@ static bool
_hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
{
BlockNumber lastblock;
- PGAlignedBlock zerobuf;
+ PGIOAlignedBlock zerobuf;
Page page;
HashPageOpaque ovflopaque;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2fe9e48e50..23d966940e 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -255,7 +255,7 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
state->rs_old_rel = old_heap;
state->rs_new_rel = new_heap;
- state->rs_buffer = (Page) palloc(BLCKSZ);
+ state->rs_buffer = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4ed70275e2..3bd65b275b 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -620,7 +620,7 @@ static void
vm_extend(Relation rel, BlockNumber vm_nblocks)
{
BlockNumber vm_nblocks_now;
- PGAlignedBlock pg;
+ PGIOAlignedBlock pg;
SMgrRelation reln;
PageInit((Page) pg.data, BLCKSZ, 0);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b52eca8f38..e8ac7390ae 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -153,7 +153,7 @@ btbuildempty(Relation index)
Page metapage;
/* Construct metapage. */
- metapage = (Page) palloc(BLCKSZ);
+ metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 501e011ce1..5e3c461f6f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -619,7 +619,7 @@ _bt_blnewpage(uint32 level)
Page page;
BTPageOpaque opaque;
- page = (Page) palloc(BLCKSZ);
+ page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
/* Zero the page and set up standard page header info */
_bt_pageinit(page, BLCKSZ);
@@ -660,7 +660,9 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
while (blkno > wstate->btws_pages_written)
{
if (!wstate->btws_zeropage)
- wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
+ wstate->btws_zeropage = (Page) palloc_aligned(BLCKSZ,
+ PG_IO_ALIGN_SIZE,
+ MCXT_ALLOC_ZERO);
/* don't set checksum for all-zero page */
smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
wstate->btws_pages_written++,
@@ -1170,7 +1172,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* set to point to "P_NONE"). This changes the index to the "valid" state
* by filling in a valid magic number in the metapage.
*/
- metapage = (Page) palloc(BLCKSZ);
+ metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
_bt_initmetapage(metapage, rootblkno, rootlevel,
wstate->inskey->allequalimage);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index c6821b5952..6f14b41329 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -158,7 +158,7 @@ spgbuildempty(Relation index)
Page page;
/* Construct metapage. */
- page = (Page) palloc(BLCKSZ);
+ page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
SpGistInitMetapage(page);
/*
diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 6db9a1fca1..458e270d55 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -58,14 +58,17 @@ typedef struct
char delta[MAX_DELTA_SIZE]; /* delta between page images */
} PageData;
-/* State of generic xlog record construction */
+/*
+ * State of generic xlog record construction. Must be allocated at an I/O
+ * aligned address.
+ */
struct GenericXLogState
{
+ /* Page images (properly aligned, must be first) */
+ PGIOAlignedBlock images[MAX_GENERIC_XLOG_PAGES];
/* Info about each page, see above */
PageData pages[MAX_GENERIC_XLOG_PAGES];
bool isLogged;
- /* Page images (properly aligned) */
- PGAlignedBlock images[MAX_GENERIC_XLOG_PAGES];
};
static void writeFragment(PageData *pageData, OffsetNumber offset,
@@ -269,7 +272,9 @@ GenericXLogStart(Relation relation)
GenericXLogState *state;
int i;
- state = (GenericXLogState *) palloc(sizeof(GenericXLogState));
+ state = (GenericXLogState *) palloc_aligned(sizeof(GenericXLogState),
+ PG_IO_ALIGN_SIZE,
+ 0);
state->isLogged = RelationNeedsWAL(relation);
for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a31fbbff78..a75c6813a4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4511,7 +4511,7 @@ XLOGShmemSize(void)
/* xlblocks array */
size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
/* extra alignment padding for XLOG I/O buffers */
- size = add_size(size, XLOG_BLCKSZ);
+ size = add_size(size, Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE));
/* and the buffers themselves */
size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
@@ -4608,10 +4608,11 @@ XLOGShmemInit(void)
/*
* Align the start of the page buffers to a full xlog block size boundary.
- * This simplifies some calculations in XLOG insertion. It is also
- * required for O_DIRECT.
+ * This simplifies some calculations in XLOG insertion. We also need I/O
+ * alignment for O_DIRECT, but that's also a power of two and usually
+ * smaller. Take the larger of the two alignment requirements.
*/
- allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
+ allocptr = (char *) TYPEALIGN(Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE), allocptr);
XLogCtl->pages = allocptr;
memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d708af19ed..0c5ac1f94b 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -451,7 +451,7 @@ void
RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
ForkNumber forkNum, char relpersistence)
{
- PGAlignedBlock buf;
+ PGIOAlignedBlock buf;
Page page;
bool use_wal;
bool copying_initfork;
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 6b6264854e..76a30d44b7 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -78,9 +78,12 @@ InitBufferPool(void)
NBuffers * sizeof(BufferDescPadded),
&foundDescs);
+ /* Align buffer pool on IO page size boundary. */
BufferBlocks = (char *)
- ShmemInitStruct("Buffer Blocks",
- NBuffers * (Size) BLCKSZ, &foundBufs);
+ TYPEALIGN(PG_IO_ALIGN_SIZE,
+ ShmemInitStruct("Buffer Blocks",
+ NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+ &foundBufs));
/* Align condition variables to cacheline boundary. */
BufferIOCVArray = (ConditionVariableMinimallyPadded *)
@@ -163,7 +166,8 @@ BufferShmemSize(void)
/* to allow aligning buffer descriptors */
size = add_size(size, PG_CACHE_LINE_SIZE);
- /* size of data pages */
+ /* size of data pages, plus alignment padding */
+ size = add_size(size, PG_IO_ALIGN_SIZE);
size = add_size(size, mul_size(NBuffers, BLCKSZ));
/* size of stuff controlled by freelist.c */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 73d30bf619..aba07e94c9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3717,7 +3717,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
bool use_wal;
BlockNumber nblocks;
BlockNumber blkno;
- PGAlignedBlock buf;
+ PGIOAlignedBlock buf;
BufferAccessStrategy bstrategy_src;
BufferAccessStrategy bstrategy_dst;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..735f7c6018 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -546,8 +546,11 @@ GetLocalBufferStorage(void)
/* And don't overflow MaxAllocSize, either */
num_bufs = Min(num_bufs, MaxAllocSize / BLCKSZ);
- cur_block = (char *) MemoryContextAlloc(LocalBufferContext,
- num_bufs * BLCKSZ);
+ /* Buffers should be I/O aligned. */
+ cur_block = (char *)
+ TYPEALIGN(PG_IO_ALIGN_SIZE,
+ MemoryContextAlloc(LocalBufferContext,
+ num_bufs * BLCKSZ + PG_IO_ALIGN_SIZE));
next_buf_in_block = 0;
num_bufs_in_block = num_bufs;
}
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index b0b4eeb3bd..2261c3ebe3 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -95,6 +95,12 @@ struct BufFile
off_t curOffset; /* offset part of current pos */
int pos; /* next read/write position in buffer */
int nbytes; /* total # of valid bytes in buffer */
+
+ /*
+ * XXX Should ideally use PGIOAlignedBlock, but might need a way to avoid
+ * wasting per-file alignment padding when some users create many
+ * files.
+ */
PGAlignedBlock buffer;
};
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index a6b0533103..7230d538fd 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -608,7 +608,7 @@ static void
fsm_extend(Relation rel, BlockNumber fsm_nblocks)
{
BlockNumber fsm_nblocks_now;
- PGAlignedBlock pg;
+ PGIOAlignedBlock pg;
SMgrRelation reln;
PageInit((Page) pg.data, BLCKSZ, 0);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 8b617c7e79..0728ce30c0 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1522,7 +1522,10 @@ PageSetChecksumCopy(Page page, BlockNumber blkno)
* and second to avoid wasting space in processes that never call this.
*/
if (pageCopy == NULL)
- pageCopy = MemoryContextAlloc(TopMemoryContext, BLCKSZ);
+ pageCopy = MemoryContextAllocAligned(TopMemoryContext,
+ BLCKSZ,
+ PG_IO_ALIGN_SIZE,
+ 0);
memcpy(pageCopy, (char *) page, BLCKSZ);
((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 14b6fa0fd9..3e034afdf1 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -453,6 +453,10 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
int nbytes;
MdfdVec *v;
+ /* If this build supports direct I/O, the buffer must be I/O aligned. */
+ if (PG_O_DIRECT != 0)
+ Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
/* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
Assert(blocknum >= mdnblocks(reln, forknum));
@@ -675,6 +679,10 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
int nbytes;
MdfdVec *v;
+ /* If this build supports direct I/O, the buffer must be I/O aligned. */
+ if (PG_O_DIRECT != 0)
+ Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
reln->smgr_rlocator.locator.spcOid,
reln->smgr_rlocator.locator.dbOid,
@@ -740,6 +748,10 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
int nbytes;
MdfdVec *v;
+ /* If this build supports direct I/O, the buffer must be I/O aligned. */
+ if (PG_O_DIRECT != 0)
+ Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
/* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
Assert(blocknum < mdnblocks(reln, forknum));
@@ -1294,7 +1306,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
*/
if (nblocks < ((BlockNumber) RELSEG_SIZE))
{
- char *zerobuf = palloc0(BLCKSZ);
+ char *zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE,
+ MCXT_ALLOC_ZERO);
mdextend(reln, forknum,
nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index c384f98e13..6ba5030a5f 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -252,7 +252,7 @@ ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
*/
while (blocknum > lts->nBlocksWritten)
{
- PGAlignedBlock zerobuf;
+ PGIOAlignedBlock zerobuf;
MemSet(zerobuf.data, 0, sizeof(zerobuf));
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 7f3d5fc040..be6bae2b78 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -183,7 +183,7 @@ skipfile(const char *fn)
static void
scan_file(const char *fn, int segmentno)
{
- PGAlignedBlock buf;
+ PGIOAlignedBlock buf;
PageHeader header = (PageHeader) buf.data;
int f;
BlockNumber blockno;
diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c
index 2e50485c39..83b37a1e91 100644
--- a/src/bin/pg_rewind/local_source.c
+++ b/src/bin/pg_rewind/local_source.c
@@ -77,7 +77,7 @@ static void
local_queue_fetch_file(rewind_source *source, const char *path, size_t len)
{
const char *datadir = ((local_source *) source)->datadir;
- PGAlignedBlock buf;
+ PGIOAlignedBlock buf;
char srcpath[MAXPGPATH];
int srcfd;
size_t written_len;
@@ -129,7 +129,7 @@ local_queue_fetch_range(rewind_source *source, const char *path, off_t off,
size_t len)
{
const char *datadir = ((local_source *) source)->datadir;
- PGAlignedBlock buf;
+ PGIOAlignedBlock buf;
char srcpath[MAXPGPATH];
int srcfd;
off_t begin = off;
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 079fbda838..b5809236f6 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -178,8 +178,8 @@ rewriteVisibilityMap(const char *fromfile, const char *tofile,
{
int src_fd;
int dst_fd;
- PGAlignedBlock buffer;
- PGAlignedBlock new_vmbuf;
+ PGIOAlignedBlock buffer;
+ PGIOAlignedBlock new_vmbuf;
ssize_t totalBytesRead = 0;
ssize_t src_filesize;
int rewriteVmBytesPerPage;
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index d8507d88a5..83ef0609a2 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -539,7 +539,7 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
ssize_t
pg_pwrite_zeros(int fd, size_t size)
{
- PGAlignedBlock zbuffer; /* worth BLCKSZ */
+ PGIOAlignedBlock zbuffer; /* worth BLCKSZ */
size_t zbuffer_sz;
struct iovec iov[PG_IOV_MAX];
int blocks;
diff --git a/src/include/c.h b/src/include/c.h
index 98cdd285dd..9df92fb40e 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -1068,14 +1068,11 @@ extern void ExceptionalCondition(const char *conditionName,
/*
* Use this, not "char buf[BLCKSZ]", to declare a field or local variable
- * holding a page buffer, if that page might be accessed as a page and not
- * just a string of bytes. Otherwise the variable might be under-aligned,
- * causing problems on alignment-picky hardware. (In some places, we use
- * this to declare buffers even though we only pass them to read() and
- * write(), because copying to/from aligned buffers is usually faster than
- * using unaligned buffers.) We include both "double" and "int64" in the
- * union to ensure that the compiler knows the value must be MAXALIGN'ed
- * (cf. configure's computation of MAXIMUM_ALIGNOF).
+ * holding a page buffer, if that page might be accessed as a page. Otherwise
+ * the variable might be under-aligned, causing problems on alignment-picky
+ * hardware. We include both "double" and "int64" in the union to ensure that
+ * the compiler knows the value must be MAXALIGN'ed (cf. configure's
+ * computation of MAXIMUM_ALIGNOF).
*/
typedef union PGAlignedBlock
{
@@ -1084,9 +1081,30 @@ typedef union PGAlignedBlock
int64 force_align_i64;
} PGAlignedBlock;
+/*
+ * Use this to declare a field or local variable holding a page buffer, if that
+ * page might be accessed as a page or passed to an SMgr I/O function. If
+ * allocating using the MemoryContext API, the aligned allocation functions
+ * should be used with PG_IO_ALIGN_SIZE. This alignment may be more efficient
+ * for I/O in general, but may be strictly required on some platforms when
+ * using direct I/O.
+ */
+typedef union PGIOAlignedBlock
+{
+#ifdef pg_attribute_aligned
+ pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#endif
+ char data[BLCKSZ];
+ double force_align_d;
+ int64 force_align_i64;
+} PGIOAlignedBlock;
+
/* Same, but for an XLOG_BLCKSZ-sized buffer */
typedef union PGAlignedXLogBlock
{
+#ifdef pg_attribute_aligned
+ pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#endif
char data[XLOG_BLCKSZ];
double force_align_d;
int64 force_align_i64;
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index f2a106f983..323a4cfb4f 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -227,6 +227,13 @@
*/
#define PG_CACHE_LINE_SIZE 128
+/*
+ * Assumed memory alignment requirement for direct I/O. On currently known
+ * systems this size applies, even for memory that is backed by larger virtual
+ * memory pages.
+ */
+#define PG_IO_ALIGN_SIZE 4096
+
/*
*------------------------------------------------------------------------
* The following symbols are for enabling debugging code, not for
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7144fc9f60..85ef12c440 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -82,9 +82,10 @@ extern PGDLLIMPORT int max_safe_fds;
* to the appropriate Windows flag in src/port/open.c. We simulate it with
* fcntl(F_NOCACHE) on macOS inside fd.c's open() wrapper. We use the name
* PG_O_DIRECT rather than defining O_DIRECT in that case (probably not a good
- * idea on a Unix).
+ * idea on a Unix). We can only use it if the compiler will correctly align
+ * PGIOAlignedBlock for us, though.
*/
-#if defined(O_DIRECT)
+#if defined(O_DIRECT) && defined(pg_attribute_aligned)
#define PG_O_DIRECT O_DIRECT
#elif defined(F_NOCACHE)
#define PG_O_DIRECT 0x80000000
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 60c71d05fe..9a77664154 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1687,6 +1687,7 @@ PGEventResultDestroy
PGFInfoFunction
PGFileType
PGFunction
+PGIOAlignedBlock
PGLZ_HistEntry
PGLZ_Strategy
PGMessageField
--
2.35.1
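The alignment scheme that the patch above introduces with PGIOAlignedBlock can be tried outside the PostgreSQL tree. What follows is only a minimal sketch, not the patch's code: it assumes a C11 compiler and uses _Alignas in place of pg_attribute_aligned, and every DEMO_/Demo/demo_ name is hypothetical. It shows that over-aligning one union member raises the alignment of the whole union, including for stack variables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for BLCKSZ and PG_IO_ALIGN_SIZE. */
#define DEMO_BLCKSZ 8192
#define DEMO_IO_ALIGN_SIZE 4096

/* Over-aligning the data member raises the alignment of the whole union. */
typedef union DemoIOAlignedBlock
{
    _Alignas(DEMO_IO_ALIGN_SIZE) char data[DEMO_BLCKSZ];
    double      force_align_d;
    int64_t     force_align_i64;
} DemoIOAlignedBlock;

/* Returns 1 if ptr is a multiple of DEMO_IO_ALIGN_SIZE, else 0. */
static int
demo_is_io_aligned(const void *ptr)
{
    assert(ptr != NULL);
    return ((uintptr_t) ptr % DEMO_IO_ALIGN_SIZE) == 0;
}

/* Declares a block on the stack and reports whether it came out aligned. */
static int
demo_stack_block_is_aligned(void)
{
    DemoIOAlignedBlock block;

    block.data[0] = 0;          /* touch the buffer so it is really used */
    return demo_is_io_aligned(block.data);
}
```

A caller would declare DemoIOAlignedBlock wherever it previously declared a plain char array and pass its data member to the I/O routine. A compiler without such an extension cannot give this guarantee, which is the case the patch handles by defining PG_O_DIRECT as 0.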
Attachment: v2-0003-Add-io_direct-setting-developer-only.patch (text/x-patch; charset=US-ASCII)
From e6692d744a7d041519e9c0998ce9f34aabc63c1e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:54:18 +1300
Subject: [PATCH v2 3/4] Add io_direct setting (developer-only).
Provide a way to ask the kernel to use O_DIRECT (or local equivalent)
for data and WAL files. This hurts performance currently and is not
intended for end-users yet. Later proposed work would introduce our own
I/O clustering, read-ahead, etc to replace the kernel features that are
disabled with this option.
This replaces the previous logic that would use O_DIRECT for the WAL in
limited and obscure cases, now that there is an explicit setting.
Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com
---
doc/src/sgml/config.sgml | 33 +++++++
src/backend/access/transam/xlog.c | 37 ++++----
src/backend/access/transam/xlogprefetcher.c | 2 +-
src/backend/storage/buffer/bufmgr.c | 16 ++--
src/backend/storage/buffer/localbuf.c | 7 +-
src/backend/storage/file/fd.c | 88 +++++++++++++++++++
src/backend/storage/smgr/md.c | 24 +++--
src/backend/storage/smgr/smgr.c | 1 +
src/backend/utils/misc/guc_tables.c | 12 +++
src/include/storage/fd.h | 8 ++
src/include/storage/smgr.h | 1 +
src/include/utils/guc_hooks.h | 3 +
src/test/modules/test_misc/meson.build | 1 +
src/test/modules/test_misc/t/004_io_direct.pl | 40 +++++++++
14 files changed, 239 insertions(+), 34 deletions(-)
create mode 100644 src/test/modules/test_misc/t/004_io_direct.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8e4145979d..766d20f2ea 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11033,6 +11033,39 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-io-direct" xreflabel="io_direct">
+ <term><varname>io_direct</varname> (<type>string</type>)
+ <indexterm>
+ <primary><varname>io_direct</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Ask the kernel to minimize caching effects for relation data and WAL
+ files using <literal>O_DIRECT</literal> (most Unix-like systems),
+ <literal>F_NOCACHE</literal> (macOS) or
+ <literal>FILE_FLAG_NO_BUFFERING</literal> (Windows).
+ </para>
+ <para>
+ May be set to an empty string (the default) to disable use of direct
+ I/O, or a comma-separated list of types of files for which direct I/O
+ is enabled. The valid types of file are <literal>data</literal> for
+ main data files, <literal>wal</literal> for WAL files, and
+ <literal>wal_init</literal> for WAL files when being initially
+ allocated.
+ </para>
+ <para>
+ Some operating systems and file systems do not support direct I/O, so
+ non-default settings may be rejected at startup, or produce I/O errors
+ at runtime.
+ </para>
+ <para>
+ Currently this feature reduces performance, and is intended for
+ developer testing only.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-post-auth-delay" xreflabel="post_auth_delay">
<term><varname>post_auth_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a75c6813a4..08a2f7a558 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2925,6 +2925,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
XLogSegNo max_segno;
int fd;
int save_errno;
+ int open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
Assert(logtli != 0);
@@ -2957,8 +2958,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
unlink(tmppath);
+ if (io_direct_flags & IO_DIRECT_WAL_INIT)
+ open_flags |= PG_O_DIRECT;
+
/* do not use get_sync_bit() here --- want to fsync only at end of fill */
- fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+ fd = BasicOpenFile(tmppath, open_flags);
if (fd < 0)
ereport(ERROR,
(errcode_for_file_access(),
@@ -3350,7 +3354,7 @@ XLogFileClose(void)
* use the cache to read the WAL segment.
*/
#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
- if (!XLogIsNeeded())
+ if (!XLogIsNeeded() && (io_direct_flags & IO_DIRECT_WAL) == 0)
(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
#endif
@@ -4450,7 +4454,6 @@ show_in_hot_standby(void)
return RecoveryInProgress() ? "on" : "off";
}
-
/*
* Read the control file, set respective GUCs.
*
@@ -8034,35 +8037,27 @@ xlog_redo(XLogReaderState *record)
}
/*
- * Return the (possible) sync flag used for opening a file, depending on the
- * value of the GUC wal_sync_method.
+ * Return the extra open flags used for opening a file, depending on the
+ * value of the GUCs wal_sync_method, fsync and io_direct.
*/
static int
get_sync_bit(int method)
{
int o_direct_flag = 0;
- /* If fsync is disabled, never open in sync mode */
- if (!enableFsync)
- return 0;
-
/*
- * Optimize writes by bypassing kernel cache with O_DIRECT when using
- * O_SYNC and O_DSYNC. But only if archiving and streaming are disabled,
- * otherwise the archive command or walsender process will read the WAL
- * soon after writing it, which is guaranteed to cause a physical read if
- * we bypassed the kernel cache. We also skip the
- * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
- * reason.
- *
- * Never use O_DIRECT in walreceiver process for similar reasons; the WAL
+ * Use O_DIRECT if requested, except in walreceiver process. The WAL
* written by walreceiver is normally read by the startup process soon
- * after it's written. Also, walreceiver performs unaligned writes, which
+ * after it's written. Also, walreceiver performs unaligned writes, which
* don't work with O_DIRECT, so it is required for correctness too.
*/
- if (!XLogIsNeeded() && !AmWalReceiverProcess())
+ if ((io_direct_flags & IO_DIRECT_WAL) && !AmWalReceiverProcess())
o_direct_flag = PG_O_DIRECT;
+ /* If fsync is disabled, never open in sync mode */
+ if (!enableFsync)
+ return o_direct_flag;
+
switch (method)
{
/*
@@ -8074,7 +8069,7 @@ get_sync_bit(int method)
case SYNC_METHOD_FSYNC:
case SYNC_METHOD_FSYNC_WRITETHROUGH:
case SYNC_METHOD_FDATASYNC:
- return 0;
+ return o_direct_flag;
#ifdef O_SYNC
case SYNC_METHOD_OPEN:
return O_SYNC | o_direct_flag;
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 0cf03945ee..992256dd09 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -785,7 +785,7 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
block->prefetch_buffer = InvalidBuffer;
return LRQ_NEXT_IO;
}
- else
+ else if ((io_direct_flags & IO_DIRECT_DATA) == 0)
{
/*
* This shouldn't be possible, because we already determined
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index aba07e94c9..11c8187a55 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -535,8 +535,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
* Try to initiate an asynchronous read. This returns false in
* recovery if the relation file doesn't exist.
*/
- if (smgrprefetch(smgr_reln, forkNum, blockNum))
+ if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ smgrprefetch(smgr_reln, forkNum, blockNum))
+ {
result.initiated_io = true;
+ }
#endif /* USE_PREFETCH */
}
else
@@ -582,11 +585,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
* the kernel and therefore didn't really initiate I/O, and no way to know when
* the I/O completes other than using synchronous ReadBuffer().
*
- * 3. Otherwise, the buffer wasn't already cached by PostgreSQL, and either
+ * 3. Otherwise, the buffer wasn't already cached by PostgreSQL, and
* USE_PREFETCH is not defined (this build doesn't support prefetching due to
- * lack of a kernel facility), or the underlying relation file wasn't found and
- * we are in recovery. (If the relation file wasn't found and we are not in
- * recovery, an error is raised).
+ * lack of a kernel facility), direct I/O is enabled, or the underlying
+ * relation file wasn't found and we are in recovery. (If the relation file
+ * wasn't found and we are not in recovery, an error is raised).
*/
PrefetchBufferResult
PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
@@ -4908,6 +4911,9 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
{
PendingWriteback *pending;
+ if (io_direct_flags & IO_DIRECT_DATA)
+ return;
+
/*
* Add buffer to the pending writeback array, unless writeback control is
* disabled.
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 735f7c6018..b01e319641 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -87,8 +87,11 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
{
#ifdef USE_PREFETCH
/* Not in buffers, so initiate prefetch */
- smgrprefetch(smgr, forkNum, blockNum);
- result.initiated_io = true;
+ if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ smgrprefetch(smgr, forkNum, blockNum))
+ {
+ result.initiated_io = true;
+ }
#endif /* USE_PREFETCH */
}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f6c9382023..0829e9b8df 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -98,7 +98,9 @@
#include "storage/fd.h"
#include "storage/ipc.h"
#include "utils/guc.h"
+#include "utils/guc_hooks.h"
#include "utils/resowner_private.h"
+#include "utils/varlena.h"
/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
#if defined(HAVE_SYNC_FILE_RANGE)
@@ -162,6 +164,9 @@ bool data_sync_retry = false;
/* How SyncDataDirectory() should do its job. */
int recovery_init_sync_method = RECOVERY_INIT_SYNC_METHOD_FSYNC;
+/* Which kinds of files should be opened with PG_O_DIRECT. */
+int io_direct_flags;
+
/* Debugging.... */
#ifdef FDDEBUG
@@ -2021,6 +2026,11 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
if (nbytes <= 0)
return;
+#ifdef PG_O_DIRECT
+ if (VfdCache[file].fileFlags & PG_O_DIRECT)
+ return;
+#endif
+
returnCode = FileAccess(file);
if (returnCode < 0)
return;
@@ -3737,3 +3747,81 @@ data_sync_elevel(int elevel)
{
return data_sync_retry ? elevel : PANIC;
}
+
+bool
+check_io_direct(char **newval, void **extra, GucSource source)
+{
+#if PG_O_DIRECT == 0
+ if (*newval)
+ {
+ GUC_check_errdetail("io_direct is not supported on this platform.");
+ return false;
+ }
+#else
+ List *list;
+ ListCell *l;
+ int *flags;
+
+ if (!SplitGUCList(*newval, ',', &list))
+ {
+ GUC_check_errdetail("invalid list syntax in parameter \"%s\"",
+ "io_direct");
+ return false;
+ }
+
+ flags = guc_malloc(ERROR, sizeof(*flags));
+ *flags = 0;
+ foreach (l, list)
+ {
+ char *item = (char *) lfirst(l);
+
+ if (pg_strcasecmp(item, "data") == 0)
+ *flags |= IO_DIRECT_DATA;
+ else if (pg_strcasecmp(item, "wal") == 0)
+ *flags |= IO_DIRECT_WAL;
+ else if (pg_strcasecmp(item, "wal_init") == 0)
+ *flags |= IO_DIRECT_WAL_INIT;
+ else
+ {
+ GUC_check_errdetail("invalid option \"%s\"", item);
+ return false;
+ }
+ }
+
+ *extra = flags;
+
+ return true;
+#endif
+}
+
+extern void
+assign_io_direct(const char *newval, void *extra)
+{
+ int *flags = (int *) extra;
+
+ io_direct_flags = *flags;
+}
+
+extern const char *
+show_io_direct(void)
+{
+ static char result[80];
+
+ result[0] = 0;
+ if (io_direct_flags & IO_DIRECT_DATA)
+ strcat(result, "data");
+ if (io_direct_flags & IO_DIRECT_WAL)
+ {
+ if (result[0])
+ strcat(result, ", ");
+ strcat(result, "wal");
+ }
+ if (io_direct_flags & IO_DIRECT_WAL_INIT)
+ {
+ if (result[0])
+ strcat(result, ", ");
+ strcat(result, "wal_init");
+ }
+
+ return result;
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 3e034afdf1..38263f3d0f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -142,6 +142,16 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
+static inline int
+_mdfd_open_flags(ForkNumber forkNum)
+{
+ int flags = O_RDWR | PG_BINARY;
+
+ if (io_direct_flags & IO_DIRECT_DATA)
+ flags |= PG_O_DIRECT;
+
+ return flags;
+}
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -205,14 +215,14 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
path = relpath(reln->smgr_rlocator, forknum);
- fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+ fd = PathNameOpenFile(path, _mdfd_open_flags(forknum) | O_CREAT | O_EXCL);
if (fd < 0)
{
int save_errno = errno;
if (isRedo)
- fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+ fd = PathNameOpenFile(path, _mdfd_open_flags(forknum));
if (fd < 0)
{
/* be sure to report the error reported by create, not open */
@@ -527,7 +537,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
path = relpath(reln->smgr_rlocator, forknum);
- fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+ fd = PathNameOpenFile(path, _mdfd_open_flags(forknum));
if (fd < 0)
{
@@ -598,6 +608,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
off_t seekpos;
MdfdVec *v;
+ Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
+
v = _mdfd_getseg(reln, forknum, blocknum, false,
InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
if (v == NULL)
@@ -623,6 +635,8 @@ void
mdwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks)
{
+ Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
+
/*
* Issue flush requests in as few requests as possible; have to split at
* segment boundaries though, since those are actually separate files.
@@ -1200,7 +1214,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
fullpath = _mdfd_segpath(reln, forknum, segno);
/* open the file */
- fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | oflags);
+ fd = PathNameOpenFile(fullpath, _mdfd_open_flags(forknum) | oflags);
pfree(fullpath);
@@ -1410,7 +1424,7 @@ mdsyncfiletag(const FileTag *ftag, char *path)
strlcpy(path, p, MAXPGPATH);
pfree(p);
- file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+ file = PathNameOpenFile(path, _mdfd_open_flags(ftag->forknum));
if (file < 0)
return -1;
need_to_close = true;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index c1a5febcbf..4892920812 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -20,6 +20,7 @@
#include "access/xlogutils.h"
#include "lib/ilist.h"
#include "storage/bufmgr.h"
+#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/md.h"
#include "storage/smgr.h"
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1bf14eec66..1de30ebbf1 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -543,6 +543,7 @@ static char *locale_ctype;
static char *server_encoding_string;
static char *server_version_string;
static int server_version_num;
+static char *io_direct_string;
#ifdef HAVE_SYSLOG
#define DEFAULT_SYSLOG_FACILITY LOG_LOCAL0
@@ -4468,6 +4469,17 @@ struct config_string ConfigureNamesString[] =
check_backtrace_functions, assign_backtrace_functions, NULL
},
+ {
+ {"io_direct", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Use direct I/O for file access."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &io_direct_string,
+ "",
+ check_io_direct, assign_io_direct, show_io_direct
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 85ef12c440..0d65cb3c80 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -44,6 +44,8 @@
#define FD_H
#include <dirent.h>
+#include <fcntl.h>
+
typedef enum RecoveryInitSyncMethod
{
@@ -54,10 +56,16 @@ typedef enum RecoveryInitSyncMethod
typedef int File;
+#define IO_DIRECT_DATA 0x01
+#define IO_DIRECT_WAL 0x02
+#define IO_DIRECT_WAL_INIT 0x04
+
+
/* GUC parameter */
extern PGDLLIMPORT int max_files_per_process;
extern PGDLLIMPORT bool data_sync_retry;
extern PGDLLIMPORT int recovery_init_sync_method;
+extern PGDLLIMPORT int io_direct_flags;
/*
* This is private to fd.c, but exported for save/restore_backend_variables()
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a07715356b..ea7b3ff8dd 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -17,6 +17,7 @@
#include "lib/ilist.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "utils/guc.h"
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index f1a9a183b4..61a7fd77b8 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -154,5 +154,8 @@ extern bool check_wal_consistency_checking(char **newval, void **extra,
GucSource source);
extern void assign_wal_consistency_checking(const char *newval, void *extra);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+extern bool check_io_direct(char **newval, void **extra, GucSource source);
+extern void assign_io_direct(const char *newval, void *extra);
+extern const char *show_io_direct(void);
#endif /* GUC_HOOKS_H */
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index cfc830ff39..97162d2b8f 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -7,6 +7,7 @@ tests += {
't/001_constraint_validation.pl',
't/002_tablespace.pl',
't/003_check_guc.pl',
+ 't/004_io_direct.pl',
],
},
}
diff --git a/src/test/modules/test_misc/t/004_io_direct.pl b/src/test/modules/test_misc/t/004_io_direct.pl
new file mode 100644
index 0000000000..9a79fc8f9d
--- /dev/null
+++ b/src/test/modules/test_misc/t/004_io_direct.pl
@@ -0,0 +1,40 @@
+# Very simple exercise of direct I/O GUC.
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Systems that we know to have direct I/O support, and whose typical local
+# filesystems support it or at least won't fail with an error.  (illumos should
+# probably be in this list, but perl reports it as solaris.  Solaris should not
+# be in the list because we don't support its way of turning on direct I/O, and
+# even if we did, its version of ZFS rejects it.  OpenBSD just doesn't have
+# it.)
+if (!grep { $^O eq $_ } qw(aix darwin dragonfly freebsd linux MSWin32 netbsd))
+{
+ plan skip_all => "no direct I/O support";
+}
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->append_conf('postgresql.conf', "io_direct = 'data,wal,wal_init'");
+$node->append_conf('postgresql.conf', "shared_buffers = '128kB'"); # tiny to force I/O
+$node->start;
+
+# Do some work that is bound to generate shared and local writes and reads as a
+# simple exercise.
+$node->safe_psql('postgres', 'create table t1 as select 1 as i from generate_series(1, 10000)');
+$node->safe_psql('postgres', 'create table t2count (i int)');
+$node->safe_psql('postgres', 'begin; create temporary table t2 as select 1 as i from generate_series(1, 10000); update t2 set i = i; insert into t2count select count(*) from t2; commit;');
+$node->safe_psql('postgres', 'update t1 set i = i');
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'), "read back from shared");
+is('10000', $node->safe_psql('postgres', 'select * from t2count'), "read back from local");
+$node->stop('immediate');
+
+$node->start;
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'), "read back from shared after crash recovery");
+$node->stop;
+
+done_testing();
--
2.35.1
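The mapping from the io_direct string to the IO_DIRECT_* bits in the patch above can be illustrated with a standalone sketch. This is not the patch's code: it replaces SplitGUCList() and pg_strcasecmp() with hand-written parsing, does not handle quoted items, and all DEMO_/demo_ names are hypothetical. It only shows the shape of the check and show hooks under those assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Same bit values as the patch's IO_DIRECT_* macros; names are hypothetical. */
#define DEMO_IO_DIRECT_DATA     0x01
#define DEMO_IO_DIRECT_WAL      0x02
#define DEMO_IO_DIRECT_WAL_INIT 0x04
#define DEMO_SHOW_BUFSZ         32  /* "data, wal, wal_init" needs 20 bytes */

/* Case-insensitive match of a length-delimited token against a lowercase keyword. */
static bool
demo_token_is(const char *tok, size_t len, const char *keyword)
{
    if (strlen(keyword) != len)
        return false;
    for (size_t i = 0; i < len; i++)
    {
        if (tolower((unsigned char) tok[i]) != keyword[i])
            return false;
    }
    return true;
}

/*
 * Parse a comma-separated list such as "data, wal" into a bit mask.
 * An empty (or all-whitespace) string yields 0.  Returns false on an
 * unknown or empty item; *flags is then unspecified.
 */
static bool
demo_parse_io_direct(const char *value, int *flags)
{
    const char *p = value;

    *flags = 0;
    while (isspace((unsigned char) *p))
        p++;
    if (*p == '\0')
        return true;            /* empty setting: direct I/O disabled */

    for (;;)
    {
        const char *start;
        size_t      len;

        while (isspace((unsigned char) *p))
            p++;
        start = p;
        while (*p != '\0' && *p != ',')
            p++;
        len = (size_t) (p - start);
        while (len > 0 && isspace((unsigned char) start[len - 1]))
            len--;              /* trim trailing whitespace */

        if (demo_token_is(start, len, "data"))
            *flags |= DEMO_IO_DIRECT_DATA;
        else if (demo_token_is(start, len, "wal"))
            *flags |= DEMO_IO_DIRECT_WAL;
        else if (demo_token_is(start, len, "wal_init"))
            *flags |= DEMO_IO_DIRECT_WAL_INIT;
        else
            return false;

        if (*p == '\0')
            return true;
        p++;                    /* skip the comma */
    }
}

/* Render a bit mask back to text; out must hold DEMO_SHOW_BUFSZ bytes. */
static void
demo_show_io_direct(int flags, char *out)
{
    const char *sep = "";

    assert(out != NULL);
    out[0] = '\0';
    if (flags & DEMO_IO_DIRECT_DATA)
    {
        strcat(out, "data");
        sep = ", ";
    }
    if (flags & DEMO_IO_DIRECT_WAL)
    {
        strcat(out, sep);
        strcat(out, "wal");
        sep = ", ";
    }
    if (flags & DEMO_IO_DIRECT_WAL_INIT)
    {
        strcat(out, sep);
        strcat(out, "wal_init");
    }
}
```

Storing the parsed value as a bit mask keeps the hot-path tests cheap: call sites such as the patch's _mdfd_open_flags() only need a single AND against io_direct_flags.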
Attachment: v2-0004-XXX-turn-on-direct-I-O-by-default-just-for-CI.patch (text/x-patch; charset=US-ASCII)
From 75dec1b3ffa91ca1279267092187191ca99fb713 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:55:09 +1300
Subject: [PATCH v2 4/4] XXX turn on direct I/O by default, just for CI
---
src/backend/utils/misc/guc_tables.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1de30ebbf1..0fc2185568 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -4476,7 +4476,7 @@ struct config_string ConfigureNamesString[] =
GUC_NOT_IN_SAMPLE
},
&io_direct_string,
- "",
+ "data,wal,wal_init",
check_io_direct, assign_io_direct, show_io_direct
},
--
2.35.1
On Wed, Dec 14, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> 0001 -- David's palloc_aligned() patch https://commitfest.postgresql.org/41/3999/
> 0002 -- I/O-align almost all buffers used for I/O
> 0003 -- Add the GUCs
> 0004 -- Throwaway hack to make cfbot turn the GUCs on
David pushed the first as commit 439f6175, so here is a rebase of the
rest. I also fixed a couple of thinkos in the handling of systems
where we don't know how to do direct I/O. In one place I had #ifdef
PG_O_DIRECT, but that macro is always defined; it is simply 0 on
Solaris and OpenBSD. The check to reject the GUC also wasn't quite
right on such systems.
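The #ifdef thinko mentioned above is easy to reproduce in isolation. The sketch below is not taken from the patches; DEMO_O_DIRECT is a hypothetical stand-in for PG_O_DIRECT on a platform where it is defined as 0. It shows that #ifdef only asks whether the macro exists, while a value test is needed to ask whether direct I/O is usable.

```c
#include <assert.h>

/* Hypothetical stand-in for PG_O_DIRECT on a platform without direct I/O. */
#define DEMO_O_DIRECT 0

/* Tests only whether the macro is defined, so it reports "supported". */
static int
demo_ifdef_says_supported(void)
{
#ifdef DEMO_O_DIRECT
    return 1;
#else
    return 0;
#endif
}

/* Tests the macro's value, so a definition of 0 reports "unsupported". */
static int
demo_value_says_supported(void)
{
#if DEMO_O_DIRECT != 0
    return 1;
#else
    return 0;
#endif
}

/* The two tests disagree whenever the macro is defined as 0. */
static void
demo_check(void)
{
    assert(demo_ifdef_says_supported() == 1);
    assert(demo_value_says_supported() == 0);
}
```

Under this assumption, code guarded by the #ifdef form would still be compiled on platforms such as Solaris and OpenBSD, and would then test against a flag value of 0.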
Attachments:
Attachment: v3-0001-Introduce-PG_IO_ALIGN_SIZE-and-align-all-I-O-buff.patch (text/x-patch; charset=US-ASCII)
From f6adf05ffa5bdf43cd3ca2bcc4dba39d1474ce09 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:25:59 +1300
Subject: [PATCH v3 1/3] Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.
In order to be allowed to use O_DIRECT in a later commit, we need to
align buffers to the virtual memory page size. O_DIRECT would either
fail to work or fail to work efficiently without that on various
platforms. Even without O_DIRECT, aligning on memory pages improves
traditional buffered I/O performance.
The alignment size is set to 4096, which is enough for currently known
systems. There is no standard governing the requirements for O_DIRECT so
it's possible we might have to reconsider this approach or fail to work
on some exotic system, but for now this simplistic approach works and
it can be changed at compile time.
Adjust all call sites that allocate heap memory for file I/O to use the
new palloc_aligned() or MemoryContextAllocAligned() functions. For
stack-allocated buffers, introduce PGIOAlignedBlock to respect
PG_IO_ALIGN_SIZE, if possible with this compiler. Also align the main
buffer pool in shared memory.
If arbitrary alignment of stack objects is not possible with this
compiler, then completely disable the use of O_DIRECT by setting
PG_O_DIRECT to 0. (This avoids the need to consider systems that have
O_DIRECT but don't have a compiler with an extension that can align
stack objects the way we want; that could be done but we don't currently
know of any such system, so it's easier to pretend there is no O_DIRECT
support instead: that's an existing and tested class of system.)
Add assertions that all buffers passed into smgrread(), smgrwrite(),
smgrextend() are correctly aligned, if PG_O_DIRECT isn't 0.
Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
---
contrib/bloom/blinsert.c | 2 +-
contrib/pg_prewarm/pg_prewarm.c | 2 +-
src/backend/access/gist/gistbuild.c | 9 +++---
src/backend/access/hash/hashpage.c | 2 +-
src/backend/access/heap/rewriteheap.c | 2 +-
src/backend/access/heap/visibilitymap.c | 2 +-
src/backend/access/nbtree/nbtree.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 8 ++++--
src/backend/access/spgist/spginsert.c | 2 +-
src/backend/access/transam/generic_xlog.c | 13 ++++++---
src/backend/access/transam/xlog.c | 9 +++---
src/backend/catalog/storage.c | 2 +-
src/backend/storage/buffer/buf_init.c | 10 +++++--
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/buffer/localbuf.c | 7 +++--
src/backend/storage/file/buffile.c | 6 ++++
src/backend/storage/freespace/freespace.c | 2 +-
src/backend/storage/page/bufpage.c | 5 +++-
src/backend/storage/smgr/md.c | 15 +++++++++-
src/backend/utils/sort/logtape.c | 2 +-
src/bin/pg_checksums/pg_checksums.c | 2 +-
src/bin/pg_rewind/local_source.c | 4 +--
src/bin/pg_upgrade/file.c | 4 +--
src/common/file_utils.c | 2 +-
src/include/c.h | 34 +++++++++++++++++------
src/include/pg_config_manual.h | 7 +++++
src/include/storage/fd.h | 5 ++--
src/tools/pgindent/typedefs.list | 1 +
28 files changed, 114 insertions(+), 49 deletions(-)
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index dd26d6ac29..53cc617a66 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -166,7 +166,7 @@ blbuildempty(Relation index)
Page metapage;
/* Construct metapage. */
- metapage = (Page) palloc(BLCKSZ);
+ metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
BloomFillMetapage(index, metapage);
/*
diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index caff5c4a80..f50aa69eb2 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -36,7 +36,7 @@ typedef enum
PREWARM_BUFFER
} PrewarmType;
-static PGAlignedBlock blockbuffer;
+static PGIOAlignedBlock blockbuffer;
/*
* pg_prewarm(regclass, mode text, fork text,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index fb0f466708..d3d7d836e9 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -415,7 +415,7 @@ gist_indexsortbuild(GISTBuildState *state)
* Write an empty page as a placeholder for the root page. It will be
* replaced with the real root page at the end.
*/
- page = palloc0(BLCKSZ);
+ page = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
page, true);
state->pages_allocated++;
@@ -509,7 +509,8 @@ gist_indexsortbuild_levelstate_add(GISTBuildState *state,
levelstate->current_page++;
if (levelstate->pages[levelstate->current_page] == NULL)
- levelstate->pages[levelstate->current_page] = palloc(BLCKSZ);
+ levelstate->pages[levelstate->current_page] =
+ palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
newPage = levelstate->pages[levelstate->current_page];
gistinitpage(newPage, old_page_flags);
@@ -579,7 +580,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
/* Create page and copy data */
data = (char *) (dist->list);
- target = palloc0(BLCKSZ);
+ target = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
gistinitpage(target, isleaf ? F_LEAF : 0);
for (int i = 0; i < dist->block.num; i++)
{
@@ -630,7 +631,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
if (parent == NULL)
{
parent = palloc0(sizeof(GistSortedBuildLevelState));
- parent->pages[0] = (Page) palloc(BLCKSZ);
+ parent->pages[0] = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
parent->parent = NULL;
gistinitpage(parent->pages[0], 0);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 55b2929ad5..147af95e92 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -992,7 +992,7 @@ static bool
_hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
{
BlockNumber lastblock;
- PGAlignedBlock zerobuf;
+ PGIOAlignedBlock zerobuf;
Page page;
HashPageOpaque ovflopaque;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2fe9e48e50..23d966940e 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -255,7 +255,7 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
state->rs_old_rel = old_heap;
state->rs_new_rel = new_heap;
- state->rs_buffer = (Page) palloc(BLCKSZ);
+ state->rs_buffer = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4ed70275e2..3bd65b275b 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -620,7 +620,7 @@ static void
vm_extend(Relation rel, BlockNumber vm_nblocks)
{
BlockNumber vm_nblocks_now;
- PGAlignedBlock pg;
+ PGIOAlignedBlock pg;
SMgrRelation reln;
PageInit((Page) pg.data, BLCKSZ, 0);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b52eca8f38..e8ac7390ae 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -153,7 +153,7 @@ btbuildempty(Relation index)
Page metapage;
/* Construct metapage. */
- metapage = (Page) palloc(BLCKSZ);
+ metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 501e011ce1..5e3c461f6f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -619,7 +619,7 @@ _bt_blnewpage(uint32 level)
Page page;
BTPageOpaque opaque;
- page = (Page) palloc(BLCKSZ);
+ page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
/* Zero the page and set up standard page header info */
_bt_pageinit(page, BLCKSZ);
@@ -660,7 +660,9 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
while (blkno > wstate->btws_pages_written)
{
if (!wstate->btws_zeropage)
- wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
+ wstate->btws_zeropage = (Page) palloc_aligned(BLCKSZ,
+ PG_IO_ALIGN_SIZE,
+ MCXT_ALLOC_ZERO);
/* don't set checksum for all-zero page */
smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
wstate->btws_pages_written++,
@@ -1170,7 +1172,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* set to point to "P_NONE"). This changes the index to the "valid" state
* by filling in a valid magic number in the metapage.
*/
- metapage = (Page) palloc(BLCKSZ);
+ metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
_bt_initmetapage(metapage, rootblkno, rootlevel,
wstate->inskey->allequalimage);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index c6821b5952..6f14b41329 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -158,7 +158,7 @@ spgbuildempty(Relation index)
Page page;
/* Construct metapage. */
- page = (Page) palloc(BLCKSZ);
+ page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
SpGistInitMetapage(page);
/*
diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 6db9a1fca1..458e270d55 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -58,14 +58,17 @@ typedef struct
char delta[MAX_DELTA_SIZE]; /* delta between page images */
} PageData;
-/* State of generic xlog record construction */
+/*
+ * State of generic xlog record construction. Must be allocated at an I/O
+ * aligned address.
+ */
struct GenericXLogState
{
+ /* Page images (properly aligned, must be first) */
+ PGIOAlignedBlock images[MAX_GENERIC_XLOG_PAGES];
/* Info about each page, see above */
PageData pages[MAX_GENERIC_XLOG_PAGES];
bool isLogged;
- /* Page images (properly aligned) */
- PGAlignedBlock images[MAX_GENERIC_XLOG_PAGES];
};
static void writeFragment(PageData *pageData, OffsetNumber offset,
@@ -269,7 +272,9 @@ GenericXLogStart(Relation relation)
GenericXLogState *state;
int i;
- state = (GenericXLogState *) palloc(sizeof(GenericXLogState));
+ state = (GenericXLogState *) palloc_aligned(sizeof(GenericXLogState),
+ PG_IO_ALIGN_SIZE,
+ 0);
state->isLogged = RelationNeedsWAL(relation);
for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 91473b00d9..172b4a2fcf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4502,7 +4502,7 @@ XLOGShmemSize(void)
/* xlblocks array */
size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
/* extra alignment padding for XLOG I/O buffers */
- size = add_size(size, XLOG_BLCKSZ);
+ size = add_size(size, Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE));
/* and the buffers themselves */
size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
@@ -4599,10 +4599,11 @@ XLOGShmemInit(void)
/*
* Align the start of the page buffers to a full xlog block size boundary.
- * This simplifies some calculations in XLOG insertion. It is also
- * required for O_DIRECT.
+ * This simplifies some calculations in XLOG insertion. We also need I/O
+ * alignment for O_DIRECT, but that's also a power of two and usually
+ * smaller. Take the larger of the two alignment requirements.
*/
- allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
+ allocptr = (char *) TYPEALIGN(Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE), allocptr);
XLogCtl->pages = allocptr;
memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d708af19ed..0c5ac1f94b 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -451,7 +451,7 @@ void
RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
ForkNumber forkNum, char relpersistence)
{
- PGAlignedBlock buf;
+ PGIOAlignedBlock buf;
Page page;
bool use_wal;
bool copying_initfork;
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 6b6264854e..76a30d44b7 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -78,9 +78,12 @@ InitBufferPool(void)
NBuffers * sizeof(BufferDescPadded),
&foundDescs);
+ /* Align buffer pool on IO page size boundary. */
BufferBlocks = (char *)
- ShmemInitStruct("Buffer Blocks",
- NBuffers * (Size) BLCKSZ, &foundBufs);
+ TYPEALIGN(PG_IO_ALIGN_SIZE,
+ ShmemInitStruct("Buffer Blocks",
+ NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+ &foundBufs));
/* Align condition variables to cacheline boundary. */
BufferIOCVArray = (ConditionVariableMinimallyPadded *)
@@ -163,7 +166,8 @@ BufferShmemSize(void)
/* to allow aligning buffer descriptors */
size = add_size(size, PG_CACHE_LINE_SIZE);
- /* size of data pages */
+ /* size of data pages, plus alignment padding */
+ size = add_size(size, PG_IO_ALIGN_SIZE);
size = add_size(size, mul_size(NBuffers, BLCKSZ));
/* size of stuff controlled by freelist.c */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 73d30bf619..aba07e94c9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3717,7 +3717,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
bool use_wal;
BlockNumber nblocks;
BlockNumber blkno;
- PGAlignedBlock buf;
+ PGIOAlignedBlock buf;
BufferAccessStrategy bstrategy_src;
BufferAccessStrategy bstrategy_dst;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..735f7c6018 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -546,8 +546,11 @@ GetLocalBufferStorage(void)
/* And don't overflow MaxAllocSize, either */
num_bufs = Min(num_bufs, MaxAllocSize / BLCKSZ);
- cur_block = (char *) MemoryContextAlloc(LocalBufferContext,
- num_bufs * BLCKSZ);
+ /* Buffers should be I/O aligned. */
+ cur_block = (char *)
+ TYPEALIGN(PG_IO_ALIGN_SIZE,
+ MemoryContextAlloc(LocalBufferContext,
+ num_bufs * BLCKSZ + PG_IO_ALIGN_SIZE));
next_buf_in_block = 0;
num_bufs_in_block = num_bufs;
}
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index b0b4eeb3bd..2261c3ebe3 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -95,6 +95,12 @@ struct BufFile
off_t curOffset; /* offset part of current pos */
int pos; /* next read/write position in buffer */
int nbytes; /* total # of valid bytes in buffer */
+
+ /*
+ * XXX Should ideally use PGIOAlignedBlock, but might need a way to avoid
+ * wasting per-file alignment padding when some users create many
+ * files.
+ */
PGAlignedBlock buffer;
};
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index a6b0533103..7230d538fd 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -608,7 +608,7 @@ static void
fsm_extend(Relation rel, BlockNumber fsm_nblocks)
{
BlockNumber fsm_nblocks_now;
- PGAlignedBlock pg;
+ PGIOAlignedBlock pg;
SMgrRelation reln;
PageInit((Page) pg.data, BLCKSZ, 0);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 8b617c7e79..0728ce30c0 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1522,7 +1522,10 @@ PageSetChecksumCopy(Page page, BlockNumber blkno)
* and second to avoid wasting space in processes that never call this.
*/
if (pageCopy == NULL)
- pageCopy = MemoryContextAlloc(TopMemoryContext, BLCKSZ);
+ pageCopy = MemoryContextAllocAligned(TopMemoryContext,
+ BLCKSZ,
+ PG_IO_ALIGN_SIZE,
+ 0);
memcpy(pageCopy, (char *) page, BLCKSZ);
((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 14b6fa0fd9..3e034afdf1 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -453,6 +453,10 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
int nbytes;
MdfdVec *v;
+ /* If this build supports direct I/O, the buffer must be I/O aligned. */
+ if (PG_O_DIRECT != 0)
+ Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
/* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
Assert(blocknum >= mdnblocks(reln, forknum));
@@ -675,6 +679,10 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
int nbytes;
MdfdVec *v;
+ /* If this build supports direct I/O, the buffer must be I/O aligned. */
+ if (PG_O_DIRECT != 0)
+ Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
reln->smgr_rlocator.locator.spcOid,
reln->smgr_rlocator.locator.dbOid,
@@ -740,6 +748,10 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
int nbytes;
MdfdVec *v;
+ /* If this build supports direct I/O, the buffer must be I/O aligned. */
+ if (PG_O_DIRECT != 0)
+ Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
/* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
Assert(blocknum < mdnblocks(reln, forknum));
@@ -1294,7 +1306,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
*/
if (nblocks < ((BlockNumber) RELSEG_SIZE))
{
- char *zerobuf = palloc0(BLCKSZ);
+ char *zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE,
+ MCXT_ALLOC_ZERO);
mdextend(reln, forknum,
nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index c384f98e13..6ba5030a5f 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -252,7 +252,7 @@ ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
*/
while (blocknum > lts->nBlocksWritten)
{
- PGAlignedBlock zerobuf;
+ PGIOAlignedBlock zerobuf;
MemSet(zerobuf.data, 0, sizeof(zerobuf));
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 7f3d5fc040..be6bae2b78 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -183,7 +183,7 @@ skipfile(const char *fn)
static void
scan_file(const char *fn, int segmentno)
{
- PGAlignedBlock buf;
+ PGIOAlignedBlock buf;
PageHeader header = (PageHeader) buf.data;
int f;
BlockNumber blockno;
diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c
index 2e50485c39..83b37a1e91 100644
--- a/src/bin/pg_rewind/local_source.c
+++ b/src/bin/pg_rewind/local_source.c
@@ -77,7 +77,7 @@ static void
local_queue_fetch_file(rewind_source *source, const char *path, size_t len)
{
const char *datadir = ((local_source *) source)->datadir;
- PGAlignedBlock buf;
+ PGIOAlignedBlock buf;
char srcpath[MAXPGPATH];
int srcfd;
size_t written_len;
@@ -129,7 +129,7 @@ local_queue_fetch_range(rewind_source *source, const char *path, off_t off,
size_t len)
{
const char *datadir = ((local_source *) source)->datadir;
- PGAlignedBlock buf;
+ PGIOAlignedBlock buf;
char srcpath[MAXPGPATH];
int srcfd;
off_t begin = off;
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 079fbda838..b5809236f6 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -178,8 +178,8 @@ rewriteVisibilityMap(const char *fromfile, const char *tofile,
{
int src_fd;
int dst_fd;
- PGAlignedBlock buffer;
- PGAlignedBlock new_vmbuf;
+ PGIOAlignedBlock buffer;
+ PGIOAlignedBlock new_vmbuf;
ssize_t totalBytesRead = 0;
ssize_t src_filesize;
int rewriteVmBytesPerPage;
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index d8507d88a5..83ef0609a2 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -539,7 +539,7 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
ssize_t
pg_pwrite_zeros(int fd, size_t size)
{
- PGAlignedBlock zbuffer; /* worth BLCKSZ */
+ PGIOAlignedBlock zbuffer; /* worth BLCKSZ */
size_t zbuffer_sz;
struct iovec iov[PG_IOV_MAX];
int blocks;
diff --git a/src/include/c.h b/src/include/c.h
index bd6d8e5bf5..d811181fdf 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -1071,14 +1071,11 @@ extern void ExceptionalCondition(const char *conditionName,
/*
* Use this, not "char buf[BLCKSZ]", to declare a field or local variable
- * holding a page buffer, if that page might be accessed as a page and not
- * just a string of bytes. Otherwise the variable might be under-aligned,
- * causing problems on alignment-picky hardware. (In some places, we use
- * this to declare buffers even though we only pass them to read() and
- * write(), because copying to/from aligned buffers is usually faster than
- * using unaligned buffers.) We include both "double" and "int64" in the
- * union to ensure that the compiler knows the value must be MAXALIGN'ed
- * (cf. configure's computation of MAXIMUM_ALIGNOF).
+ * holding a page buffer, if that page might be accessed as a page. Otherwise
+ * the variable might be under-aligned, causing problems on alignment-picky
+ * hardware. We include both "double" and "int64" in the union to ensure that
+ * the compiler knows the value must be MAXALIGN'ed (cf. configure's
+ * computation of MAXIMUM_ALIGNOF).
*/
typedef union PGAlignedBlock
{
@@ -1087,9 +1084,30 @@ typedef union PGAlignedBlock
int64 force_align_i64;
} PGAlignedBlock;
+/*
+ * Use this to declare a field or local variable holding a page buffer, if that
+ * page might be accessed as a page or passed to an SMgr I/O function. If
+ * allocating using the MemoryContext API, the aligned allocation functions
+ * should be used with PG_IO_ALIGN_SIZE. This alignment may be more efficient
+ * for I/O in general, but may be strictly required on some platforms when
+ * using direct I/O.
+ */
+typedef union PGIOAlignedBlock
+{
+#ifdef pg_attribute_aligned
+ pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#endif
+ char data[BLCKSZ];
+ double force_align_d;
+ int64 force_align_i64;
+} PGIOAlignedBlock;
+
/* Same, but for an XLOG_BLCKSZ-sized buffer */
typedef union PGAlignedXLogBlock
{
+#ifdef pg_attribute_aligned
+ pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#endif
char data[XLOG_BLCKSZ];
double force_align_d;
int64 force_align_i64;
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index f2a106f983..323a4cfb4f 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -227,6 +227,13 @@
*/
#define PG_CACHE_LINE_SIZE 128
+/*
+ * Assumed memory alignment requirement for direct I/O. On currently known
+ * systems this size applies, even for memory that is backed by larger virtual
+ * memory pages.
+ */
+#define PG_IO_ALIGN_SIZE 4096
+
/*
*------------------------------------------------------------------------
* The following symbols are for enabling debugging code, not for
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7144fc9f60..85ef12c440 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -82,9 +82,10 @@ extern PGDLLIMPORT int max_safe_fds;
* to the appropriate Windows flag in src/port/open.c. We simulate it with
* fcntl(F_NOCACHE) on macOS inside fd.c's open() wrapper. We use the name
* PG_O_DIRECT rather than defining O_DIRECT in that case (probably not a good
- * idea on a Unix).
+ * idea on a Unix). We can only use it if the compiler will correctly align
+ * PGIOAlignedBlock for us, though.
*/
-#if defined(O_DIRECT)
+#if defined(O_DIRECT) && defined(pg_attribute_aligned)
#define PG_O_DIRECT O_DIRECT
#elif defined(F_NOCACHE)
#define PG_O_DIRECT 0x80000000
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 60c71d05fe..9a77664154 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1687,6 +1687,7 @@ PGEventResultDestroy
PGFInfoFunction
PGFileType
PGFunction
+PGIOAlignedBlock
PGLZ_HistEntry
PGLZ_Strategy
PGMessageField
--
2.35.1
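
A note on the alignment technique used throughout the patch above: where a buffer cannot be declared as PGIOAlignedBlock, the patch over-allocates by PG_IO_ALIGN_SIZE and rounds the start address up with TYPEALIGN (buf_init.c, localbuf.c), and md.c then asserts the result. The following is only a minimal standalone sketch of that idea, assuming plain calloc() instead of PostgreSQL's allocators. Every sketch_* name and SKETCH_* constant is hypothetical and is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for PG_IO_ALIGN_SIZE and BLCKSZ. */
#define SKETCH_IO_ALIGN 4096
#define SKETCH_BLCKSZ   8192

/* Round ptr up to the next multiple of align, which must be a power of two. */
static inline void *
sketch_align_up(void *ptr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (void *) (((uintptr_t) ptr + align - 1) & ~(align - 1));
}

/*
 * Allocate nblocks zeroed blocks whose first byte is I/O aligned.
 *
 * One extra alignment unit is requested so that rounding up never runs past
 * the end of the allocation.  The unaligned pointer is returned through *raw
 * because that is the pointer that must later be passed to free().
 */
static char *
sketch_alloc_io_aligned(size_t nblocks, void **raw)
{
    *raw = calloc(1, nblocks * SKETCH_BLCKSZ + SKETCH_IO_ALIGN);
    if (*raw == NULL)
        return NULL;
    return (char *) sketch_align_up(*raw, SKETCH_IO_ALIGN);
}

/* The condition the new md.c assertions check, written as a predicate. */
static inline int
sketch_is_io_aligned(const void *ptr)
{
    return ((uintptr_t) ptr % SKETCH_IO_ALIGN) == 0;
}
```

The cost of this approach is up to one alignment unit of padding per allocation, which is presumably why the patch leaves BufFile on PGAlignedBlock for now (see the XXX comment in buffile.c).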
Attachment: v3-0002-Add-io_direct-setting-developer-only.patch (text/x-patch; charset=US-ASCII)
From fc1ccbbfd4a0e4c29cee8695a091fea0353b442d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:54:18 +1300
Subject: [PATCH v3 2/3] Add io_direct setting (developer-only).
Provide a way to ask the kernel to use O_DIRECT (or local equivalent)
for data and WAL files. This hurts performance currently and is not
intended for end-users yet. Later proposed work would introduce our own
I/O clustering, read-ahead, etc to replace the kernel features that are
disabled with this option.
This replaces the previous logic that would use O_DIRECT for the WAL in
limited and obscure cases, now that there is an explicit setting.
Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com
---
doc/src/sgml/config.sgml | 33 +++++++
src/backend/access/transam/xlog.c | 37 ++++----
src/backend/access/transam/xlogprefetcher.c | 2 +-
src/backend/storage/buffer/bufmgr.c | 16 ++--
src/backend/storage/buffer/localbuf.c | 7 +-
src/backend/storage/file/fd.c | 87 +++++++++++++++++++
src/backend/storage/smgr/md.c | 24 +++--
src/backend/storage/smgr/smgr.c | 1 +
src/backend/utils/misc/guc_tables.c | 12 +++
src/include/storage/fd.h | 8 ++
src/include/storage/smgr.h | 1 +
src/include/utils/guc_hooks.h | 3 +
src/test/modules/test_misc/meson.build | 1 +
src/test/modules/test_misc/t/004_io_direct.pl | 40 +++++++++
14 files changed, 238 insertions(+), 34 deletions(-)
create mode 100644 src/test/modules/test_misc/t/004_io_direct.pl
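
For readers skimming the diff below: the core of the new GUC is turning a comma-separated string such as "data, wal" into a bitmask (check_io_direct) that the file-opening code then tests. Here is a rough standalone sketch of that parsing step, assuming hand-rolled tokenizing in place of SplitGUCList() and pg_strcasecmp(). The sketch_* and SKETCH_* names are hypothetical, and the handling of stray commas and whitespace is an assumption of this sketch and may differ from what SplitGUCList() accepts.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical mirrors of IO_DIRECT_DATA, IO_DIRECT_WAL, IO_DIRECT_WAL_INIT. */
#define SKETCH_DIRECT_DATA     0x01
#define SKETCH_DIRECT_WAL      0x02
#define SKETCH_DIRECT_WAL_INIT 0x04

/* Case-insensitive comparison of item[0..len) against a NUL-terminated name. */
static bool
sketch_item_is(const char *item, size_t len, const char *name)
{
    size_t i;

    if (strlen(name) != len)
        return false;
    for (i = 0; i < len; i++)
    {
        if (tolower((unsigned char) item[i]) != tolower((unsigned char) name[i]))
            return false;
    }
    return true;
}

/*
 * Parse a comma-separated list of file types into a bitmask.
 *
 * An empty string yields 0 (direct I/O disabled).  Unknown or empty items
 * make the function return false, in which case *flags is left untouched.
 */
static bool
sketch_parse_io_direct(const char *value, int *flags)
{
    const char *p = value;
    int         result = 0;

    assert(value != NULL && flags != NULL);

    while (*p != '\0')
    {
        const char *start;
        size_t      len;

        /* Skip whitespace before the item, then scan to the next comma. */
        while (isspace((unsigned char) *p))
            p++;
        start = p;
        while (*p != '\0' && *p != ',')
            p++;
        len = (size_t) (p - start);

        /* Trim whitespace after the item. */
        while (len > 0 && isspace((unsigned char) start[len - 1]))
            len--;

        if (sketch_item_is(start, len, "data"))
            result |= SKETCH_DIRECT_DATA;
        else if (sketch_item_is(start, len, "wal"))
            result |= SKETCH_DIRECT_WAL;
        else if (sketch_item_is(start, len, "wal_init"))
            result |= SKETCH_DIRECT_WAL_INIT;
        else
            return false;

        /* Step over the separator; a trailing comma is treated as an error. */
        if (*p == ',')
        {
            p++;
            if (*p == '\0')
                return false;
        }
    }

    *flags = result;
    return true;
}
```

Callers would then test individual bits, in the same way the patch checks (io_direct_flags & IO_DIRECT_DATA) before adding PG_O_DIRECT to the open flags.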
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9eedab652d..70614d4fcc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11056,6 +11056,39 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-io-direct" xreflabel="io_direct">
+ <term><varname>io_direct</varname> (<type>string</type>)
+ <indexterm>
+ <primary><varname>io_direct</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Ask the kernel to minimize caching effects for relation data and WAL
+ files using <literal>O_DIRECT</literal> (most Unix-like systems),
+ <literal>F_NOCACHE</literal> (macOS) or
+ <literal>FILE_FLAG_NO_BUFFERING</literal> (Windows).
+ </para>
+ <para>
+ May be set to an empty string (the default) to disable use of direct
+ I/O, or a comma-separated list of types of files for which direct I/O
+ is enabled. The valid types of file are <literal>data</literal> for
+ main data files, <literal>wal</literal> for WAL files, and
+ <literal>wal_init</literal> for WAL files when being initially
+ allocated.
+ </para>
+ <para>
+ Some operating systems and file systems do not support direct I/O, so
+ non-default settings may be rejected at startup, or produce I/O errors
+ at runtime.
+ </para>
+ <para>
+ Currently this feature reduces performance, and is intended for
+ developer testing only.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-post-auth-delay" xreflabel="post_auth_delay">
<term><varname>post_auth_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 172b4a2fcf..9a4f4ca711 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2925,6 +2925,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
XLogSegNo max_segno;
int fd;
int save_errno;
+ int open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
Assert(logtli != 0);
@@ -2957,8 +2958,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
unlink(tmppath);
+ if (io_direct_flags & IO_DIRECT_WAL_INIT)
+ open_flags |= PG_O_DIRECT;
+
/* do not use get_sync_bit() here --- want to fsync only at end of fill */
- fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+ fd = BasicOpenFile(tmppath, open_flags);
if (fd < 0)
ereport(ERROR,
(errcode_for_file_access(),
@@ -3350,7 +3354,7 @@ XLogFileClose(void)
* use the cache to read the WAL segment.
*/
#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
- if (!XLogIsNeeded())
+ if (!XLogIsNeeded() && (io_direct_flags & IO_DIRECT_WAL) == 0)
(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
#endif
@@ -4441,7 +4445,6 @@ show_in_hot_standby(void)
return RecoveryInProgress() ? "on" : "off";
}
-
/*
* Read the control file, set respective GUCs.
*
@@ -8025,35 +8028,27 @@ xlog_redo(XLogReaderState *record)
}
/*
- * Return the (possible) sync flag used for opening a file, depending on the
- * value of the GUC wal_sync_method.
+ * Return the extra open flags used for opening a file, depending on the
+ * value of the GUCs wal_sync_method, fsync and io_direct.
*/
static int
get_sync_bit(int method)
{
int o_direct_flag = 0;
- /* If fsync is disabled, never open in sync mode */
- if (!enableFsync)
- return 0;
-
/*
- * Optimize writes by bypassing kernel cache with O_DIRECT when using
- * O_SYNC and O_DSYNC. But only if archiving and streaming are disabled,
- * otherwise the archive command or walsender process will read the WAL
- * soon after writing it, which is guaranteed to cause a physical read if
- * we bypassed the kernel cache. We also skip the
- * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
- * reason.
- *
- * Never use O_DIRECT in walreceiver process for similar reasons; the WAL
+ * Use O_DIRECT if requested, except in walreceiver process. The WAL
* written by walreceiver is normally read by the startup process soon
- * after it's written. Also, walreceiver performs unaligned writes, which
+ * after it's written. Also, walreceiver performs unaligned writes, which
* don't work with O_DIRECT, so it is required for correctness too.
*/
- if (!XLogIsNeeded() && !AmWalReceiverProcess())
+ if ((io_direct_flags & IO_DIRECT_WAL) && !AmWalReceiverProcess())
o_direct_flag = PG_O_DIRECT;
+ /* If fsync is disabled, never open in sync mode */
+ if (!enableFsync)
+ return o_direct_flag;
+
switch (method)
{
/*
@@ -8065,7 +8060,7 @@ get_sync_bit(int method)
case SYNC_METHOD_FSYNC:
case SYNC_METHOD_FSYNC_WRITETHROUGH:
case SYNC_METHOD_FDATASYNC:
- return 0;
+ return o_direct_flag;
#ifdef O_SYNC
case SYNC_METHOD_OPEN:
return O_SYNC | o_direct_flag;
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 0cf03945ee..992256dd09 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -785,7 +785,7 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
block->prefetch_buffer = InvalidBuffer;
return LRQ_NEXT_IO;
}
- else
+ else if ((io_direct_flags & IO_DIRECT_DATA) == 0)
{
/*
* This shouldn't be possible, because we already determined
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index aba07e94c9..11c8187a55 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -535,8 +535,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
* Try to initiate an asynchronous read. This returns false in
* recovery if the relation file doesn't exist.
*/
- if (smgrprefetch(smgr_reln, forkNum, blockNum))
+ if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ smgrprefetch(smgr_reln, forkNum, blockNum))
+ {
result.initiated_io = true;
+ }
#endif /* USE_PREFETCH */
}
else
@@ -582,11 +585,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
* the kernel and therefore didn't really initiate I/O, and no way to know when
* the I/O completes other than using synchronous ReadBuffer().
*
- * 3. Otherwise, the buffer wasn't already cached by PostgreSQL, and either
+ * 3. Otherwise, the buffer wasn't already cached by PostgreSQL, and
* USE_PREFETCH is not defined (this build doesn't support prefetching due to
- * lack of a kernel facility), or the underlying relation file wasn't found and
- * we are in recovery. (If the relation file wasn't found and we are not in
- * recovery, an error is raised).
+ * lack of a kernel facility), direct I/O is enabled, or the underlying
+ * relation file wasn't found and we are in recovery. (If the relation file
+ * wasn't found and we are not in recovery, an error is raised).
*/
PrefetchBufferResult
PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
@@ -4908,6 +4911,9 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
{
PendingWriteback *pending;
+ if (io_direct_flags & IO_DIRECT_DATA)
+ return;
+
/*
* Add buffer to the pending writeback array, unless writeback control is
* disabled.
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 735f7c6018..b01e319641 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -87,8 +87,11 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
{
#ifdef USE_PREFETCH
/* Not in buffers, so initiate prefetch */
- smgrprefetch(smgr, forkNum, blockNum);
- result.initiated_io = true;
+ if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ smgrprefetch(smgr, forkNum, blockNum))
+ {
+ result.initiated_io = true;
+ }
#endif /* USE_PREFETCH */
}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f6c9382023..6d1af80f9b 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -98,7 +98,9 @@
#include "storage/fd.h"
#include "storage/ipc.h"
#include "utils/guc.h"
+#include "utils/guc_hooks.h"
#include "utils/resowner_private.h"
+#include "utils/varlena.h"
/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
#if defined(HAVE_SYNC_FILE_RANGE)
@@ -162,6 +164,9 @@ bool data_sync_retry = false;
/* How SyncDataDirectory() should do its job. */
int recovery_init_sync_method = RECOVERY_INIT_SYNC_METHOD_FSYNC;
+/* Which kinds of files should be opened with PG_O_DIRECT. */
+int io_direct_flags;
+
/* Debugging.... */
#ifdef FDDEBUG
@@ -2021,6 +2026,9 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
if (nbytes <= 0)
return;
+ if (VfdCache[file].fileFlags & PG_O_DIRECT)
+ return;
+
returnCode = FileAccess(file);
if (returnCode < 0)
return;
@@ -3737,3 +3745,82 @@ data_sync_elevel(int elevel)
{
return data_sync_retry ? elevel : PANIC;
}
+
+bool
+check_io_direct(char **newval, void **extra, GucSource source)
+{
+ int *flags = guc_malloc(ERROR, sizeof(*flags));
+
+#if PG_O_DIRECT == 0
+ if (strcmp(*newval, "") != 0)
+ {
+ GUC_check_errdetail("io_direct is not supported on this platform.");
+ return false;
+ }
+ *flags = 0;
+#else
+ List *list;
+ ListCell *l;
+
+ if (!SplitGUCList(*newval, ',', &list))
+ {
+ GUC_check_errdetail("invalid list syntax in parameter \"%s\"",
+ "io_direct");
+ return false;
+ }
+
+ *flags = 0;
+ foreach (l, list)
+ {
+ char *item = (char *) lfirst(l);
+
+ if (pg_strcasecmp(item, "data") == 0)
+ *flags |= IO_DIRECT_DATA;
+ else if (pg_strcasecmp(item, "wal") == 0)
+ *flags |= IO_DIRECT_WAL;
+ else if (pg_strcasecmp(item, "wal_init") == 0)
+ *flags |= IO_DIRECT_WAL_INIT;
+ else
+ {
+ GUC_check_errdetail("invalid option \"%s\"", item);
+ return false;
+ }
+ }
+#endif
+
+ *extra = flags;
+
+ return true;
+}
+
+extern void
+assign_io_direct(const char *newval, void *extra)
+{
+ int *flags = (int *) extra;
+
+ io_direct_flags = *flags;
+}
+
+extern const char *
+show_io_direct(void)
+{
+ static char result[80];
+
+ result[0] = 0;
+ if (io_direct_flags & IO_DIRECT_DATA)
+ strcat(result, "data");
+ if (io_direct_flags & IO_DIRECT_WAL)
+ {
+ if (result[0])
+ strcat(result, ", ");
+ strcat(result, "wal");
+ }
+ if (io_direct_flags & IO_DIRECT_WAL_INIT)
+ {
+ if (result[0])
+ strcat(result, ", ");
+ strcat(result, "wal_init");
+ }
+
+ return result;
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 3e034afdf1..38263f3d0f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -142,6 +142,16 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
+static inline int
+_mdfd_open_flags(ForkNumber forkNum)
+{
+ int flags = O_RDWR | PG_BINARY;
+
+ if (io_direct_flags & IO_DIRECT_DATA)
+ flags |= PG_O_DIRECT;
+
+ return flags;
+}
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -205,14 +215,14 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
path = relpath(reln->smgr_rlocator, forknum);
- fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+ fd = PathNameOpenFile(path, _mdfd_open_flags(forknum) | O_CREAT | O_EXCL);
if (fd < 0)
{
int save_errno = errno;
if (isRedo)
- fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+ fd = PathNameOpenFile(path, _mdfd_open_flags(forknum));
if (fd < 0)
{
/* be sure to report the error reported by create, not open */
@@ -527,7 +537,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
path = relpath(reln->smgr_rlocator, forknum);
- fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+ fd = PathNameOpenFile(path, _mdfd_open_flags(forknum));
if (fd < 0)
{
@@ -598,6 +608,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
off_t seekpos;
MdfdVec *v;
+ Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
+
v = _mdfd_getseg(reln, forknum, blocknum, false,
InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
if (v == NULL)
@@ -623,6 +635,8 @@ void
mdwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks)
{
+ Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
+
/*
* Issue flush requests in as few requests as possible; have to split at
* segment boundaries though, since those are actually separate files.
@@ -1200,7 +1214,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
fullpath = _mdfd_segpath(reln, forknum, segno);
/* open the file */
- fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | oflags);
+ fd = PathNameOpenFile(fullpath, _mdfd_open_flags(forknum) | oflags);
pfree(fullpath);
@@ -1410,7 +1424,7 @@ mdsyncfiletag(const FileTag *ftag, char *path)
strlcpy(path, p, MAXPGPATH);
pfree(p);
- file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+ file = PathNameOpenFile(path, _mdfd_open_flags(ftag->forknum));
if (file < 0)
return -1;
need_to_close = true;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index c1a5febcbf..4892920812 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -20,6 +20,7 @@
#include "access/xlogutils.h"
#include "lib/ilist.h"
#include "storage/bufmgr.h"
+#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/md.h"
#include "storage/smgr.h"
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 436afe1d21..e12bb15669 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -543,6 +543,7 @@ static char *locale_ctype;
static char *server_encoding_string;
static char *server_version_string;
static int server_version_num;
+static char *io_direct_string;
#ifdef HAVE_SYSLOG
#define DEFAULT_SYSLOG_FACILITY LOG_LOCAL0
@@ -4483,6 +4484,17 @@ struct config_string ConfigureNamesString[] =
check_backtrace_functions, assign_backtrace_functions, NULL
},
+ {
+ {"io_direct", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Use direct I/O for file access."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &io_direct_string,
+ "",
+ check_io_direct, assign_io_direct, show_io_direct
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 85ef12c440..0d65cb3c80 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -44,6 +44,8 @@
#define FD_H
#include <dirent.h>
+#include <fcntl.h>
+
typedef enum RecoveryInitSyncMethod
{
@@ -54,10 +56,16 @@ typedef enum RecoveryInitSyncMethod
typedef int File;
+#define IO_DIRECT_DATA 0x01
+#define IO_DIRECT_WAL 0x02
+#define IO_DIRECT_WAL_INIT 0x04
+
+
/* GUC parameter */
extern PGDLLIMPORT int max_files_per_process;
extern PGDLLIMPORT bool data_sync_retry;
extern PGDLLIMPORT int recovery_init_sync_method;
+extern PGDLLIMPORT int io_direct_flags;
/*
* This is private to fd.c, but exported for save/restore_backend_variables()
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a07715356b..ea7b3ff8dd 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -17,6 +17,7 @@
#include "lib/ilist.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "utils/guc.h"
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index f1a9a183b4..61a7fd77b8 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -154,5 +154,8 @@ extern bool check_wal_consistency_checking(char **newval, void **extra,
GucSource source);
extern void assign_wal_consistency_checking(const char *newval, void *extra);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+extern bool check_io_direct(char **newval, void **extra, GucSource source);
+extern void assign_io_direct(const char *newval, void *extra);
+extern const char *show_io_direct(void);
#endif /* GUC_HOOKS_H */
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index b7478c3125..bbed7093d0 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -9,6 +9,7 @@ tests += {
't/001_constraint_validation.pl',
't/002_tablespace.pl',
't/003_check_guc.pl',
+ 't/004_io_direct.pl',
],
},
}
diff --git a/src/test/modules/test_misc/t/004_io_direct.pl b/src/test/modules/test_misc/t/004_io_direct.pl
new file mode 100644
index 0000000000..803fb334e7
--- /dev/null
+++ b/src/test/modules/test_misc/t/004_io_direct.pl
@@ -0,0 +1,40 @@
+# Very simple exercise of direct I/O GUC.
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Systems that we know to have direct I/O support, and whose typical local
+# filesystems support it or at least won't fail with an error. (illumos should
+# probably be in this list, but perl reports it as solaris. Solaris should not
+# be in the list because we don't support its way of turning on direct I/O, and
+# even if we did, its version of ZFS rejects it, and OpenBSD just doesn't have
+# it.)
+if (!grep { $^O eq $_ } qw(aix darwin dragonfly freebsd linux MSWin32 netbsd))
+{
+ plan skip_all => "no direct I/O support";
+}
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->append_conf('io_direct', 'data,wal,wal_init');
+$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O
+$node->start;
+
+# Do some work that is bound to generate shared and local writes and reads as a
+# simple exercise.
+$node->safe_psql('postgres', 'create table t1 as select 1 as i from generate_series(1, 10000)');
+$node->safe_psql('postgres', 'create table t2count (i int)');
+$node->safe_psql('postgres', 'begin; create temporary table t2 as select 1 as i from generate_series(1, 10000); update t2 set i = i; insert into t2count select count(*) from t2; commit;');
+$node->safe_psql('postgres', 'update t1 set i = i');
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'), "read back from shared");
+is('10000', $node->safe_psql('postgres', 'select * from t2count'), "read back from local");
+$node->stop('immediate');
+
+$node->start;
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'), "read back from shared after crash recovery");
+$node->stop;
+
+done_testing();
--
2.35.1
v3-0003-XXX-turn-on-direct-I-O-by-default-just-for-CI.patch
From c2f343aea63f2837fa22047ef3298bddf443646a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:55:09 +1300
Subject: [PATCH v3 3/3] XXX turn on direct I/O by default, just for CI
---
src/backend/utils/misc/guc_tables.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e12bb15669..762e9f8590 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -4491,7 +4491,7 @@ struct config_string ConfigureNamesString[] =
GUC_NOT_IN_SAMPLE
},
&io_direct_string,
- "",
+ "data,wal,wal_init",
check_io_direct, assign_io_direct, show_io_direct
},
--
2.35.1
On Thu, Dec 22, 2022 at 7:34 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Dec 14, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
0001 -- David's palloc_aligned() patch https://commitfest.postgresql.org/41/3999/
0002 -- I/O-align almost all buffers used for I/O
0003 -- Add the GUCs
0004 -- Throwaway hack to make cfbot turn the GUCs on
David pushed the first as commit 439f6175, so here is a rebase of the
rest. I also fixed a couple of thinkos in the handling of systems
where we don't know how to do direct I/O. In one place I had #ifdef
PG_O_DIRECT, but that's always defined, it's just that it's 0 on
Solaris and OpenBSD, and the check to reject the GUC wasn't quite
right on such systems.
Thanks. I have some comments on
v3-0002-Add-io_direct-setting-developer-only.patch:
1. I think we don't need to overwrite the io_direct_string in
check_io_direct so that show_io_direct can be avoided.
2. check_io_direct can leak the flags memory - when io_direct is not
supported or for an invalid list syntax or an invalid option is
specified.
I have addressed my review comments as a delta patch on top of v3-0002
and added it here as v1-0001-Review-comments-io_direct-GUC.txt.
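To illustrate the shape of the fix for comment 2, here is a minimal standalone sketch, not the delta patch itself. It parses the option list into a local int and only writes the result once every item has been validated, so no early return can leak anything. All sketch_* and SKETCH_* names are hypothetical, it uses plain ISO C instead of SplitGUCList() and guc_malloc(), and unlike SplitGUCList() it silently tolerates empty items.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for IO_DIRECT_DATA and friends. */
#define SKETCH_IO_DIRECT_DATA     0x01
#define SKETCH_IO_DIRECT_WAL      0x02
#define SKETCH_IO_DIRECT_WAL_INIT 0x04

/* Case-insensitive comparison of a token of length len with a lower-case name. */
bool
sketch_token_equals(const char *tok, size_t len, const char *name)
{
    if (strlen(name) != len)
        return false;
    for (size_t i = 0; i < len; i++)
    {
        if (tolower((unsigned char) tok[i]) != name[i])
            return false;
    }
    return true;
}

/*
 * Parse a comma-separated list such as "data,wal" into a bitmask.
 * Returns false for an unknown option and leaves *flags_out untouched.
 * Nothing is allocated, so there is nothing to free on the error path.
 */
bool
sketch_parse_io_direct(const char *value, int *flags_out)
{
    const char *p = value;
    int         flags = 0;

    assert(value != NULL && flags_out != NULL);

    while (*p != '\0')
    {
        size_t      len;

        /* Skip separators and blanks between items. */
        p += strspn(p, ", \t");
        if (*p == '\0')
            break;

        len = strcspn(p, ", \t");
        if (sketch_token_equals(p, len, "data"))
            flags |= SKETCH_IO_DIRECT_DATA;
        else if (sketch_token_equals(p, len, "wal"))
            flags |= SKETCH_IO_DIRECT_WAL;
        else if (sketch_token_equals(p, len, "wal_init"))
            flags |= SKETCH_IO_DIRECT_WAL_INIT;
        else
            return false;       /* invalid option */
        p += len;
    }

    /* Publish the result only after the whole list was accepted. */
    *flags_out = flags;
    return true;
}
```

In a real check hook the final step would be the only place that allocates *extra, which is the ordering the delta patch uses.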
Some comments on the tests added:
1. Is there a way to know if Direct IO for WAL and data has been
picked up programmatically? IOW, can we know if the OS page cache is
bypassed? I know an external extension pgfincore which can help here,
but nothing in the core exists AFAICS.
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'),
"read back from shared");
+is('10000', $node->safe_psql('postgres', 'select * from t2count'),
"read back from local");
+$node->stop('immediate');
2. Can we combine these two append_conf to a single statement?
+$node->append_conf('io_direct', 'data,wal,wal_init');
+$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O
3. A nitpick: Can we split these queries multi-line instead of in a single line?
+$node->safe_psql('postgres', 'begin; create temporary table t2 as
select 1 as i from generate_series(1, 10000); update t2 set i = i;
insert into t2count select count(*) from t2; commit;');
4. I don't think we need to stop the node before the test ends, no?
+$node->stop;
+
+done_testing();
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Review-comments-io_direct-GUC.txt
From b3eed3d6fc849b9e16fbace1f37d401424f81ab0 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 25 Jan 2023 07:18:11 +0000
Subject: [PATCH v1] Review comments io_direct GUC
---
src/backend/storage/file/fd.c | 57 ++++++++++++-----------------
src/backend/utils/misc/guc_tables.c | 4 +-
src/include/utils/guc_hooks.h | 1 -
3 files changed, 25 insertions(+), 37 deletions(-)
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index eb83de4fb9..329acc2ffd 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3749,7 +3749,7 @@ data_sync_elevel(int elevel)
bool
check_io_direct(char **newval, void **extra, GucSource source)
{
- int *flags = guc_malloc(ERROR, sizeof(*flags));
+ int flags;
#if PG_O_DIRECT == 0
if (strcmp(*newval, "") != 0)
@@ -3757,38 +3757,51 @@ check_io_direct(char **newval, void **extra, GucSource source)
GUC_check_errdetail("io_direct is not supported on this platform.");
return false;
}
- *flags = 0;
+ flags = 0;
#else
- List *list;
+ List *elemlist;
ListCell *l;
+ char *rawstring;
- if (!SplitGUCList(*newval, ',', &list))
+ /* Need a modifiable copy of string */
+ rawstring = pstrdup(*newval);
+
+ if (!SplitGUCList(rawstring, ',', &elemlist))
{
GUC_check_errdetail("invalid list syntax in parameter \"%s\"",
"io_direct");
+ pfree(rawstring);
+ list_free(elemlist);
return false;
}
- *flags = 0;
- foreach (l, list)
+ flags = 0;
+ foreach (l, elemlist)
{
char *item = (char *) lfirst(l);
if (pg_strcasecmp(item, "data") == 0)
- *flags |= IO_DIRECT_DATA;
+ flags |= IO_DIRECT_DATA;
else if (pg_strcasecmp(item, "wal") == 0)
- *flags |= IO_DIRECT_WAL;
+ flags |= IO_DIRECT_WAL;
else if (pg_strcasecmp(item, "wal_init") == 0)
- *flags |= IO_DIRECT_WAL_INIT;
+ flags |= IO_DIRECT_WAL_INIT;
else
{
GUC_check_errdetail("invalid option \"%s\"", item);
+ pfree(rawstring);
+ list_free(elemlist);
return false;
}
}
+
+ pfree(rawstring);
+ list_free(elemlist);
#endif
- *extra = flags;
+ /* Save the flags in *extra, for use by assign_io_direct */
+ *extra = guc_malloc(ERROR, sizeof(int));
+ *((int *) *extra) = flags;
return true;
}
@@ -3800,27 +3813,3 @@ assign_io_direct(const char *newval, void *extra)
io_direct_flags = *flags;
}
-
-extern const char *
-show_io_direct(void)
-{
- static char result[80];
-
- result[0] = 0;
- if (io_direct_flags & IO_DIRECT_DATA)
- strcat(result, "data");
- if (io_direct_flags & IO_DIRECT_WAL)
- {
- if (result[0])
- strcat(result, ", ");
- strcat(result, "wal");
- }
- if (io_direct_flags & IO_DIRECT_WAL_INIT)
- {
- if (result[0])
- strcat(result, ", ");
- strcat(result, "wal_init");
- }
-
- return result;
-}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 9410493ae7..25b7e87abb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -4529,11 +4529,11 @@ struct config_string ConfigureNamesString[] =
{"io_direct", PGC_POSTMASTER, DEVELOPER_OPTIONS,
gettext_noop("Use direct I/O for file access."),
NULL,
- GUC_NOT_IN_SAMPLE
+ GUC_LIST_INPUT | GUC_NOT_IN_SAMPLE
},
&io_direct_string,
"",
- check_io_direct, assign_io_direct, show_io_direct
+ check_io_direct, assign_io_direct, NULL
},
/* End-of-list marker */
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index e656a16a40..b3b5148185 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -156,6 +156,5 @@ extern void assign_wal_consistency_checking(const char *newval, void *extra);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
extern bool check_io_direct(char **newval, void **extra, GucSource source);
extern void assign_io_direct(const char *newval, void *extra);
-extern const char *show_io_direct(void);
#endif /* GUC_HOOKS_H */
--
2.34.1
On Wed, Jan 25, 2023 at 8:57 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Thanks. I have some comments on
v3-0002-Add-io_direct-setting-developer-only.patch:
1. I think we don't need to overwrite the io_direct_string in
check_io_direct so that show_io_direct can be avoided.
Thanks for looking at this, and sorry for the late response. Yeah, agreed.
2. check_io_direct can leak the flags memory - when io_direct is not
supported or for an invalid list syntax or an invalid option is
specified.
I have addressed my review comments as a delta patch on top of v3-0002
and added it here as v1-0001-Review-comments-io_direct-GUC.txt.
Thanks. Your way is nicer. I merged your patch and added you as a co-author.
Some comments on the tests added:
1. Is there a way to know if Direct IO for WAL and data has been
picked up programmatically? IOW, can we know if the OS page cache is
bypassed? I know an external extension pgfincore which can help here,
but nothing in the core exists AFAICS.
Right, that extension can tell you how many pages are in the kernel
page cache which is quite interesting for this. I also once hacked up
something primitive to see *which* pages are in kernel cache, so I
could join that against pg_buffercache to measure double buffering,
when I was studying the "smile" shape where pgbench TPS goes down and
then back up again as you increase shared_buffers if the working set
is nearly as big as physical memory (code available in a link from
[1]).
Yeah, I agree it might be nice for human investigators to put
something like that in contrib/pg_buffercache, but I'm not sure you
could rely on it enough for an automated test, though, 'cause it
probably won't work on some file systems and the tests would probably
fail for random transient reasons (for example: some systems won't
kick pages out of kernel cache if they were already there, just
because we decided to open the file with O_DIRECT). (I got curious
about why mincore() wasn't standardised along with mmap() and all that
jazz; it seems the BSD and later Sun people who invented all those
interfaces didn't think that one was quite good enough[2], but every
(?) Unixoid OS copied it anyway, with variations... Apparently the
Windows thing is called VirtualQuery()).
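For anyone who wants to try the mincore() approach by hand, a minimal sketch follows. It is an assumption-laden illustration, not code from pgfincore or from this patch set: sketch_count_cached_pages() is a hypothetical helper, it assumes a Linux-style mincore() where the low bit of each vector byte means "resident", and as noted above the answer may be unreliable or unsupported on some file systems.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* exposes mincore() in glibc headers */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Count how many pages of a file are currently in the kernel page cache.
 * Returns the resident page count, or -1 on error.  *total_pages is set to
 * the number of pages the file spans (0 for an empty file or on error).
 */
long
sketch_count_cached_pages(const char *path, long *total_pages)
{
    long        pagesize = sysconf(_SC_PAGESIZE);
    struct stat st;
    unsigned char *vec;
    void       *map;
    size_t      len;
    size_t      npages;
    long        resident = 0;
    int         fd;

    assert(path != NULL && total_pages != NULL);
    *total_pages = 0;

    if (pagesize <= 0)
        return -1;
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size < 0)
    {
        close(fd);
        return -1;
    }
    if (st.st_size == 0)
    {
        close(fd);              /* mmap() rejects zero-length mappings */
        return 0;
    }
    len = (size_t) st.st_size;
    npages = (len + (size_t) pagesize - 1) / (size_t) pagesize;

    /* Mapping the file does not by itself read it into the page cache. */
    map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;

    vec = malloc(npages);
    if (vec == NULL || mincore(map, len, (void *) vec) != 0)
    {
        free(vec);
        munmap(map, len);
        return -1;
    }

    for (size_t i = 0; i < npages; i++)
    {
        if (vec[i] & 1)         /* low bit set: page is resident */
            resident++;
    }

    free(vec);
    munmap(map, len);
    *total_pages = (long) npages;
    return resident;
}
```

Running this against a relation segment before and after a scan, with and without io_direct=data, would give a rough view of double buffering, subject to the caveats above.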
2. Can we combine these two append_conf to a single statement?
+$node->append_conf('io_direct', 'data,wal,wal_init');
+$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O
OK, sure, done. And also oops, that was completely wrong and not
working the way I had it in that version...
3. A nitpick: Can we split these queries multi-line instead of in a single line?
+$node->safe_psql('postgres', 'begin; create temporary table t2 as
select 1 as i from generate_series(1, 10000); update t2 set i = i;
insert into t2count select count(*) from t2; commit;');
OK.
4. I don't think we need to stop the node before the test ends, no?
+$node->stop;
+
+done_testing();
Sure, but why not?
Otherwise, I rebased, and made a couple more changes:
I found a line of the manual about wal_sync_method that needed to be removed:
- The <literal>open_</literal>* options also use
<literal>O_DIRECT</literal> if available.
In fact that sentence didn't correctly document the behaviour in
released branches (wal_level=minimal is also required for that, so
probably very few people ever used it). I think we should adjust that
misleading sentence in back-branches, separately from this patch set.
I also updated the commit message to highlight the only expected
user-visible change for this, namely the loss of the above incorrectly
documented obscure special case, which is replaced by the less obscure
new setting io_direct=wal, if someone still wants that behaviour.
Also a few minor comment changes.
[1]: https://twitter.com/MengTangmu/status/994770040745615361
[2]: http://kos.enix.org/pub/gingell8.pdf
Attachments:
v4-0001-Introduce-PG_IO_ALIGN_SIZE-and-align-all-I-O-buff.patch
From c6e01d506762fb7c11a3fb31d56902fa53ea822b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:25:59 +1300
Subject: [PATCH v4 1/3] Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.
In order to be able to use O_DIRECT/FILE_FLAG_NO_BUFFERING on common
systems in a later commit, we need the address and length of user space
buffers to align with the sector size of the storage. O_DIRECT would
either fail to work or fail to work efficiently without that on various
platforms. Even without O_DIRECT, aligning on memory pages is known to
improve traditional buffered I/O performance.
The alignment size is set to 4096, which is enough for currently known
systems: it covers traditional 512 byte sectors and modern 4096 byte
sectors, as well as common 4096 byte memory pages. There is no standard
governing the requirements for O_DIRECT so it's possible we might have
to reconsider this approach or fail to work on some exotic system, but
for now this simplistic approach works and it can be changed at compile
time.
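As a rough illustration of the requirement described above, the following sketch checks that both the address and the length of a buffer are multiples of the alignment size, and allocates such a buffer with posix_memalign(). It is a standalone example under stated assumptions, not the patch's code: SKETCH_IO_ALIGN_SIZE mirrors the 4096 chosen for PG_IO_ALIGN_SIZE, and the sketch_* functions are hypothetical names (the patch itself uses palloc_aligned()).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* exposes posix_memalign() */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed alignment: a multiple of both 512 and 4096 byte sectors. */
#define SKETCH_IO_ALIGN_SIZE 4096

/* True if both the address and the length are suitable for direct I/O. */
bool
sketch_is_io_aligned(const void *ptr, size_t len)
{
    return ((uintptr_t) ptr % SKETCH_IO_ALIGN_SIZE) == 0 &&
        (len % SKETCH_IO_ALIGN_SIZE) == 0;
}

/*
 * Allocate a buffer whose address is I/O aligned and whose usable length is
 * rounded up to a multiple of the alignment.  Release it with free().
 */
void *
sketch_io_alloc(size_t len)
{
    void       *ptr = NULL;
    size_t      rounded;

    /* Round up; valid because the alignment is a power of two. */
    rounded = (len + SKETCH_IO_ALIGN_SIZE - 1) &
        ~((size_t) SKETCH_IO_ALIGN_SIZE - 1);

    if (posix_memalign(&ptr, SKETCH_IO_ALIGN_SIZE, rounded) != 0)
        return NULL;

    assert(sketch_is_io_aligned(ptr, rounded));
    return ptr;
}
```

A buffer offset by 512 bytes from an aligned address would pass on a device with 512 byte sectors but fail this check, which is the conservative behaviour the single 4096 constant buys.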
Three classes of I/O buffers for regular data pages are adjusted:
(1) Heap buffers are allocated with the new palloc_aligned() or
MemoryContextAllocAligned() functions introduced by commit 439f6175.
(2) Stack buffers now use a new struct PGIOAlignedBlock to respect
PG_IO_ALIGN_SIZE, if possible with this compiler. (3) The main buffer
pool is also aligned in shared memory.
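For memory that cannot come from an aligned allocator, such as the shared buffer pool and the local buffer storage in this patch, the diff over-allocates by PG_IO_ALIGN_SIZE and rounds the start address up with TYPEALIGN. The sketch below shows that technique in isolation with malloc(); the SKETCH_* macros and sketch_alloc_padded() are hypothetical stand-ins, not PostgreSQL definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_IO_ALIGN_SIZE 4096

/* Round an address up to the next multiple of a power-of-two alignment. */
#define SKETCH_TYPEALIGN(align, ptr) \
    (((uintptr_t) (ptr) + ((align) - 1)) & ~((uintptr_t) ((align) - 1)))

/*
 * Allocate nbytes plus one alignment unit of padding and return an aligned
 * pointer into that allocation.  *raw_out receives the original pointer,
 * which is the one that must later be passed to free().
 */
char *
sketch_alloc_padded(size_t nbytes, void **raw_out)
{
    void       *raw;
    char       *aligned;

    assert(raw_out != NULL);

    raw = malloc(nbytes + SKETCH_IO_ALIGN_SIZE);
    if (raw == NULL)
        return NULL;

    aligned = (char *) SKETCH_TYPEALIGN(SKETCH_IO_ALIGN_SIZE, raw);

    /* The padding guarantees the aligned region still fits. */
    assert(aligned >= (char *) raw);
    assert(aligned + nbytes <= (char *) raw + nbytes + SKETCH_IO_ALIGN_SIZE);

    *raw_out = raw;
    return aligned;
}
```

The cost is at most one alignment unit of wasted space per allocation, which is negligible for one large pool but is the reason the patch hesitates to do the same for every BufFile.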
If arbitrary alignment of stack objects is not possible with this
compiler, then completely disable the use of O_DIRECT by setting
PG_O_DIRECT to 0. (This avoids the need to consider systems that have
O_DIRECT but don't have a compiler with an extension that can align
stack objects the way we want; that could be done but we don't currently
know of any such system, so it's easier to pretend there is no O_DIRECT
support instead: that's an existing and tested class of system.)
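A stack buffer with the required alignment can be sketched as below. This is only an approximation of the idea behind PGIOAlignedBlock: it uses C11 alignas rather than a compiler-specific attribute macro, the SKETCH_* names and sizes are assumptions, and it relies on the compiler supporting over-aligned automatic variables (GCC and Clang do on common platforms).

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ        8192
#define SKETCH_IO_ALIGN_SIZE 4096

/* A block-sized buffer that carries its own I/O alignment requirement. */
typedef union SketchIOAlignedBlock
{
    alignas(SKETCH_IO_ALIGN_SIZE) char data[SKETCH_BLCKSZ];
    double      force_align_d;
    int64_t     force_align_i64;
} SketchIOAlignedBlock;

/* Returns 1 if a block declared on the stack ends up I/O aligned. */
int
sketch_stack_block_aligned(void)
{
    SketchIOAlignedBlock zerobuf;

    memset(zerobuf.data, 0, sizeof(zerobuf.data));
    assert(zerobuf.data[0] == 0);

    return ((uintptr_t) zerobuf.data % SKETCH_IO_ALIGN_SIZE) == 0;
}
```

On a compiler without such support the declaration would not be usable as written, which corresponds to the case where the patch prefers to define PG_O_DIRECT as 0 rather than carry a second code path.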
Add assertions that all buffers passed into smgrread(), smgrwrite(),
smgrextend() are correctly aligned, if PG_O_DIRECT isn't 0.
Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
---
contrib/bloom/blinsert.c | 2 +-
contrib/pg_prewarm/pg_prewarm.c | 2 +-
src/backend/access/gist/gistbuild.c | 9 +++---
src/backend/access/hash/hashpage.c | 2 +-
src/backend/access/heap/rewriteheap.c | 2 +-
src/backend/access/nbtree/nbtree.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 8 ++++--
src/backend/access/spgist/spginsert.c | 2 +-
src/backend/access/transam/generic_xlog.c | 13 ++++++---
src/backend/access/transam/xlog.c | 9 +++---
src/backend/catalog/storage.c | 2 +-
src/backend/storage/buffer/buf_init.c | 10 +++++--
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/buffer/localbuf.c | 7 +++--
src/backend/storage/file/buffile.c | 6 ++++
src/backend/storage/page/bufpage.c | 5 +++-
src/backend/storage/smgr/md.c | 15 +++++++++-
src/backend/utils/sort/logtape.c | 2 +-
src/bin/pg_checksums/pg_checksums.c | 2 +-
src/bin/pg_rewind/local_source.c | 4 +--
src/bin/pg_upgrade/file.c | 4 +--
src/common/file_utils.c | 4 +--
src/include/c.h | 34 +++++++++++++++++------
src/include/pg_config_manual.h | 6 ++++
src/include/storage/fd.h | 5 ++--
src/tools/pgindent/typedefs.list | 1 +
26 files changed, 112 insertions(+), 48 deletions(-)
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index dcd8120895..b42b9e6c41 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -166,7 +166,7 @@ blbuildempty(Relation index)
Page metapage;
/* Construct metapage. */
- metapage = (Page) palloc(BLCKSZ);
+ metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
BloomFillMetapage(index, metapage);
/*
diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 54209924ae..e464d0d4d2 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -36,7 +36,7 @@ typedef enum
PREWARM_BUFFER
} PrewarmType;
-static PGAlignedBlock blockbuffer;
+static PGIOAlignedBlock blockbuffer;
/*
* pg_prewarm(regclass, mode text, fork text,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index d2f8da5b02..5e0c1447f9 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -415,7 +415,7 @@ gist_indexsortbuild(GISTBuildState *state)
* Write an empty page as a placeholder for the root page. It will be
* replaced with the real root page at the end.
*/
- page = palloc0(BLCKSZ);
+ page = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
page, true);
state->pages_allocated++;
@@ -509,7 +509,8 @@ gist_indexsortbuild_levelstate_add(GISTBuildState *state,
levelstate->current_page++;
if (levelstate->pages[levelstate->current_page] == NULL)
- levelstate->pages[levelstate->current_page] = palloc(BLCKSZ);
+ levelstate->pages[levelstate->current_page] =
+ palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
newPage = levelstate->pages[levelstate->current_page];
gistinitpage(newPage, old_page_flags);
@@ -579,7 +580,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
/* Create page and copy data */
data = (char *) (dist->list);
- target = palloc0(BLCKSZ);
+ target = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
gistinitpage(target, isleaf ? F_LEAF : 0);
for (int i = 0; i < dist->block.num; i++)
{
@@ -630,7 +631,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
if (parent == NULL)
{
parent = palloc0(sizeof(GistSortedBuildLevelState));
- parent->pages[0] = (Page) palloc(BLCKSZ);
+ parent->pages[0] = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
parent->parent = NULL;
gistinitpage(parent->pages[0], 0);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 6d8af42260..af3a154266 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -992,7 +992,7 @@ static bool
_hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
{
BlockNumber lastblock;
- PGAlignedBlock zerobuf;
+ PGIOAlignedBlock zerobuf;
Page page;
HashPageOpaque ovflopaque;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index ae0282a70e..424958912c 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -255,7 +255,7 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
state->rs_old_rel = old_heap;
state->rs_new_rel = new_heap;
- state->rs_buffer = (Page) palloc(BLCKSZ);
+ state->rs_buffer = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 992f84834f..2df8849858 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -154,7 +154,7 @@ btbuildempty(Relation index)
Page metapage;
/* Construct metapage. */
- metapage = (Page) palloc(BLCKSZ);
+ metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 1207a49689..6ad3f3c54d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -619,7 +619,7 @@ _bt_blnewpage(uint32 level)
Page page;
BTPageOpaque opaque;
- page = (Page) palloc(BLCKSZ);
+ page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
/* Zero the page and set up standard page header info */
_bt_pageinit(page, BLCKSZ);
@@ -660,7 +660,9 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
while (blkno > wstate->btws_pages_written)
{
if (!wstate->btws_zeropage)
- wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
+ wstate->btws_zeropage = (Page) palloc_aligned(BLCKSZ,
+ PG_IO_ALIGN_SIZE,
+ MCXT_ALLOC_ZERO);
/* don't set checksum for all-zero page */
smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
wstate->btws_pages_written++,
@@ -1170,7 +1172,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* set to point to "P_NONE"). This changes the index to the "valid" state
* by filling in a valid magic number in the metapage.
*/
- metapage = (Page) palloc(BLCKSZ);
+ metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
_bt_initmetapage(metapage, rootblkno, rootlevel,
wstate->inskey->allequalimage);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 718a88335d..72d2e1551c 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -158,7 +158,7 @@ spgbuildempty(Relation index)
Page page;
/* Construct metapage. */
- page = (Page) palloc(BLCKSZ);
+ page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
SpGistInitMetapage(page);
/*
diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 9f67d1c1cd..6c68191ca6 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -58,14 +58,17 @@ typedef struct
char delta[MAX_DELTA_SIZE]; /* delta between page images */
} PageData;
-/* State of generic xlog record construction */
+/*
+ * State of generic xlog record construction. Must be allocated at an I/O
+ * aligned address.
+ */
struct GenericXLogState
{
+ /* Page images (properly aligned, must be first) */
+ PGIOAlignedBlock images[MAX_GENERIC_XLOG_PAGES];
/* Info about each page, see above */
PageData pages[MAX_GENERIC_XLOG_PAGES];
bool isLogged;
- /* Page images (properly aligned) */
- PGAlignedBlock images[MAX_GENERIC_XLOG_PAGES];
};
static void writeFragment(PageData *pageData, OffsetNumber offset,
@@ -269,7 +272,9 @@ GenericXLogStart(Relation relation)
GenericXLogState *state;
int i;
- state = (GenericXLogState *) palloc(sizeof(GenericXLogState));
+ state = (GenericXLogState *) palloc_aligned(sizeof(GenericXLogState),
+ PG_IO_ALIGN_SIZE,
+ 0);
state->isLogged = RelationNeedsWAL(relation);
for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 46821ad605..3fea8c4082 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4506,7 +4506,7 @@ XLOGShmemSize(void)
/* xlblocks array */
size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
/* extra alignment padding for XLOG I/O buffers */
- size = add_size(size, XLOG_BLCKSZ);
+ size = add_size(size, Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE));
/* and the buffers themselves */
size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
@@ -4603,10 +4603,11 @@ XLOGShmemInit(void)
/*
* Align the start of the page buffers to a full xlog block size boundary.
- * This simplifies some calculations in XLOG insertion. It is also
- * required for O_DIRECT.
+ * This simplifies some calculations in XLOG insertion. We also need I/O
+ * alignment for O_DIRECT, but that's also a power of two and usually
+ * smaller. Take the larger of the two alignment requirements.
*/
- allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
+ allocptr = (char *) TYPEALIGN(Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE), allocptr);
XLogCtl->pages = allocptr;
memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index af1491aa1d..2add053489 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -451,7 +451,7 @@ void
RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
ForkNumber forkNum, char relpersistence)
{
- PGAlignedBlock buf;
+ PGIOAlignedBlock buf;
Page page;
bool use_wal;
bool copying_initfork;
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 20946c47cb..0057443f0c 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -78,9 +78,12 @@ InitBufferPool(void)
NBuffers * sizeof(BufferDescPadded),
&foundDescs);
+ /* Align buffer pool on IO page size boundary. */
BufferBlocks = (char *)
- ShmemInitStruct("Buffer Blocks",
- NBuffers * (Size) BLCKSZ, &foundBufs);
+ TYPEALIGN(PG_IO_ALIGN_SIZE,
+ ShmemInitStruct("Buffer Blocks",
+ NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+ &foundBufs));
/* Align condition variables to cacheline boundary. */
BufferIOCVArray = (ConditionVariableMinimallyPadded *)
@@ -163,7 +166,8 @@ BufferShmemSize(void)
/* to allow aligning buffer descriptors */
size = add_size(size, PG_CACHE_LINE_SIZE);
- /* size of data pages */
+ /* size of data pages, plus alignment padding */
+ size = add_size(size, PG_IO_ALIGN_SIZE);
size = add_size(size, mul_size(NBuffers, BLCKSZ));
/* size of stuff controlled by freelist.c */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 908a8934bd..033f230b1d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4261,7 +4261,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
bool use_wal;
BlockNumber nblocks;
BlockNumber blkno;
- PGAlignedBlock buf;
+ PGIOAlignedBlock buf;
BufferAccessStrategy bstrategy_src;
BufferAccessStrategy bstrategy_dst;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3846d3eaca..aae02949ce 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -735,8 +735,11 @@ GetLocalBufferStorage(void)
/* And don't overflow MaxAllocSize, either */
num_bufs = Min(num_bufs, MaxAllocSize / BLCKSZ);
- cur_block = (char *) MemoryContextAlloc(LocalBufferContext,
- num_bufs * BLCKSZ);
+ /* Buffers should be I/O aligned. */
+ cur_block = (char *)
+ TYPEALIGN(PG_IO_ALIGN_SIZE,
+ MemoryContextAlloc(LocalBufferContext,
+ num_bufs * BLCKSZ + PG_IO_ALIGN_SIZE));
next_buf_in_block = 0;
num_bufs_in_block = num_bufs;
}
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 37ea8ac6b7..84ead85942 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -95,6 +95,12 @@ struct BufFile
off_t curOffset; /* offset part of current pos */
int pos; /* next read/write position in buffer */
int nbytes; /* total # of valid bytes in buffer */
+
+ /*
+ * XXX Should ideally use PGIOAlignedBlock, but might need a way to avoid
+ * wasting per-file alignment padding when some users create many
+ * files.
+ */
PGAlignedBlock buffer;
};
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 92994f8f39..9a302ddc30 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1522,7 +1522,10 @@ PageSetChecksumCopy(Page page, BlockNumber blkno)
* and second to avoid wasting space in processes that never call this.
*/
if (pageCopy == NULL)
- pageCopy = MemoryContextAlloc(TopMemoryContext, BLCKSZ);
+ pageCopy = MemoryContextAllocAligned(TopMemoryContext,
+ BLCKSZ,
+ PG_IO_ALIGN_SIZE,
+ 0);
memcpy(pageCopy, (char *) page, BLCKSZ);
((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1c2d1405f8..efa9773a4d 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -453,6 +453,10 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
int nbytes;
MdfdVec *v;
+ /* If this build supports direct I/O, the buffer must be I/O aligned. */
+ if (PG_O_DIRECT != 0)
+ Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
/* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
Assert(blocknum >= mdnblocks(reln, forknum));
@@ -783,6 +787,10 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
int nbytes;
MdfdVec *v;
+ /* If this build supports direct I/O, the buffer must be I/O aligned. */
+ if (PG_O_DIRECT != 0)
+ Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
reln->smgr_rlocator.locator.spcOid,
reln->smgr_rlocator.locator.dbOid,
@@ -848,6 +856,10 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
int nbytes;
MdfdVec *v;
+ /* If this build supports direct I/O, the buffer must be I/O aligned. */
+ if (PG_O_DIRECT != 0)
+ Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+
/* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
Assert(blocknum < mdnblocks(reln, forknum));
@@ -1424,7 +1436,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
*/
if (nblocks < ((BlockNumber) RELSEG_SIZE))
{
- char *zerobuf = palloc0(BLCKSZ);
+ char *zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE,
+ MCXT_ALLOC_ZERO);
mdextend(reln, forknum,
nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 64ea237438..52b8898d5e 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -252,7 +252,7 @@ ltsWriteBlock(LogicalTapeSet *lts, long blocknum, const void *buffer)
*/
while (blocknum > lts->nBlocksWritten)
{
- PGAlignedBlock zerobuf;
+ PGIOAlignedBlock zerobuf;
MemSet(zerobuf.data, 0, sizeof(zerobuf));
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index aa21007497..19eb67e485 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -183,7 +183,7 @@ skipfile(const char *fn)
static void
scan_file(const char *fn, int segmentno)
{
- PGAlignedBlock buf;
+ PGIOAlignedBlock buf;
PageHeader header = (PageHeader) buf.data;
int f;
BlockNumber blockno;
diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c
index da9d75dccb..4e2a1376c6 100644
--- a/src/bin/pg_rewind/local_source.c
+++ b/src/bin/pg_rewind/local_source.c
@@ -77,7 +77,7 @@ static void
local_queue_fetch_file(rewind_source *source, const char *path, size_t len)
{
const char *datadir = ((local_source *) source)->datadir;
- PGAlignedBlock buf;
+ PGIOAlignedBlock buf;
char srcpath[MAXPGPATH];
int srcfd;
size_t written_len;
@@ -129,7 +129,7 @@ local_queue_fetch_range(rewind_source *source, const char *path, off_t off,
size_t len)
{
const char *datadir = ((local_source *) source)->datadir;
- PGAlignedBlock buf;
+ PGIOAlignedBlock buf;
char srcpath[MAXPGPATH];
int srcfd;
off_t begin = off;
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index ed874507ff..d173602882 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -178,8 +178,8 @@ rewriteVisibilityMap(const char *fromfile, const char *tofile,
{
int src_fd;
int dst_fd;
- PGAlignedBlock buffer;
- PGAlignedBlock new_vmbuf;
+ PGIOAlignedBlock buffer;
+ PGIOAlignedBlock new_vmbuf;
ssize_t totalBytesRead = 0;
ssize_t src_filesize;
int rewriteVmBytesPerPage;
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index d568d83b9f..74833c4acb 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -540,8 +540,8 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
ssize_t
pg_pwrite_zeros(int fd, size_t size, off_t offset)
{
- static const PGAlignedBlock zbuffer = {{0}}; /* worth BLCKSZ */
- void *zerobuf_addr = unconstify(PGAlignedBlock *, &zbuffer)->data;
+ static const PGIOAlignedBlock zbuffer = {{0}}; /* worth BLCKSZ */
+ void *zerobuf_addr = unconstify(PGIOAlignedBlock *, &zbuffer)->data;
struct iovec iov[PG_IOV_MAX];
size_t remaining_size = size;
ssize_t total_written = 0;
diff --git a/src/include/c.h b/src/include/c.h
index 5fe7a97ff0..f69d739be5 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -1119,14 +1119,11 @@ extern void ExceptionalCondition(const char *conditionName,
/*
* Use this, not "char buf[BLCKSZ]", to declare a field or local variable
- * holding a page buffer, if that page might be accessed as a page and not
- * just a string of bytes. Otherwise the variable might be under-aligned,
- * causing problems on alignment-picky hardware. (In some places, we use
- * this to declare buffers even though we only pass them to read() and
- * write(), because copying to/from aligned buffers is usually faster than
- * using unaligned buffers.) We include both "double" and "int64" in the
- * union to ensure that the compiler knows the value must be MAXALIGN'ed
- * (cf. configure's computation of MAXIMUM_ALIGNOF).
+ * holding a page buffer, if that page might be accessed as a page. Otherwise
+ * the variable might be under-aligned, causing problems on alignment-picky
+ * hardware. We include both "double" and "int64" in the union to ensure that
+ * the compiler knows the value must be MAXALIGN'ed (cf. configure's
+ * computation of MAXIMUM_ALIGNOF).
*/
typedef union PGAlignedBlock
{
@@ -1135,9 +1132,30 @@ typedef union PGAlignedBlock
int64 force_align_i64;
} PGAlignedBlock;
+/*
+ * Use this to declare a field or local variable holding a page buffer, if that
+ * page might be accessed as a page or passed to an SMgr I/O function. If
+ * allocating using the MemoryContext API, the aligned allocation functions
+ * should be used with PG_IO_ALIGN_SIZE. This alignment may be more efficient
+ * for I/O in general, but may be strictly required on some platforms when
+ * using direct I/O.
+ */
+typedef union PGIOAlignedBlock
+{
+#ifdef pg_attribute_aligned
+ pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#endif
+ char data[BLCKSZ];
+ double force_align_d;
+ int64 force_align_i64;
+} PGIOAlignedBlock;
+
/* Same, but for an XLOG_BLCKSZ-sized buffer */
typedef union PGAlignedXLogBlock
{
+#ifdef pg_attribute_aligned
+ pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#endif
char data[XLOG_BLCKSZ];
double force_align_d;
int64 force_align_i64;
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index b586ee269a..c799bc2013 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -227,6 +227,12 @@
*/
#define PG_CACHE_LINE_SIZE 128
+/*
+ * Assumed alignment requirement for direct I/O. 4K corresponds to sector size
+ * on modern storage, and works also for older 512 byte sectors.
+ */
+#define PG_IO_ALIGN_SIZE 4096
+
/*
*------------------------------------------------------------------------
* The following symbols are for enabling debugging code, not for
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index daceafd473..faac4914fe 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -82,9 +82,10 @@ extern PGDLLIMPORT int max_safe_fds;
* to the appropriate Windows flag in src/port/open.c. We simulate it with
* fcntl(F_NOCACHE) on macOS inside fd.c's open() wrapper. We use the name
* PG_O_DIRECT rather than defining O_DIRECT in that case (probably not a good
- * idea on a Unix).
+ * idea on a Unix). We can only use it if the compiler will correctly align
+ * PGIOAlignedBlock for us, though.
*/
-#if defined(O_DIRECT)
+#if defined(O_DIRECT) && defined(pg_attribute_aligned)
#define PG_O_DIRECT O_DIRECT
#elif defined(F_NOCACHE)
#define PG_O_DIRECT 0x80000000
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3219ea5f05..0313b2c93a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1703,6 +1703,7 @@ PGEventResultDestroy
PGFInfoFunction
PGFileType
PGFunction
+PGIOAlignedBlock
PGLZ_HistEntry
PGLZ_Strategy
PGLoadBalanceType
--
2.39.2
v4-0002-Add-io_direct-setting-developer-only.patch
From 3db83b7289b85d1c84c5490e1d43e378b5ed3053 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:54:18 +1300
Subject: [PATCH v4 2/3] Add io_direct setting (developer-only).
Provide a way to ask the kernel to use O_DIRECT (or local equivalent)
for data and WAL files. This hurts performance currently and is not
intended for end-users yet. Later proposed work would introduce our own
I/O clustering, read-ahead, etc to replace the kernel features that are
disabled with this option.
The only user-visible change, if the developer-only GUC is not used, is
that this commit also removes the obscure logic that would activate
O_DIRECT for the WAL when wal_sync_method=open_[data]sync and
wal_level=minimal (which also requires max_wal_senders=0). Those are
non-default and unlikely settings, and this behavior wasn't (correctly)
documented. In the unlikely event that a user wants that functionality
back, io_direct=wal is a more direct way to say so.
Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com
---
doc/src/sgml/config.sgml | 34 ++++++++-
src/backend/access/transam/xlog.c | 37 ++++-----
src/backend/access/transam/xlogprefetcher.c | 2 +-
src/backend/storage/buffer/bufmgr.c | 16 ++--
src/backend/storage/buffer/localbuf.c | 7 +-
src/backend/storage/file/fd.c | 76 +++++++++++++++++++
src/backend/storage/smgr/md.c | 24 ++++--
src/backend/storage/smgr/smgr.c | 1 +
src/backend/utils/misc/guc_tables.c | 12 +++
src/include/storage/fd.h | 7 ++
src/include/storage/smgr.h | 1 +
src/include/utils/guc_hooks.h | 2 +
src/test/modules/test_misc/meson.build | 1 +
src/test/modules/test_misc/t/004_io_direct.pl | 48 ++++++++++++
14 files changed, 233 insertions(+), 35 deletions(-)
create mode 100644 src/test/modules/test_misc/t/004_io_direct.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 25111d5caf..fc885c43a8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3155,7 +3155,6 @@ include_dir 'conf.d'
</listitem>
</itemizedlist>
<para>
- The <literal>open_</literal>* options also use <literal>O_DIRECT</literal> if available.
Not all of these choices are available on all platforms.
The default is the first method in the above list that is supported
by the platform, except that <literal>fdatasync</literal> is the default on
@@ -11241,6 +11240,39 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-io-direct" xreflabel="io_direct">
+ <term><varname>io_direct</varname> (<type>string</type>)
+ <indexterm>
+ <primary><varname>io_direct</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Ask the kernel to minimize caching effects for relation data and WAL
+ files using <literal>O_DIRECT</literal> (most Unix-like systems),
+ <literal>F_NOCACHE</literal> (macOS) or
+ <literal>FILE_FLAG_NO_BUFFERING</literal> (Windows).
+ </para>
+ <para>
+ May be set to an empty string (the default) to disable use of direct
+ I/O, or a comma-separated list of types of files for which direct I/O
+ is enabled. The valid types of file are <literal>data</literal> for
+ main data files, <literal>wal</literal> for WAL files, and
+ <literal>wal_init</literal> for WAL files when being initially
+ allocated.
+ </para>
+ <para>
+ Some operating systems and file systems do not support direct I/O, so
+ non-default settings may be rejected at startup, or produce I/O errors
+ at runtime.
+ </para>
+ <para>
+ Currently this feature reduces performance, and is intended for
+ developer testing only.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-post-auth-delay" xreflabel="post_auth_delay">
<term><varname>post_auth_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3fea8c4082..7a555d8701 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2926,6 +2926,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
XLogSegNo max_segno;
int fd;
int save_errno;
+ int open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
Assert(logtli != 0);
@@ -2959,8 +2960,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
unlink(tmppath);
+ if (io_direct_flags & IO_DIRECT_WAL_INIT)
+ open_flags |= PG_O_DIRECT;
+
/* do not use get_sync_bit() here --- want to fsync only at end of fill */
- fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+ fd = BasicOpenFile(tmppath, open_flags);
if (fd < 0)
ereport(ERROR,
(errcode_for_file_access(),
@@ -3354,7 +3358,7 @@ XLogFileClose(void)
* use the cache to read the WAL segment.
*/
#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
- if (!XLogIsNeeded())
+ if (!XLogIsNeeded() && (io_direct_flags & IO_DIRECT_WAL) == 0)
(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
#endif
@@ -4445,7 +4449,6 @@ show_in_hot_standby(void)
return RecoveryInProgress() ? "on" : "off";
}
-
/*
* Read the control file, set respective GUCs.
*
@@ -8030,35 +8033,27 @@ xlog_redo(XLogReaderState *record)
}
/*
- * Return the (possible) sync flag used for opening a file, depending on the
- * value of the GUC wal_sync_method.
+ * Return the extra open flags used for opening a file, depending on the
+ * value of the GUCs wal_sync_method, fsync and io_direct.
*/
static int
get_sync_bit(int method)
{
int o_direct_flag = 0;
- /* If fsync is disabled, never open in sync mode */
- if (!enableFsync)
- return 0;
-
/*
- * Optimize writes by bypassing kernel cache with O_DIRECT when using
- * O_SYNC and O_DSYNC. But only if archiving and streaming are disabled,
- * otherwise the archive command or walsender process will read the WAL
- * soon after writing it, which is guaranteed to cause a physical read if
- * we bypassed the kernel cache. We also skip the
- * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
- * reason.
- *
- * Never use O_DIRECT in walreceiver process for similar reasons; the WAL
+ * Use O_DIRECT if requested, except in walreceiver process. The WAL
* written by walreceiver is normally read by the startup process soon
- * after it's written. Also, walreceiver performs unaligned writes, which
+ * after it's written. Also, walreceiver performs unaligned writes, which
* don't work with O_DIRECT, so it is required for correctness too.
*/
- if (!XLogIsNeeded() && !AmWalReceiverProcess())
+ if ((io_direct_flags & IO_DIRECT_WAL) && !AmWalReceiverProcess())
o_direct_flag = PG_O_DIRECT;
+ /* If fsync is disabled, never open in sync mode */
+ if (!enableFsync)
+ return o_direct_flag;
+
switch (method)
{
/*
@@ -8070,7 +8065,7 @@ get_sync_bit(int method)
case SYNC_METHOD_FSYNC:
case SYNC_METHOD_FSYNC_WRITETHROUGH:
case SYNC_METHOD_FDATASYNC:
- return 0;
+ return o_direct_flag;
#ifdef O_SYNC
case SYNC_METHOD_OPEN:
return O_SYNC | o_direct_flag;
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 046e40d143..7ba18f2a76 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -785,7 +785,7 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
block->prefetch_buffer = InvalidBuffer;
return LRQ_NEXT_IO;
}
- else
+ else if ((io_direct_flags & IO_DIRECT_DATA) == 0)
{
/*
* This shouldn't be possible, because we already determined
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 033f230b1d..5cb026d1ca 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -541,8 +541,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
* Try to initiate an asynchronous read. This returns false in
* recovery if the relation file doesn't exist.
*/
- if (smgrprefetch(smgr_reln, forkNum, blockNum))
+ if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ smgrprefetch(smgr_reln, forkNum, blockNum))
+ {
result.initiated_io = true;
+ }
#endif /* USE_PREFETCH */
}
else
@@ -588,11 +591,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
* the kernel and therefore didn't really initiate I/O, and no way to know when
* the I/O completes other than using synchronous ReadBuffer().
*
- * 3. Otherwise, the buffer wasn't already cached by PostgreSQL, and either
+ * 3. Otherwise, the buffer wasn't already cached by PostgreSQL, and
* USE_PREFETCH is not defined (this build doesn't support prefetching due to
- * lack of a kernel facility), or the underlying relation file wasn't found and
- * we are in recovery. (If the relation file wasn't found and we are not in
- * recovery, an error is raised).
+ * lack of a kernel facility), direct I/O is enabled, or the underlying
+ * relation file wasn't found and we are in recovery. (If the relation file
+ * wasn't found and we are not in recovery, an error is raised).
*/
PrefetchBufferResult
PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
@@ -5451,6 +5454,9 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
{
PendingWriteback *pending;
+ if (io_direct_flags & IO_DIRECT_DATA)
+ return;
+
/*
* Add buffer to the pending writeback array, unless writeback control is
* disabled.
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index aae02949ce..c6384c9fde 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -92,8 +92,11 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
{
#ifdef USE_PREFETCH
/* Not in buffers, so initiate prefetch */
- smgrprefetch(smgr, forkNum, blockNum);
- result.initiated_io = true;
+ if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ smgrprefetch(smgr, forkNum, blockNum))
+ {
+ result.initiated_io = true;
+ }
#endif /* USE_PREFETCH */
}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index a280a1e7be..ccc789dc03 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -98,7 +98,9 @@
#include "storage/fd.h"
#include "storage/ipc.h"
#include "utils/guc.h"
+#include "utils/guc_hooks.h"
#include "utils/resowner_private.h"
+#include "utils/varlena.h"
/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
#if defined(HAVE_SYNC_FILE_RANGE)
@@ -162,6 +164,9 @@ bool data_sync_retry = false;
/* How SyncDataDirectory() should do its job. */
int recovery_init_sync_method = RECOVERY_INIT_SYNC_METHOD_FSYNC;
+/* Which kinds of files should be opened with PG_O_DIRECT. */
+int io_direct_flags;
+
/* Debugging.... */
#ifdef FDDEBUG
@@ -2022,6 +2027,9 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
if (nbytes <= 0)
return;
+ if (VfdCache[file].fileFlags & PG_O_DIRECT)
+ return;
+
returnCode = FileAccess(file);
if (returnCode < 0)
return;
@@ -3826,3 +3834,71 @@ data_sync_elevel(int elevel)
{
return data_sync_retry ? elevel : PANIC;
}
+
+bool
+check_io_direct(char **newval, void **extra, GucSource source)
+{
+ int flags;
+
+#if PG_O_DIRECT == 0
+ if (strcmp(*newval, "") != 0)
+ {
+ GUC_check_errdetail("io_direct is not supported on this platform.");
+ return false;
+ }
+ flags = 0;
+#else
+ List *elemlist;
+ ListCell *l;
+ char *rawstring;
+
+ /* Need a modifiable copy of string */
+ rawstring = pstrdup(*newval);
+
+ if (!SplitGUCList(rawstring, ',', &elemlist))
+ {
+ GUC_check_errdetail("invalid list syntax in parameter \"%s\"",
+ "io_direct");
+ pfree(rawstring);
+ list_free(elemlist);
+ return false;
+ }
+
+ flags = 0;
+ foreach (l, elemlist)
+ {
+ char *item = (char *) lfirst(l);
+
+ if (pg_strcasecmp(item, "data") == 0)
+ flags |= IO_DIRECT_DATA;
+ else if (pg_strcasecmp(item, "wal") == 0)
+ flags |= IO_DIRECT_WAL;
+ else if (pg_strcasecmp(item, "wal_init") == 0)
+ flags |= IO_DIRECT_WAL_INIT;
+ else
+ {
+ GUC_check_errdetail("invalid option \"%s\"", item);
+ pfree(rawstring);
+ list_free(elemlist);
+ return false;
+ }
+ }
+
+ pfree(rawstring);
+ list_free(elemlist);
+#endif
+
+ /* Save the flags in *extra, for use by assign_io_direct */
+ *extra = guc_malloc(ERROR, sizeof(int));
+ *((int *) *extra) = flags;
+
+ return true;
+}
+
+extern void
+assign_io_direct(const char *newval, void *extra)
+{
+ int *flags = (int *) extra;
+
+ io_direct_flags = *flags;
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index efa9773a4d..5647abeffd 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -142,6 +142,16 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
+static inline int
+_mdfd_open_flags(ForkNumber forkNum)
+{
+ int flags = O_RDWR | PG_BINARY;
+
+ if (io_direct_flags & IO_DIRECT_DATA)
+ flags |= PG_O_DIRECT;
+
+ return flags;
+}
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -205,14 +215,14 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
path = relpath(reln->smgr_rlocator, forknum);
- fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+ fd = PathNameOpenFile(path, _mdfd_open_flags(forknum) | O_CREAT | O_EXCL);
if (fd < 0)
{
int save_errno = errno;
if (isRedo)
- fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+ fd = PathNameOpenFile(path, _mdfd_open_flags(forknum));
if (fd < 0)
{
/* be sure to report the error reported by create, not open */
@@ -635,7 +645,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
path = relpath(reln->smgr_rlocator, forknum);
- fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+ fd = PathNameOpenFile(path, _mdfd_open_flags(forknum));
if (fd < 0)
{
@@ -706,6 +716,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
off_t seekpos;
MdfdVec *v;
+ Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
+
v = _mdfd_getseg(reln, forknum, blocknum, false,
InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
if (v == NULL)
@@ -731,6 +743,8 @@ void
mdwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks)
{
+ Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
+
/*
* Issue flush requests in as few requests as possible; have to split at
* segment boundaries though, since those are actually separate files.
@@ -1330,7 +1344,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
fullpath = _mdfd_segpath(reln, forknum, segno);
/* open the file */
- fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | oflags);
+ fd = PathNameOpenFile(fullpath, _mdfd_open_flags(forknum) | oflags);
pfree(fullpath);
@@ -1540,7 +1554,7 @@ mdsyncfiletag(const FileTag *ftag, char *path)
strlcpy(path, p, MAXPGPATH);
pfree(p);
- file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+ file = PathNameOpenFile(path, _mdfd_open_flags(ftag->forknum));
if (file < 0)
return -1;
need_to_close = true;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index c37c246b77..70d0d570b1 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -20,6 +20,7 @@
#include "access/xlogutils.h"
#include "lib/ilist.h"
#include "storage/bufmgr.h"
+#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/md.h"
#include "storage/smgr.h"
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e8e8245e91..d3ed527e3b 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -568,6 +568,7 @@ static char *locale_ctype;
static char *server_encoding_string;
static char *server_version_string;
static int server_version_num;
+static char *io_direct_string;
#ifdef HAVE_SYSLOG
#define DEFAULT_SYSLOG_FACILITY LOG_LOCAL0
@@ -4565,6 +4566,17 @@ struct config_string ConfigureNamesString[] =
check_backtrace_functions, assign_backtrace_functions, NULL
},
+ {
+ {"io_direct", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Use direct I/O for file access."),
+ NULL,
+ GUC_LIST_INPUT | GUC_NOT_IN_SAMPLE
+ },
+ &io_direct_string,
+ "",
+ check_io_direct, assign_io_direct, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index faac4914fe..6791a406fc 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -44,6 +44,7 @@
#define FD_H
#include <dirent.h>
+#include <fcntl.h>
typedef enum RecoveryInitSyncMethod
{
@@ -54,10 +55,16 @@ typedef enum RecoveryInitSyncMethod
typedef int File;
+#define IO_DIRECT_DATA 0x01
+#define IO_DIRECT_WAL 0x02
+#define IO_DIRECT_WAL_INIT 0x04
+
+
/* GUC parameter */
extern PGDLLIMPORT int max_files_per_process;
extern PGDLLIMPORT bool data_sync_retry;
extern PGDLLIMPORT int recovery_init_sync_method;
+extern PGDLLIMPORT int io_direct_flags;
/*
* This is private to fd.c, but exported for save/restore_backend_variables()
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..17fba6f91a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -17,6 +17,7 @@
#include "lib/ilist.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "utils/guc.h"
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index f722fb250a..a82a85c940 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -156,5 +156,7 @@ extern bool check_wal_consistency_checking(char **newval, void **extra,
GucSource source);
extern void assign_wal_consistency_checking(const char *newval, void *extra);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+extern bool check_io_direct(char **newval, void **extra, GucSource source);
+extern void assign_io_direct(const char *newval, void *extra);
#endif /* GUC_HOOKS_H */
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 21bde427b4..911084ac0f 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -9,6 +9,7 @@ tests += {
't/001_constraint_validation.pl',
't/002_tablespace.pl',
't/003_check_guc.pl',
+ 't/004_io_direct.pl',
],
},
}
diff --git a/src/test/modules/test_misc/t/004_io_direct.pl b/src/test/modules/test_misc/t/004_io_direct.pl
new file mode 100644
index 0000000000..78646e945e
--- /dev/null
+++ b/src/test/modules/test_misc/t/004_io_direct.pl
@@ -0,0 +1,48 @@
+# Very simple exercise of direct I/O GUC.
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Systems that we know to have direct I/O support, and whose typical local
+# filesystems support it or at least won't fail with an error. (illumos should
+# probably be in this list, but perl reports it as solaris. Solaris should not
+# be in the list because we don't support its way of turning on direct I/O, and
+# even if we did, its version of ZFS rejects it, and OpenBSD just doesn't have
+# it.)
+if (!grep { $^O eq $_ } qw(aix darwin dragonfly freebsd linux MSWin32 netbsd))
+{
+ plan skip_all => "no direct I/O support";
+}
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->append_conf('postgresql.conf', qq{
+io_direct = 'data,wal,wal_init'
+shared_buffers = '256kB' # tiny to force I/O
+});
+$node->start;
+
+# Do some work that is bound to generate shared and local writes and reads as a
+# simple exercise.
+$node->safe_psql('postgres', 'create table t1 as select 1 as i from generate_series(1, 10000)');
+$node->safe_psql('postgres', 'create table t2count (i int)');
+$node->safe_psql('postgres', qq{
+begin;
+create temporary table t2 as select 1 as i from generate_series(1, 10000);
+update t2 set i = i;
+insert into t2count select count(*) from t2;
+commit;
+});
+$node->safe_psql('postgres', 'update t1 set i = i');
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'), "read back from shared");
+is('10000', $node->safe_psql('postgres', 'select * from t2count'), "read back from local");
+$node->stop('immediate');
+
+$node->start;
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'), "read back from shared after crash recovery");
+$node->stop;
+
+done_testing();
--
2.39.2
v4-0003-XXX-turn-on-direct-I-O-by-default-just-for-CI.patch
From f5318b888ad14f4f88ccd71511c64b3b990d939b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 13 Dec 2022 16:55:09 +1300
Subject: [PATCH v4 3/3] XXX turn on direct I/O by default, just for CI
---
src/backend/utils/misc/guc_tables.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index d3ed527e3b..2f95b86e19 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -4573,7 +4573,7 @@ struct config_string ConfigureNamesString[] =
GUC_LIST_INPUT | GUC_NOT_IN_SAMPLE
},
&io_direct_string,
- "",
+ "data,wal,wal_init",
check_io_direct, assign_io_direct, NULL
},
--
2.39.2
I did some testing with non-default block sizes, and found a few minor
things that needed adjustment. The short version is that I blocked
some configurations that won't work or would break an assertion.
After a bit more copy-editing on docs and comments and a round of
automated indenting, I have now pushed this. I will now watch the
build farm. I tested on quite a few OSes that I have access to, but
this is obviously a very OS-sensitive kind of a thing.
The adjustments were:
1. If you set your BLCKSZ or XLOG_BLCKSZ smaller than
PG_IO_ALIGN_SIZE, you shouldn't be allowed to turn on direct I/O for
the relevant operations, because such undersized direct I/Os will fail
on common systems.
FATAL: invalid value for parameter "io_direct": "wal"
DETAIL: io_direct is not supported for WAL because XLOG_BLCKSZ is too small
FATAL: invalid value for parameter "io_direct": "data"
DETAIL: io_direct is not supported for data because BLCKSZ is too small
In fact some systems would be OK with it if the true requirement is
512 not 4096, but (1) tiny blocks are a niche build option that
doesn't even pass regression tests, (2) it's hard and totally
unportable to find out the true requirement at runtime, and (3) the
conservative choice of 4096 has additional benefits by matching memory
pages. So I think a conservative compile-time number is a good
starting position.
2. Previously I had changed the WAL buffer alignment to be the larger
of PG_IO_ALIGN_SIZE and XLOG_BLCKSZ, but in light of the above
thinking, I reverted that part. There is no point in aligning the
address of the buffer when its size is too small for direct I/O, and
since that combination is now blocked off at GUC level we don't need
any change here.
3. I updated the md.c alignment assertions to allow for tiny blocks.
The point of these assertions is to fail if any new code does I/O from
badly aligned buffers even with io_direct turned off (i.e. how most
people hack), because that will fail with io_direct turned on. The
change is that I don't make the assertion if you're using BLCKSZ <
PG_IO_ALIGN_SIZE. Such buffers wouldn't work if used for direct I/O
but that's OK, the GUC won't allow it.
4. I made the language to explain where PG_IO_ALIGN_SIZE really comes
from a little vaguer because it's complex.
On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> After a bit more copy-editing on docs and comments and a round of
> automated indenting, I have now pushed this. I will now watch the
> build farm. I tested on quite a few OSes that I have access to, but
> this is obviously a very OS-sensitive kind of a thing.
Hmm. I see a strange "invalid page" failure on Andrew's machine crake
in 004_io_direct.pl. Let's see what else comes in.
Thomas Munro <thomas.munro@gmail.com> writes:
> I did some testing with non-default block sizes, and found a few minor
> things that needed adjustment.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2023-04-08%2004%3A42%3A04
This seems like another thing that should not have been pushed mere
hours before feature freeze.
regards, tom lane
Hi,
On 2023-04-08 16:59:20 +1200, Thomas Munro wrote:
> On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > After a bit more copy-editing on docs and comments and a round of
> > automated indenting, I have now pushed this. I will now watch the
> > build farm. I tested on quite a few OSes that I have access to, but
> > this is obviously a very OS-sensitive kind of a thing.
>
> Hmm. I see a strange "invalid page" failure on Andrew's machine crake
> in 004_io_direct.pl. Let's see what else comes in.
There were some failures in CI (e.g. [1]https://cirrus-ci.com/task/4519721039560704 and perhaps also on the
buildfarm, I didn't check yet), about "no unpinned buffers available". I was
worried for a moment that this could actually be related to the bulk extension
patch.
But it looks like it's older, and not caused by direct I/O support (except by
way of the test existing). I reproduced the issue locally by setting
shared_buffers even lower, to 16, and making the ERROR a PANIC.
#4 0x00005624dfe90336 in errfinish (filename=0x5624df6867c0 "../../../../home/andres/src/postgresql/src/backend/storage/buffer/freelist.c", lineno=353,
funcname=0x5624df686900 <__func__.6> "StrategyGetBuffer") at ../../../../home/andres/src/postgresql/src/backend/utils/error/elog.c:604
#5 0x00005624dfc71dbe in StrategyGetBuffer (strategy=0x0, buf_state=0x7ffd4182137c, from_ring=0x7ffd4182137b)
at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/freelist.c:353
#6 0x00005624dfc6a922 in GetVictimBuffer (strategy=0x0, io_context=IOCONTEXT_NORMAL)
at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1601
#7 0x00005624dfc6a29f in BufferAlloc (smgr=0x5624e1ff27f8, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=16, strategy=0x0, foundPtr=0x7ffd418214a3,
io_context=IOCONTEXT_NORMAL) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1290
#8 0x00005624dfc69c0c in ReadBuffer_common (smgr=0x5624e1ff27f8, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=16, mode=RBM_NORMAL, strategy=0x0,
hit=0x7ffd4182156b) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1056
#9 0x00005624dfc69335 in ReadBufferExtended (reln=0x5624e1ee09f0, forkNum=MAIN_FORKNUM, blockNum=16, mode=RBM_NORMAL, strategy=0x0)
at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:776
#10 0x00005624df8eb78a in log_newpage_range (rel=0x5624e1ee09f0, forknum=MAIN_FORKNUM, startblk=0, endblk=45, page_std=false)
at ../../../../home/andres/src/postgresql/src/backend/access/transam/xloginsert.c:1290
#11 0x00005624df9567e7 in smgrDoPendingSyncs (isCommit=true, isParallelWorker=false)
at ../../../../home/andres/src/postgresql/src/backend/catalog/storage.c:837
#12 0x00005624df8d1dd2 in CommitTransaction () at ../../../../home/andres/src/postgresql/src/backend/access/transam/xact.c:2225
#13 0x00005624df8d2da2 in CommitTransactionCommand () at ../../../../home/andres/src/postgresql/src/backend/access/transam/xact.c:3060
#14 0x00005624dfcbe0a1 in finish_xact_command () at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:2779
#15 0x00005624dfcbb867 in exec_simple_query (query_string=0x5624e1eacd98 "create table t1 as select 1 as i from generate_series(1, 10000)")
at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:1299
#16 0x00005624dfcc09c4 in PostgresMain (dbname=0x5624e1ee40e8 "postgres", username=0x5624e1e6c5f8 "andres")
at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4623
#17 0x00005624dfbecc03 in BackendRun (port=0x5624e1ed8250) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4461
#18 0x00005624dfbec48e in BackendStartup (port=0x5624e1ed8250) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4189
#19 0x00005624dfbe8541 in ServerLoop () at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1779
#20 0x00005624dfbe7e56 in PostmasterMain (argc=4, argv=0x5624e1e6a520) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1463
#21 0x00005624dfad538b in main (argc=4, argv=0x5624e1e6a520) at ../../../../home/andres/src/postgresql/src/backend/main/main.c:200
If you look at log_newpage_range(), it's not surprising that we get this error
- it pins up to 32 buffers at once.
Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
c6b92041d385.
It doesn't really seem OK to me to unconditionally pin 32 buffers. For the
relation extension patch I introduced LimitAdditionalPins() to deal with this
concern. Perhaps it needs to be exposed and log_newpage_range() should use
it?
Do we care about fixing this in the backbranches? Probably not, given there
haven't been user complaints?
Greetings,
Andres Freund
On Sat, Apr 8, 2023 at 4:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
After a bit more copy-editing on docs and comments and a round of
automated indenting, I have now pushed this. I will now watch the
build farm. I tested on quite a few OSes that I have access to, but
this is obviously a very OS-sensitive kind of a thing.
Hmm. I see a strange "invalid page" failure on Andrew's machine crake
in 004_io_direct.pl. Let's see what else comes in.
No particular ideas about what happened there yet. It *looks* like we
wrote out a page, and then read it back in very soon afterwards, all
via the usual locked bufmgr/smgr pathways, and it failed basic page
validation. The reader was a parallel worker, because of the
debug_parallel_query setting on that box. The page number looks
reasonable (I guess it's reading a page created by the UPDATE full of
new tuples, but I don't know which process wrote it). It's also not
immediately obvious how this could be connected to the 32 pinned
buffer problem (all that would have happened in the CREATE TABLE
process which ended already before the UPDATE and then the SELECT
backends even started).
Andrew, what file system and type of disk is that machine using?
Hi,
On 2023-04-07 23:04:08 -0700, Andres Freund wrote:
There were some failures in CI (e.g. [1], and perhaps also bf, didn't yet
check) about "no unpinned buffers available". I was worried for a moment
that this could actually be related to the bulk extension patch.
But it looks like it's older - and not caused by direct_io support (except by
way of the test existing). I reproduced the issue locally by setting s_b even
lower, to 16 and made the ERROR a PANIC.
[backtrace]
If you look at log_newpage_range(), it's not surprising that we get this error
- it pins up to 32 buffers at once.
Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
c6b92041d385.
It doesn't really seem OK to me to unconditionally pin 32 buffers. For the
relation extension patch I introduced LimitAdditionalPins() to deal with this
concern. Perhaps it needs to be exposed and log_newpage_range() should use
it?
Do we care about fixing this in the backbranches? Probably not, given there
haven't been user complaints?
Here's a quick prototype of this approach. If we expose LimitAdditionalPins(),
we'd probably want to add "Buffer" to the name, and pass it a relation, so
that it can hand off to LimitAdditionalLocalPins() when appropriate? The callsite
in question doesn't need it, but ...
Without the limiting of pins the modified 004_io_direct.pl fails 100% of the
time for me.
Presumably the reason it fails occasionally with 256kB of shared buffers
(i.e. NBuffers=32) is that autovacuum or checkpointer briefly pins a single
buffer. As log_newpage_range() thinks it can just pin 32 buffers
unconditionally, it fails in that case.
Greetings,
Andres Freund
Attachments:
limit-pins.diff (text/x-diff; charset=us-ascii)
diff --git i/src/include/storage/bufmgr.h w/src/include/storage/bufmgr.h
index 6ab00daa2ea..e5788309c86 100644
--- i/src/include/storage/bufmgr.h
+++ w/src/include/storage/bufmgr.h
@@ -223,6 +223,7 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
int nlocators);
extern void DropDatabaseBuffers(Oid dbid);
+extern void LimitAdditionalPins(uint32 *additional_pins);
#define RelationGetNumberOfBlocks(reln) \
RelationGetNumberOfBlocksInFork(reln, MAIN_FORKNUM)
diff --git i/src/backend/access/transam/xloginsert.c w/src/backend/access/transam/xloginsert.c
index e2a5a3d13ba..2189fc9f71f 100644
--- i/src/backend/access/transam/xloginsert.c
+++ w/src/backend/access/transam/xloginsert.c
@@ -1268,8 +1268,8 @@ log_newpage_range(Relation rel, ForkNumber forknum,
/*
* Iterate over all the pages in the range. They are collected into
- * batches of XLR_MAX_BLOCK_ID pages, and a single WAL-record is written
- * for each batch.
+ * batches of up to XLR_MAX_BLOCK_ID pages, and a single WAL-record is
+ * written for each batch.
*/
XLogEnsureRecordSpace(XLR_MAX_BLOCK_ID - 1, 0);
@@ -1278,14 +1278,18 @@ log_newpage_range(Relation rel, ForkNumber forknum,
{
Buffer bufpack[XLR_MAX_BLOCK_ID];
XLogRecPtr recptr;
- int nbufs;
+ uint32 limit = XLR_MAX_BLOCK_ID;
+ int nbufs;
int i;
CHECK_FOR_INTERRUPTS();
+ /* avoid running out pinnable buffers */
+ LimitAdditionalPins(&limit);
+
/* Collect a batch of blocks. */
nbufs = 0;
- while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
+ while (nbufs < limit && blkno < endblk)
{
Buffer buf = ReadBufferExtended(rel, forknum, blkno,
RBM_NORMAL, NULL);
diff --git i/src/backend/storage/buffer/bufmgr.c w/src/backend/storage/buffer/bufmgr.c
index 7778dde3e57..31c75d6240e 100644
--- i/src/backend/storage/buffer/bufmgr.c
+++ w/src/backend/storage/buffer/bufmgr.c
@@ -1742,7 +1742,7 @@ again:
* pessimistic, but outside of toy-sized shared_buffers it should allow
* sufficient pins.
*/
-static void
+void
LimitAdditionalPins(uint32 *additional_pins)
{
uint32 max_backends;
diff --git i/src/test/modules/test_misc/t/004_io_direct.pl w/src/test/modules/test_misc/t/004_io_direct.pl
index f5bf0b11e4e..5791c2ab7bd 100644
--- i/src/test/modules/test_misc/t/004_io_direct.pl
+++ w/src/test/modules/test_misc/t/004_io_direct.pl
@@ -22,7 +22,7 @@ $node->init;
$node->append_conf(
'postgresql.conf', qq{
io_direct = 'data,wal,wal_init'
-shared_buffers = '256kB' # tiny to force I/O
+shared_buffers = '16' # tiny to force I/O
});
$node->start;
Hi,
Given the frequency of failures on this in the buildfarm, I propose using the
temporary workaround of using wal_level=replica. That avoids the use of the
over-eager log_newpage_range().
Greetings,
Andres Freund
On Sun, Apr 9, 2023 at 6:55 AM Andres Freund <andres@anarazel.de> wrote:
Given the frequency of failures on this in the buildfarm, I propose using the
temporary workaround of using wal_level=replica. That avoids the use of the
over-eager log_newpage_range().
Will do.
Thomas Munro <thomas.munro@gmail.com> writes:
On Sun, Apr 9, 2023 at 6:55 AM Andres Freund <andres@anarazel.de> wrote:
Given the frequency of failures on this in the buildfarm, I propose using the
temporary workaround of using wal_level=replica. That avoids the use of the
over-eager log_newpage_range().
Will do.
Now crake is doing this:
2023-04-08 16:50:03.177 EDT [2023-04-08 16:50:03 EDT 3257645:3] 004_io_direct.pl LOG: statement: select count(*) from t1
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:1] ERROR: invalid page in block 56 of relation base/5/16384
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:2] STATEMENT: select count(*) from t1
2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:4] 004_io_direct.pl ERROR: invalid page in block 56 of relation base/5/16384
2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:5] 004_io_direct.pl STATEMENT: select count(*) from t1
2023-04-08 16:50:03.319 EDT [2023-04-08 16:50:02 EDT 3257591:4] LOG: background worker "parallel worker" (PID 3257646) exited with exit code 1
The fact that the error is happening in a parallel worker seems
interesting ...
(BTW, why are the log lines doubly timestamped?)
regards, tom lane
On Sun, Apr 9, 2023 at 9:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
2023-04-08 16:50:03.177 EDT [2023-04-08 16:50:03 EDT 3257645:3] 004_io_direct.pl LOG: statement: select count(*) from t1
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:1] ERROR: invalid page in block 56 of relation base/5/16384
The fact that the error is happening in a parallel worker seems
interesting ...
That's because it's running with debug_parallel_query=regress. I've
been trying to repro that but no luck... A different kind of failure
also showed up, where it counted the wrong number of tuples:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2023-04-08%2015%3A52%3A03
A paranoid explanation would be that this system is failing to provide
basic I/O coherency, we're writing pages out and not reading them back
in. Or of course there is a dumb bug... but why only here? Can of
course be timing-sensitive and it's interesting that crake suffers
from the "no unpinned buffers available" thing (which should now be
gone) with higher frequency; I'm keen to see if the dodgy-read problem
continues with a similar frequency now.
Hi,
On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
Now crake is doing this:
2023-04-08 16:50:03.177 EDT [2023-04-08 16:50:03 EDT 3257645:3] 004_io_direct.pl LOG: statement: select count(*) from t1
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:1] ERROR: invalid page in block 56 of relation base/5/16384
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:2] STATEMENT: select count(*) from t1
2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:4] 004_io_direct.pl ERROR: invalid page in block 56 of relation base/5/16384
2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:5] 004_io_direct.pl STATEMENT: select count(*) from t1
2023-04-08 16:50:03.319 EDT [2023-04-08 16:50:02 EDT 3257591:4] LOG: background worker "parallel worker" (PID 3257646) exited with exit code 1
The fact that the error is happening in a parallel worker seems
interesting ...
There were a few prior instances of that error. One that I hadn't seen before
is this:
[11:35:07.190](0.001s) # Failed test 'read back from shared'
# at /home/andrew/bf/root/HEAD/pgsql/src/test/modules/test_misc/t/004_io_direct.pl line 43.
[11:35:07.190](0.000s) # got: '10000'
# expected: '10098'
For one it points to the arguments to is() being switched around, but that's a
sideshow.
(BTW, why are the log lines doubly timestamped?)
It's odd.
It's also odd that it's just crake having the issue. It's just a linux host,
afaics. Andrew, is there any chance you can run that test in isolation and see
whether it reproduces? If so, does the problem vanish, if you comment out the
io_direct= in the test? Curious whether this is actually an O_DIRECT issue, or
whether it's an independent issue exposed by the new test.
I wonder if we should make the test use data checksums - if we continue to see
the wrong query results, the corruption is more likely to be in memory.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
(BTW, why are the log lines doubly timestamped?)
It's odd.
Oh, I guess that's intentional, because crake has
'log_line_prefix = \'%m [%s %p:%l] %q%a \'',
It's also odd that it's just crake having the issue. It's just a linux host,
afaics.
Indeed. I'm guessing from the compiler version that it's Fedora 37 now
(the lack of such basic information in the meson configuration output
is pretty annoying). I've been trying to repro it here on an F37 box,
with no success, suggesting that it's very timing sensitive. Or maybe
it's inside a VM and that matters?
regards, tom lane
Hi,
On 2023-04-08 17:31:02 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
It's also odd that it's just crake having the issue. It's just a linux host,
afaics.
Indeed. I'm guessing from the compiler version that it's Fedora 37 now
The 15 branch says:
hostname = neoemma
uname -m = x86_64
uname -r = 6.2.8-100.fc36.x86_64
uname -s = Linux
uname -v = #1 SMP PREEMPT_DYNAMIC Wed Mar 22 19:14:19 UTC 2023
So at least the kernel claims to be 36...
(the lack of such basic information in the meson configuration output
is pretty annoying).
Yea, I was thinking yesterday that we should add uname output to meson's
configure (if available). I'm sure we can figure out a reasonably fast windows
command for the version, too.
I've been trying to repro it here on an F37 box, with no success, suggesting
that it's very timing sensitive. Or maybe it's inside a VM and that
matters?
Could also be filesystem specific?
Greetings,
Andres Freund
On 2023-04-08 Sa 17:42, Andres Freund wrote:
Hi,
On 2023-04-08 17:31:02 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
It's also odd that it's just crake having the issue. It's just a linux host,
afaics.
Indeed. I'm guessing from the compiler version that it's Fedora 37 now
The 15 branch says:
hostname = neoemma
uname -m = x86_64
uname -r = 6.2.8-100.fc36.x86_64
uname -s = Linux
uname -v = #1 SMP PREEMPT_DYNAMIC Wed Mar 22 19:14:19 UTC 2023
So at least the kernel claims to be 36...
(the lack of such basic information in the meson configuration output
is pretty annoying).
Yea, I was thinking yesterday that we should add uname output to meson's
configure (if available). I'm sure we can figure out a reasonably fast windows
command for the version, too.
I've been trying to repro it here on an F37 box, with no success, suggesting
that it's very timing sensitive. Or maybe it's inside a VM and that
matters?
Could also be filesystem specific?
I migrated it in February from a VM to a non-virtual instance. Almost
nothing else runs on the machine. The personality info shown on the BF
server is correct.
andrew@neoemma:~ $ cat /etc/fedora-release
Fedora release 36 (Thirty Six)
andrew@neoemma:~ $ uname -a
Linux neoemma 6.2.8-100.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 22 19:14:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
andrew@neoemma:~ $ gcc --version
gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4)
andrew@neoemma:~ $ mount | grep home
/dev/mapper/luks-xxxxxxx on /home type btrfs (rw,relatime,seclabel,compress=zstd:1,ssd,discard=async,space_cache,subvolid=256,subvol=/home)
I guess it could be btrfs-specific. I'll be somewhat annoyed if I have
to re-init the machine to use something else.
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
On Sun, Apr 9, 2023 at 10:08 AM Andrew Dunstan <andrew@dunslane.net> wrote:
btrfs
Aha!
On 2023-04-08 Sa 17:23, Andres Freund wrote:
Hi,
On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
Now crake is doing this:
2023-04-08 16:50:03.177 EDT [2023-04-08 16:50:03 EDT 3257645:3] 004_io_direct.pl LOG: statement: select count(*) from t1
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:1] ERROR: invalid page in block 56 of relation base/5/16384
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:2] STATEMENT: select count(*) from t1
2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:4] 004_io_direct.pl ERROR: invalid page in block 56 of relation base/5/16384
2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:5] 004_io_direct.pl STATEMENT: select count(*) from t1
2023-04-08 16:50:03.319 EDT [2023-04-08 16:50:02 EDT 3257591:4] LOG: background worker "parallel worker" (PID 3257646) exited with exit code 1
The fact that the error is happening in a parallel worker seems
interesting ...
There were a few prior instances of that error. One that I hadn't seen before
is this:
[11:35:07.190](0.001s) # Failed test 'read back from shared'
# at /home/andrew/bf/root/HEAD/pgsql/src/test/modules/test_misc/t/004_io_direct.pl line 43.
[11:35:07.190](0.000s) # got: '10000'
# expected: '10098'
For one it points to the arguments to is() being switched around, but that's a
sideshow.
It's also odd that it's just crake having the issue. It's just a linux host,
afaics. Andrew, is there any chance you can run that test in isolation and see
whether it reproduces? If so, does the problem vanish, if you comment out the
io_direct= in the test? Curious whether this is actually an O_DIRECT issue, or
whether it's an independent issue exposed by the new test.
I wonder if we should make the test use data checksums - if we continue to see
the wrong query results, the corruption is more likely to be in memory.
I can run the test in isolation, and it gets an error reliably.
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
On Sun, Apr 9, 2023 at 10:17 AM Andrew Dunstan <andrew@dunslane.net> wrote:
I can run the test in isolation, and it gets an error reliably.
Random idea: it looks like you have compression enabled. What if you
turn it off in the directory where the test runs? Something like
btrfs property set <file> compression ... according to the
intergoogles. (I have never used btrfs before 6 minutes ago but I
can't seem to repro this with basic settings in a loopback btrfs
filesystem).
Thomas Munro <thomas.munro@gmail.com> writes:
On Sun, Apr 9, 2023 at 10:08 AM Andrew Dunstan <andrew@dunslane.net> wrote:
btrfs
Aha!
Googling finds a lot of suggestions that O_DIRECT doesn't play nice
with btrfs, for example
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg92824.html
It's not clear to me how much of that lore is still current,
but it's disturbing.
regards, tom lane
On Sun, Apr 9, 2023 at 11:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Googling finds a lot of suggestions that O_DIRECT doesn't play nice
with btrfs, for example
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg92824.html
It's not clear to me how much of that lore is still current,
but it's disturbing.
I think that particular thing might relate to modifications of the
user buffer while a write is in progress (breaking btrfs's internal
checksums). I don't think we should ever do that ourselves (not least
because it'd break our own checksums). We lock the page during the
write so no one can do that, and then we sleep in a synchronous
syscall.
Here's something recent. I guess it's probably not relevant (a fault
on our buffer that we recently touched sounds pretty unlikely), but
who knows... (developer lists for file systems are truly terrifying
places to drive through).
https://lore.kernel.org/linux-btrfs/20230315195231.GW10580@twin.jikos.cz/T/
It's odd, though, if it is their bug and not ours: I'd expect our
friends in other databases to have hit all that sort of thing years
ago, since many comparable systems have a direct I/O knob*. What are
we doing differently? Are our multiple processes a factor here,
breaking some coherency logic? Unsurprisingly, having compression on
as Andrew does actually involves buffering anyway[1] despite our
O_DIRECT flag, but maybe that's saying writes are buffered but reads
are still direct (?), which sounds like the sort of initial conditions
that might produce a coherency bug. I dunno.
I gather that btrfs is actually Fedora's default file system (or maybe
it's just "laptops and desktops"[2]?). I wonder if any of the several
green Fedora systems in the 'farm are using btrfs. I wonder if they
are using different mount options (thinking again of compression).
*Probably a good reason to add a more prominent warning that the
feature is developer-only, experimental and not for production use.
I'm thinking a warning at startup or something.
[1]: https://btrfs.readthedocs.io/en/latest/Compression.html
[2]: https://fedoraproject.org/wiki/Changes/BtrfsByDefault
Thomas Munro <thomas.munro@gmail.com> writes:
It's odd, though, if it is their bug and not ours: I'd expect our
friends in other databases to have hit all that sort of thing years
ago, since many comparable systems have a direct I/O knob*.
Yeah, it seems moderately likely that it's our own bug ... but this
code's all file-system-ignorant, so how? Maybe we are breaking some
POSIX rule that btrfs exploits but others don't?
I gather that btrfs is actually Fedora's default file system (or maybe
it's just "laptops and desktops"[2]?).
I have a ton of Fedora images laying about, and I doubt that any of them
use btrfs, mainly because that's not the default in the "server spin"
which is what I usually install from. It's hard to guess about the
buildfarm, but it wouldn't surprise me that most of them are on xfs.
(If we haven't figured this out pretty shortly, I'm probably going to
put together a btrfs-on-bare-metal machine to try to duplicate crake's
results.)
regards, tom lane
Hi,
On 2023-04-09 13:55:33 +1200, Thomas Munro wrote:
I think that particular thing might relate to modifications of the
user buffer while a write is in progress (breaking btrfs's internal
checksums). I don't think we should ever do that ourselves (not least
because it'd break our own checksums). We lock the page during the
write so no one can do that, and then we sleep in a synchronous
syscall.
Oh, but we actually *do* modify pages while IO is going on. I wonder if you
hit the jackpot here. The content lock doesn't prevent hint bit
writes. That's why we copy the page to temporary memory when computing
checksums.
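The copy-then-checksum pattern mentioned here can be sketched in C as follows. This is a toy model under stated assumptions: the sketch_ names are hypothetical, the page size of 8192 is assumed, and the checksum is a simple FNV-1a style hash chosen for brevity, not PostgreSQL's page checksum algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 8192   /* assumed page size */

/* Toy FNV-1a style checksum; NOT the algorithm PostgreSQL uses. */
static uint32_t
sketch_checksum(const unsigned char *page)
{
    uint32_t sum = 2166136261u;

    for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++)
        sum = (sum ^ page[i]) * 16777619u;
    return sum;
}

/*
 * Snapshot the shared page into a private buffer and checksum the
 * snapshot. The caller then writes private_copy, not shared_page, so a
 * concurrent hint-bit update to the shared page cannot change the bytes
 * that the checksum (or the kernel) sees during the write.
 */
static uint32_t
sketch_prepare_stable_write(const unsigned char *shared_page,
                            unsigned char *private_copy)
{
    assert(shared_page != NULL && private_copy != NULL);
    memcpy(private_copy, shared_page, SKETCH_PAGE_SIZE);
    return sketch_checksum(private_copy);
}
```

The trade-off is one extra page copy per write. Without checksums that copy is not made, which is the case in which the write is issued straight from a buffer that another process may still be changing.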
I think we should modify the test to enable checksums - if the problem goes
away, then it's likely to be related to modifying pages while an O_DIRECT
write is ongoing...
Greetings,
Andres Freund
On Sun, Apr 9, 2023 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:
On 2023-04-09 13:55:33 +1200, Thomas Munro wrote:
I think that particular thing might relate to modifications of the
user buffer while a write is in progress (breaking btrfs's internal
checksums). I don't think we should ever do that ourselves (not least
because it'd break our own checksums). We lock the page during the
write so no one can do that, and then we sleep in a synchronous
syscall.
Oh, but we actually *do* modify pages while IO is going on. I wonder if you
hit the jackpot here. The content lock doesn't prevent hint bit
writes. That's why we copy the page to temporary memory when computing
checksums.
More like the jackpot hit me.
Woo, I can now reproduce this locally on a loop filesystem.
Previously I had missed a step, the parallel worker seems to be
necessary. More soon.
On Sat, Apr 08, 2023 at 11:08:16AM -0700, Andres Freund wrote:
On 2023-04-07 23:04:08 -0700, Andres Freund wrote:
There were some failures in CI (e.g. [1], and perhaps also bf, didn't yet
check) about "no unpinned buffers available". I was worried for a moment
that this could actually be related to the bulk extension patch.
But it looks like it's older - and not caused by direct_io support (except by
way of the test existing). I reproduced the issue locally by setting s_b even
lower, to 16 and made the ERROR a PANIC.
[backtrace]
I get an ERROR, not a PANIC:
$ git rev-parse HEAD
2e57ffe12f6b5c1498f29cb7c0d9e17c797d9da6
$ git diff -U0
diff --git a/src/test/modules/test_misc/t/004_io_direct.pl b/src/test/modules/test_misc/t/004_io_direct.pl
index f5bf0b1..8f0241b 100644
--- a/src/test/modules/test_misc/t/004_io_direct.pl
+++ b/src/test/modules/test_misc/t/004_io_direct.pl
@@ -25 +25 @@ io_direct = 'data,wal,wal_init'
-shared_buffers = '256kB' # tiny to force I/O
+shared_buffers = 16
$ ./configure -C --enable-debug --enable-cassert --enable-depend --enable-tap-tests --with-tcl --with-python --with-perl
$ make -C src/test/modules/test_misc check PROVE_TESTS=t/004_io_direct.pl
# +++ tap check in src/test/modules/test_misc +++
t/004_io_direct.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
No subtests run
Test Summary Report
-------------------
t/004_io_direct.pl (Wstat: 7424 Tests: 0 Failed: 0)
Non-zero exit status: 29
Parse errors: No plan found in TAP output
Files=1, Tests=0, 1 wallclock secs ( 0.01 usr 0.00 sys + 0.41 cusr 0.14 csys = 0.56 CPU)
Result: FAIL
make: *** [../../../../src/makefiles/pgxs.mk:460: check] Error 1
$ grep pinned src/test/modules/test_misc/tmp_check/log/*
src/test/modules/test_misc/tmp_check/log/004_io_direct_main.log:2023-04-08 21:12:46.781 PDT [929628] 004_io_direct.pl ERROR: no unpinned buffers available
src/test/modules/test_misc/tmp_check/log/regress_log_004_io_direct:error running SQL: 'psql:<stdin>:1: ERROR: no unpinned buffers available'
No good reason to PANIC there, so the path to PANIC may be fixable.
If you look at log_newpage_range(), it's not surprising that we get this error
- it pins up to 32 buffers at once.
Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
c6b92041d385.
Do we care about fixing this in the backbranches? Probably not, given there
haven't been user complaints?
I would not. This is only going to come up where the user goes out of the way
to use near-minimum shared_buffers.
Here's a quick prototype of this approach.
This looks fine. I'm not enthusiastic about incurring post-startup cycles to
cater to allocating less than 512k*max_connections of shared buffers, but I
expect the cycles in question are negligible here.
Indeed, I can't reproduce this with (our) checksums on. I also can't
reproduce it with O_DIRECT off. I also can't reproduce it if I use
"mkdir pgdata && chattr +C pgdata && initdb -D pgdata" to have a
pgdata directory with copy-on-write and (their) checksums disabled.
But it reproduces quite easily with COW on (default behaviour) with
io_direct=data, debug_parallel_query=regress, create table as ...;
update ...; select count(*) ...; from that test.
Unfortunately my mental model of btrfs is extremely limited, basically
just "something a bit like ZFS". FWIW I've been casually following
along with OpenZFS's ongoing O_DIRECT project, and I know that the
plan there is to make a temporary stable copy if checksums and other
features are on (a bit like PostgreSQL does for the same reason, as
you reminded us). Time will tell how that works out but it *seems*
like all available modes would therefore work correctly for us, with
different tradeoffs (ie if you want the fastest zero-copy I/O, don't
use checksums, compression, etc).
Here, btrfs seems to be taking a different path that I can't quite
make out... I see no warning/error about a checksum failure like [1],
and we apparently managed to read something other than a mix of the
old and new page contents (which, based on your hypothesis, should
just leave it indeterminate whether the hint bit changes were captured
or not, and the rest of the page should be stable, right). It's like
the page time-travelled or got scrambled in some other way, but it
didn't tell us? I'll try to dig further...
[1]: https://archive.kernel.org/oldwiki/btrfs.wiki.kernel.org/index.php/Gotchas.html#Direct_IO_and_CRCs
On 2023-04-08 Sa 18:50, Thomas Munro wrote:
On Sun, Apr 9, 2023 at 10:17 AM Andrew Dunstan <andrew@dunslane.net> wrote:
I can run the test in isolation, and it gets an error reliably.
Random idea: it looks like you have compression enabled. What if you
turn it off in the directory where the test runs? Something like
btrfs property set <file> compression ... according to the
intergoogles. (I have never used btrfs before 6 minutes ago but I
can't seem to repro this with basic settings in a loopback btrfs
filesystem).
Didn't seem to make any difference.
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
On Sun, Apr 9, 2023 at 4:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Here, btrfs seems to be taking a different path that I can't quite
make out... I see no warning/error about a checksum failure like [1],
and we apparently managed to read something other than a mix of the
old and new page contents (which, based on your hypothesis, should
just leave it indeterminate whether the hint bit changes were captured
or not, and the rest of the page should be stable, right). It's like
the page time-travelled or got scrambled in some other way, but it
didn't tell us? I'll try to dig further...
I think there are two separate bad phenomena.
1. A concurrent modification of the user space buffer while writing
breaks the checksum so you can't read the data back in. I can
reproduce that with a stand-alone program, attached. The "verifier"
process occasionally reports EIO while reading, unless you comment out
the "scribbler" process's active line. The system log/dmesg gets some
warnings.
2. The crake-style failure doesn't involve any reported checksum
failures or errors, and I'm not sure if another process is even
involved. I attach a complete syscall trace of a repro session. (I
tried to get strace to dump 8192 byte strings, but then it doesn't
repro, so we have only the start of the data transferred for each
page.) Working back from the error message,
ERROR: invalid page in block 78 of relation base/5/16384,
we have a page at offset 638976, and we can find all system calls that
touched that offset:
[pid 26031] 23:26:48.521123 pwritev(50,
[{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=8192}], 1, 638976) = 8192
[pid 26040] 23:26:48.568975 pwrite64(5,
"\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4
\0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192
[pid 26040] 23:26:48.593157 pread64(6,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
8192, 638976) = 8192
Between the write of non-zeros and the read of zeros, nothing that I
can grok seems to happen that could justify that, but perhaps
someone else will see something that I'm missing. We pretty much just
have the parallel worker scanning the table, and writing stuff out as
it does it. This was obtained with:
strace -f --absolute-timestamps=time,us ~/install/bin/postgres -D
pgdata -c io_direct=data -c shared_buffers=256kB -c wal_level=minimal
-c max_wal_senders=0 2>&1 | tee trace.log
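For anyone who wants to repeat the search through trace.log, here is a minimal sketch in Python (not part of the original message) of how the offset can be derived from the block number and how the matching system calls can be picked out. It assumes the default 8kB block size and 1GB segment size, and that each system call sits on a single line of the log. The names block_offset and calls_at_offset are made up for illustration, and calls that strace splits into "<unfinished ...>" and "resumed" halves are simply not matched.

```python
import re

BLCKSZ = 8192         # PostgreSQL block size in a default build (assumed)
RELSEG_SIZE = 131072  # blocks per 1GB segment file in a default build (assumed)


def block_offset(block):
    """Byte offset of a block within its segment file."""
    return (block % RELSEG_SIZE) * BLCKSZ


# Positional I/O calls end with "..., <offset>) = <result>".
CALL_RE = re.compile(
    r"\b(pread64|pwrite64|preadv|pwritev)\((\d+),.*,\s*(\d+)\)\s*=\s*(-?\d+)"
)


def calls_at_offset(lines, offset):
    """Return (syscall, fd, result) for each positional I/O call at offset."""
    hits = []
    for line in lines:
        m = CALL_RE.search(line)
        if m and int(m.group(3)) == offset:
            hits.append((m.group(1), int(m.group(2)), int(m.group(4))))
    return hits


# Abbreviated lines in the style of the trace; the last one is invented
# purely to show an offset that does not match.
sample = [
    '[pid 26040] 23:26:48.568975 pwrite64(5, "\\0\\0"..., 8192, 638976) = 8192',
    '[pid 26040] 23:26:48.593157 pread64(6, "\\0\\0"..., 8192, 638976) = 8192',
    '[pid 26040] 23:26:48.600000 pread64(6, "\\0\\0"..., 8192, 647168) = 8192',
]

offset = block_offset(78)  # block number taken from the error message
for hit in calls_at_offset(sample, offset):
    print(offset, hit)
```

To run it over a real log, pass the open file instead of the sample list, for example calls_at_offset(open("trace.log", errors="replace"), block_offset(78)).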
The repro is just:
set debug_parallel_query=regress;
drop table if exists t;
create table t as select generate_series(1, 10000);
update t set generate_series = 1;
select count(*) from t;
Occasionally it fails in a different way: after create table t, later
references to t can't find it in the catalogs, but there is no invalid
page error. Perhaps the freaky zeros are happening one 4k page at a
time, and if you get two in a row the result might look like an empty
catalog page and pass validation.
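One way to probe that guess is to read a suspect 8kB block back and report which of its 4kB halves are entirely zero. This is sketched here in Python as an illustration only, not as anything taken from the original investigation. The 4096-byte granularity is the guess from the paragraph above, the helper names are hypothetical, and the read uses ordinary buffered I/O, so it may not see exactly what the server saw through O_DIRECT.

```python
BLCKSZ = 8192   # PostgreSQL block size in a default build (assumed)
OS_PAGE = 4096  # the "4k page" granularity guessed at above (assumed)


def zero_chunks(page, chunk=OS_PAGE):
    """For each chunk of the page, report whether it is all zero bytes."""
    return [not any(page[i:i + chunk]) for i in range(0, len(page), chunk)]


def read_block(path, block, blcksz=BLCKSZ):
    """Read one block from a relation segment file with buffered I/O."""
    with open(path, "rb") as f:
        f.seek(block * blcksz)
        return f.read(blcksz)


def describe(page):
    """Summarise a page as 'all zero', 'partly zero' or 'no zero chunks'."""
    flags = zero_chunks(page)
    if all(flags):
        return "all zero"      # the case guessed to look like an empty page
    if any(flags):
        return "partly zero"   # one 4k half zeroed, the other not
    return "no zero chunks"
```

For the failure in this message the call would be something like describe(read_block("base/5/16384", 78)), with the path taken relative to the data directory.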
Attachments:
repro-strace.log.gz (application/gzip)