Relation bulk write facility
Several places bypass the buffer manager and use direct smgrextend()
calls to populate a new relation: Index AM build methods, rewriteheap.c
and RelationCopyStorage(). There's a fair amount of duplicated code to
WAL-log the pages, calculate checksums, call smgrextend(), and finally
call smgrimmedsync() if needed. The duplication is tedious and
error-prone. For example, if we want to optimize by WAL-logging multiple
pages in one record, that needs to be implemented in each AM separately.
Currently only the sorted GiST index build does that, but it would be equally
beneficial in all of those places.
And I believe we got the smgrimmedsync() logic slightly wrong in a
number of places [1]. And it's not great for latency; we could let the
checkpointer do the fsyncing lazily, as Robert mentioned in the same
thread.
The attached patch centralizes that pattern to a new bulk writing
facility, and changes all those AMs to use it. The facility buffers 32
pages, WAL-logs them in one record, and calculates checksums. You could
imagine a lot of further optimizations, like writing those 32 pages in
one vectored pwritev() call [2], and not skipping the buffer manager
when the relation is small. But the scope of this initial version is
mostly to refactor the existing code.
One new optimization included here is to let the checkpointer do the
fsyncing if possible. That gives a big speedup when e.g. restoring a
schema-only dump with lots of relations.
[1]: /messages/by-id/58effc10-c160-b4a6-4eb7-384e95e6f9e3@iki.fi
[2]: /messages/by-id/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
v1-0001-Introduce-a-new-bulk-loading-facility.patch (text/x-patch, +535 -306)
On 19/09/2023 17:13, Heikki Linnakangas wrote:
The attached patch centralizes that pattern to a new bulk writing
facility, and changes all those AMs to use it.
Here's a new rebased version of the patch.
This includes fixes to the pageinspect regression test. They were
explained in the commit message, but I forgot to include the actual test
changes. That should fix the cfbot failures.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
v2-0001-Introduce-a-new-bulk-loading-facility.patch (text/x-patch, +543 -314)
Hi,
On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote:
The new facility makes it easier to optimize bulk loading, as the
logic for buffering, WAL-logging, and syncing the relation only needs
to be implemented once. It's also less error-prone: We have had a
number of bugs in how a relation is fsync'd - or not - at the end of a
bulk loading operation. By centralizing that logic to one place, we
only need to write it correctly once.
One thing I'd like to use the centralized handling for is to track such
writes in pg_stat_io. I don't mean as part of the initial patch, just that
that's another reason I like the facility.
The new facility is faster for small relations: Instead of calling
smgrimmedsync(), we register the fsync to happen at next checkpoint,
which avoids the fsync latency. That can make a big difference if you
are e.g. restoring a schema-only dump with lots of relations.
I think this would be a huge win for people running their application tests
against postgres.
+	bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
+
 	nblocks = smgrnblocks(src, forkNum);

 	for (blkno = 0; blkno < nblocks; blkno++)
 	{
+		Page		page;
+
 		/* If we got a cancel signal during the copy of the data, quit */
 		CHECK_FOR_INTERRUPTS();

-		smgrread(src, forkNum, blkno, buf.data);
+		page = bulkw_alloc_buf(bulkw);
+		smgrread(src, forkNum, blkno, page);

 		if (!PageIsVerifiedExtended(page, blkno,
 									PIV_LOG_WARNING | PIV_REPORT_STAT))
@@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 		 * page this is, so we have to log the full page including any unused
 		 * space.
 		 */
-		if (use_wal)
-			log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
-
-		PageSetChecksumInplace(page, blkno);
-
-		/*
-		 * Now write the page.  We say skipFsync = true because there's no
-		 * need for smgr to schedule an fsync for this write; we'll do it
-		 * ourselves below.
-		 */
-		smgrextend(dst, forkNum, blkno, buf.data, true);
+		bulkw_write(bulkw, blkno, page, false);
I wonder if bulkw_alloc_buf() is a good name - if you naively read this
change, it looks like it'll just leak memory. It also might be taken to be
valid until freed, which I don't think is the case?
+/*-------------------------------------------------------------------------
+ *
+ * bulk_write.c
+ *	  Efficiently and reliably populate a new relation
+ *
+ * The assumption is that no other backends access the relation while we are
+ * loading it, so we can take some shortcuts.  Alternatively, you can use the
+ * buffer manager as usual, if performance is not critical, but you must not
+ * mix operations through the buffer manager and the bulk loading interface at
+ * the same time.
From "Alternatively" onward this is somewhat confusing.
+ * We bypass the buffer manager to avoid the locking overhead, and call
+ * smgrextend() directly.  A downside is that the pages will need to be
+ * re-read into shared buffers on first use after the build finishes.  That's
+ * usually a good tradeoff for large relations, and for small relations, the
+ * overhead isn't very significant compared to creating the relation in the
+ * first place.
FWIW, I doubt the "isn't very significant" bit is really true.
+ * One tricky point is that because we bypass the buffer manager, we need to
+ * register the relation for fsyncing at the next checkpoint ourselves, and
+ * make sure that the relation is correctly fsync by us or the checkpointer
+ * even if a checkpoint happens concurrently.
"fsync'ed" or such? Otherwise this reads awkwardly for me.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/smgr/bulk_write.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xloginsert.h"
+#include "access/xlogrecord.h"
+#include "storage/bufmgr.h"
+#include "storage/bufpage.h"
+#include "storage/bulk_write.h"
+#include "storage/proc.h"
+#include "storage/smgr.h"
+#include "utils/rel.h"
+
+#define MAX_BUFFERED_PAGES XLR_MAX_BLOCK_ID
+
+typedef struct BulkWriteBuffer
+{
+	Page		page;
+	BlockNumber blkno;
+	bool		page_std;
+	int16		order;
+} BulkWriteBuffer;
+
The name makes it sound like this struct itself contains a buffer - but it's
just pointing to one. *BufferRef or such maybe?
I was wondering how you dealt with the alignment of buffers given the struct
definition, which is what led me to look at this...
+/*
+ * Bulk writer state for one relation fork.
+ */
+typedef struct BulkWriteState
+{
+	/* Information about the target relation we're writing */
+	SMgrRelation smgr;
Isn't there a danger of this becoming a dangling pointer? At least until
/messages/by-id/CA+hUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA@mail.gmail.com
is merged?
+	ForkNumber	forknum;
+	bool		use_wal;
+
+	/* We keep several pages buffered, and WAL-log them in batches */
+	int			nbuffered;
+	BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
+
+	/* Current size of the relation */
+	BlockNumber pages_written;
+
+	/* The RedoRecPtr at the time that the bulk operation started */
+	XLogRecPtr	start_RedoRecPtr;
+
+	Page		zeropage;		/* workspace for filling zeroes */
We really should just have one such page in shared memory somewhere... For WAL
writes as well.
But until then - why do you allocate the page? Seems like we could just use a
static global variable?
+/*
+ * Write all buffered pages to disk.
+ */
+static void
+bulkw_flush(BulkWriteState *bulkw)
+{
+	int			nbuffered = bulkw->nbuffered;
+	BulkWriteBuffer *buffers = bulkw->buffers;
+
+	if (nbuffered == 0)
+		return;
+
+	if (nbuffered > 1)
+	{
+		int			o;
+
+		qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
+
+		/*
+		 * Eliminate duplicates, keeping the last write of each block.
+		 * (buffer_cmp uses 'order' as the last sort key)
+		 */
Huh - which use cases would actually cause duplicate writes?
Greetings,
Andres Freund
On 19/11/2023 02:04, Andres Freund wrote:
On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote:
The new facility makes it easier to optimize bulk loading, as the
logic for buffering, WAL-logging, and syncing the relation only needs
to be implemented once. It's also less error-prone: We have had a
number of bugs in how a relation is fsync'd - or not - at the end of a
bulk loading operation. By centralizing that logic to one place, we
only need to write it correctly once.

One thing I'd like to use the centralized handling for is to track such
writes in pg_stat_io. I don't mean as part of the initial patch, just that
that's another reason I like the facility.
Oh I didn't realize they're not counted at the moment.
+	bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
+
 	nblocks = smgrnblocks(src, forkNum);

 	for (blkno = 0; blkno < nblocks; blkno++)
 	{
+		Page		page;
+
 		/* If we got a cancel signal during the copy of the data, quit */
 		CHECK_FOR_INTERRUPTS();

-		smgrread(src, forkNum, blkno, buf.data);
+		page = bulkw_alloc_buf(bulkw);
+		smgrread(src, forkNum, blkno, page);

 		if (!PageIsVerifiedExtended(page, blkno,
 									PIV_LOG_WARNING | PIV_REPORT_STAT))
@@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 		 * page this is, so we have to log the full page including any unused
 		 * space.
 		 */
-		if (use_wal)
-			log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
-
-		PageSetChecksumInplace(page, blkno);
-
-		/*
-		 * Now write the page.  We say skipFsync = true because there's no
-		 * need for smgr to schedule an fsync for this write; we'll do it
-		 * ourselves below.
-		 */
-		smgrextend(dst, forkNum, blkno, buf.data, true);
+		bulkw_write(bulkw, blkno, page, false);

I wonder if bulkw_alloc_buf() is a good name - if you naively read this
change, it looks like it'll just leak memory. It also might be taken to be
valid until freed, which I don't think is the case?
Yeah, I'm not very happy with this interface. The model is that you get
a buffer to write to by calling bulkw_alloc_buf(). Later, you hand it
over to bulkw_write(), which takes ownership of it and frees it later.
There is no other function to free it, although currently the buffer is
just palloc'd so you could call pfree on it.
However, I'd like to not expose that detail to the callers. I'm
imagining that in the future we might optimize further, by having a
larger e.g. 1 MB buffer, and carve the 8kB blocks from that. Then
opportunistically, if you fill the buffers sequentially, bulk_write.c
could do one smgrextend() to write the whole 1 MB chunk.
I renamed it to bulkw_get_buf() now, and made it return a new
BulkWriteBuffer typedef instead of a plain Page. The point of the new
typedef is to distinguish a buffer returned by bulkw_get_buf() from a
Page or char[BLCKSZ] that you might palloc on your own. That indeed
revealed some latent bugs in gistbuild.c where I had mixed up buffers
returned by bulkw_alloc_buf() and palloc'd buffers.
(The previous version of this patch called a different struct
BulkWriteBuffer, but I renamed that to PendingWrite; see below. Don't be
confused!)
I think this helps a little, but I'm still not very happy with it. I'll
give it some more thought after sleeping over it, but in the meanwhile,
I'm all ears for suggestions.
+/*-------------------------------------------------------------------------
+ *
+ * bulk_write.c
+ *	  Efficiently and reliably populate a new relation
+ *
+ * The assumption is that no other backends access the relation while we are
+ * loading it, so we can take some shortcuts.  Alternatively, you can use the
+ * buffer manager as usual, if performance is not critical, but you must not
+ * mix operations through the buffer manager and the bulk loading interface at
+ * the same time.

From "Alternatively" onward this is somewhat confusing.
Rewrote that to just "Do not mix operations through the regular buffer
manager and the bulk loading interface!"
+ * One tricky point is that because we bypass the buffer manager, we need to
+ * register the relation for fsyncing at the next checkpoint ourselves, and
+ * make sure that the relation is correctly fsync by us or the checkpointer
+ * even if a checkpoint happens concurrently.

"fsync'ed" or such? Otherwise this reads awkwardly for me.
Yep, fixed.
+typedef struct BulkWriteBuffer
+{
+	Page		page;
+	BlockNumber blkno;
+	bool		page_std;
+	int16		order;
+} BulkWriteBuffer;
+

The name makes it sound like this struct itself contains a buffer - but it's
just pointing to one. *BufferRef or such maybe?

I was wondering how you dealt with the alignment of buffers given the struct
definition, which is what led me to look at this...
I renamed this to PendingWrite, and the field that holds these
"pending_writes". Think of it as a queue of writes that haven't been
performed yet.
+/*
+ * Bulk writer state for one relation fork.
+ */
+typedef struct BulkWriteState
+{
+	/* Information about the target relation we're writing */
+	SMgrRelation smgr;

Isn't there a danger of this becoming a dangling pointer? At least until
/messages/by-id/CA+hUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA@mail.gmail.com
is merged?
Good point. I just added a FIXME comment to remind about that, hoping
that that patch gets merged soon. If not, I'll come up with a different fix.
+	ForkNumber	forknum;
+	bool		use_wal;
+
+	/* We keep several pages buffered, and WAL-log them in batches */
+	int			nbuffered;
+	BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
+
+	/* Current size of the relation */
+	BlockNumber pages_written;
+
+	/* The RedoRecPtr at the time that the bulk operation started */
+	XLogRecPtr	start_RedoRecPtr;
+
+	Page		zeropage;		/* workspace for filling zeroes */

We really should just have one such page in shared memory somewhere... For WAL
writes as well.

But until then - why do you allocate the page? Seems like we could just use a
static global variable?
I made it a static global variable for now. (The palloc way was copied
over from nbtsort.c)
+/*
+ * Write all buffered pages to disk.
+ */
+static void
+bulkw_flush(BulkWriteState *bulkw)
+{
+	int			nbuffered = bulkw->nbuffered;
+	BulkWriteBuffer *buffers = bulkw->buffers;
+
+	if (nbuffered == 0)
+		return;
+
+	if (nbuffered > 1)
+	{
+		int			o;
+
+		qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
+
+		/*
+		 * Eliminate duplicates, keeping the last write of each block.
+		 * (buffer_cmp uses 'order' as the last sort key)
+		 */

Huh - which use cases would actually cause duplicate writes?
Hmm, nothing anymore I guess. Many AMs used to write zero pages as a
placeholder and come back to fill them in later, but bulk_write.c now
handles that itself.

Removed that, and replaced it with an assertion in buffer_cmp()
that there are no duplicates.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
v3-0001-Introduce-a-new-bulk-loading-facility.patchtext/x-patch; charset=UTF-8; name=v3-0001-Introduce-a-new-bulk-loading-facility.patchDownload+556-338
Melanie just reminded me about an older thread about this same thing:
/messages/by-id/CAAKRu_ZQEpk6Q1WtNLgfXBdCmdU5xN_w0boVO6faO_Ax+ckjig@mail.gmail.com.
I had completely forgotten about that.
Melanie's patches in that thread implemented the same optimization of
avoiding the fsync() if no checkpoint has happened during the index
build. My patch here also implements batching the WAL records of
multiple blocks, which was not part of those older patches. OTOH, those
patches included an additional optimization of not bypassing the shared
buffer cache if the index is small. That seems sensible too.
In this new patch, I subconsciously implemented an API close to what I
suggested at the end of that old thread.
So I'd like to continue this effort based on this new patch. We can add
the bypass-buffer-cache optimization later on top of this. With the new
API that this introduces, it should be an isolated change to the
implementation, with no changes required to the callers.
--
Heikki Linnakangas
Neon (https://neon.tech)
On Sat, 25 Nov 2023 at 06:49, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
There are a few compilation errors reported by CFBot at [1]; the patch
needs to be rebased:
[02:38:12.675] In file included from ../../../../src/include/postgres.h:45,
[02:38:12.675] from nbtsort.c:41:
[02:38:12.675] nbtsort.c: In function ‘_bt_load’:
[02:38:12.675] nbtsort.c:1309:57: error: ‘BTPageState’ has no member
named ‘btps_page’
[02:38:12.675] 1309 | Assert(dstate->maxpostingsize <=
BTMaxItemSize(state->btps_page) &&
[02:38:12.675] | ^~
[02:38:12.675] ../../../../src/include/c.h:864:9: note: in definition
of macro ‘Assert’
[02:38:12.675] 864 | if (!(condition)) \
[02:38:12.675] | ^~~~~~~~~
[02:38:12.675] ../../../../src/include/c.h:812:29: note: in expansion
of macro ‘TYPEALIGN_DOWN’
[02:38:12.675] 812 | #define MAXALIGN_DOWN(LEN)
TYPEALIGN_DOWN(MAXIMUM_ALIGNOF, (LEN))
[02:38:12.675] | ^~~~~~~~~~~~~~
[02:38:12.675] ../../../../src/include/access/nbtree.h:165:3: note: in
expansion of macro ‘MAXALIGN_DOWN’
[02:38:12.675] 165 | (MAXALIGN_DOWN((PageGetPageSize(page) - \
[1]: https://cirrus-ci.com/task/5299954164432896
Regards,
Vignesh
On 09/01/2024 08:50, vignesh C wrote:
There are a few compilation errors reported by CFBot at [1]; the patch needs
to be rebased:
Here you go.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
v4-0001-Introduce-a-new-bulk-loading-facility.patch (text/x-patch, +553 -340)
On Fri, Nov 24, 2023 at 10:22 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Yeah, I'm not very happy with this interface. The model is that you get
a buffer to write to by calling bulkw_alloc_buf(). Later, you hand it
over to bulkw_write(), which takes ownership of it and frees it later.
There is no other function to free it, although currently the buffer is
just palloc'd so you could call pfree on it.
I think we should try to pick prefixes that are one or more words
rather than using word fragments. bulkw is an awkward prefix even for
people whose first language is English, and probably more awkward for
others.
--
Robert Haas
EDB: http://www.enterprisedb.com
2024-01 Commitfest.
Hi, this patch has a CF status of "Needs Review" [1], but it seems
there was a CFbot test failure last time it was run [2]. Please have a
look and post an updated version if necessary.
======
[1]: https://commitfest.postgresql.org/46/4575/
[2]: https://cirrus-ci.com/task/4990764426461184
Kind Regards,
Peter Smith.
On 10/01/2024 18:17, Robert Haas wrote:
I think we should try to pick prefixes that are one or more words
rather than using word fragments. bulkw is an awkward prefix even for
people whose first language is English, and probably more awkward for
others.
Renamed the 'bulkw' variables to 'bulkstate', and the functions to have
an smgr_bulk_* prefix.
I was tempted to use just bulk_* as the prefix, but I'm afraid e.g.
bulk_write() is too generic.
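With the rename, a caller of the facility presumably ends up looking something like this (an abridged, non-compilable sketch pieced together from the function names mentioned in this thread; see the patch for the exact signatures):

```c
/* Sketch only: assumes a valid SMgrRelation 'dst', and omits filling pages */
BulkWriteState *bulkstate;

bulkstate = smgr_bulk_start_smgr(dst, forkNum, use_wal);
for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
{
	BulkWriteBuffer buf = smgr_bulk_get_buf(bulkstate);

	/* ... fill the buffer with the page contents ... */
	smgr_bulk_write(bulkstate, blkno, buf, true);	/* takes ownership of buf */
}
smgr_bulk_finish(bulkstate);	/* flushes, WAL-logs, and handles the fsync */
```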
On 22/01/2024 07:50, Peter Smith wrote:
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there was a CFbot test failure last time it was run [2]. Please have a
look and post an updated version if necessary.
Fixed the headerscheck failure by adding appropriate #includes.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
v5-0001-Introduce-a-new-bulk-loading-facility.patch (text/x-patch, +558 -340)
Committed this. Thanks everyone!
--
Heikki Linnakangas
Neon (https://neon.tech)
On 23/02/2024 16:27, Heikki Linnakangas wrote:
Committed this. Thanks everyone!
Buildfarm animals 'sifaka' and 'longfin' are not happy, I will investigate..
--
Heikki Linnakangas
Neon (https://neon.tech)
Heikki Linnakangas <hlinnaka@iki.fi> writes:
Buildfarm animals 'sifaka' and 'longfin' are not happy, I will investigate..
Those are mine, let me know if you need local investigation.
regards, tom lane
On 23/02/2024 17:15, Tom Lane wrote:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
Buildfarm animals 'sifaka' and 'longfin' are not happy, I will investigate..
Those are mine, let me know if you need local investigation.
Thanks, the error message was clear enough:
bulk_write.c:78:3: error: redefinition of typedef 'BulkWriteState' is a C11 feature [-Werror,-Wtypedef-redefinition]
} BulkWriteState;
^
../../../../src/include/storage/bulk_write.h:20:31: note: previous definition is here
typedef struct BulkWriteState BulkWriteState;
^
1 error generated.
Fixed now, but I'm a bit surprised that neither other buildfarm members nor cirrus CI
caught that. I also tried to reproduce it locally by adding
-Wtypedef-redefinition, but my version of clang didn't produce any
warnings. Are there any extra compiler options on those animals or
something?
--
Heikki Linnakangas
Neon (https://neon.tech)
Heikki Linnakangas <hlinnaka@iki.fi> writes:
Thanks, the error message was clear enough:
bulk_write.c:78:3: error: redefinition of typedef 'BulkWriteState' is a C11 feature [-Werror,-Wtypedef-redefinition]
} BulkWriteState;
Fixed now, but I'm a bit surprised other buildfarm members nor cirrus CI
caught that. I also tried to reproduce it locally by adding
-Wtypedef-redefinition, but my version of clang didn't produce any
warnings. Are there any extra compiler options on those animals or
something?
They use Apple's standard compiler (clang 15 or so), but
'CC' => 'ccache clang -std=gnu99',
so maybe the -std has something to do with it. I installed that
(or -std=gnu90 as appropriate to branch) on most of my build
setups back when we started the C99 push.
regards, tom lane
On Fri, Feb 23, 2024 at 04:27:34PM +0200, Heikki Linnakangas wrote:
Committed this. Thanks everyone!
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2024-02-24%2015%3A13%3A14 got:
TRAP: failed Assert("(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)"), File: "md.c", Line: 472, PID: 43188608
with this stack trace:
#5 0x10005cf0 in ExceptionalCondition (conditionName=0x1015d790 <XLogBeginInsert+80> "`", fileName=0x0, lineNumber=16780064) at assert.c:66
#6 0x102daba8 in mdextend (reln=0x1042628c <PageSetChecksumInplace+44>, forknum=812540744, blocknum=33, buffer=0x306e6000, skipFsync=812539904) at md.c:472
#7 0x102d6760 in smgrextend (reln=0x306e6670, forknum=812540744, blocknum=33, buffer=0x306e6000, skipFsync=812539904) at smgr.c:541
#8 0x104c8dac in smgr_bulk_flush (bulkstate=0x306e6000) at bulk_write.c:245
#9 0x107baf24 in _bt_blwritepage (wstate=0x100d0a14 <datum_image_eq@AF65_7+404>, buf=0x304f13b0, blkno=807631240) at nbtsort.c:638
#10 0x107bccd8 in _bt_buildadd (wstate=0x104c9384 <smgr_bulk_start_rel+132>, state=0x304eb190, itup=0xe10, truncextra=805686672) at nbtsort.c:984
#11 0x107bc86c in _bt_sort_dedup_finish_pending (wstate=0x3b6, state=0x19, dstate=0x3) at nbtsort.c:1036
#12 0x107bc188 in _bt_load (wstate=0x10, btspool=0x0, btspool2=0x0) at nbtsort.c:1331
#13 0x107bd4ec in _bt_leafbuild (btspool=0x101589fc <ProcessInvalidationMessages+188>, btspool2=0x0) at nbtsort.c:571
#14 0x107be028 in btbuild (heap=0x304d2a00, index=0x4e1f, indexInfo=0x3) at nbtsort.c:329
#15 0x1013538c in index_build (heapRelation=0x2, indexRelation=0x10bdc518 <getopt_long+2464664>, indexInfo=0x2, isreindex=10, parallel=false) at index.c:3047
#16 0x101389e0 in index_create (heapRelation=0x1001aac0 <palloc+192>, indexRelationName=0x20 <error: Cannot access memory at address 0x20>, indexRelationId=804393376, parentIndexRelid=805686672,
parentConstraintId=268544704, relFileNumber=805309688, indexInfo=0x3009a328, indexColNames=0x30237a20, accessMethodId=403, tableSpaceId=0, collationIds=0x304d29d8, opclassIds=0x304d29f8,
opclassOptions=0x304d2a18, coloptions=0x304d2a38, reloptions=0, flags=0, constr_flags=0, allow_system_table_mods=false, is_internal=false, constraintId=0x2ff211b4) at index.c:1260
#17 0x1050342c in DefineIndex (tableId=19994, stmt=0x2ff21370, indexRelationId=0, parentIndexId=0, parentConstraintId=0, total_parts=0, is_alter_table=false, check_rights=true, check_not_in_use=true,
skip_build=false, quiet=false) at indexcmds.c:1204
#18 0x104b4474 in ProcessUtilitySlow (pstate=<error reading variable>, pstmt=0x3009a408, queryString=0x30099730 "CREATE INDEX dupindexcols_i ON dupindexcols (f1, id, f1 text_pattern_ops);",
If there are other ways I should poke at it, let me know.
On Sun, Feb 25, 2024 at 6:24 AM Noah Misch <noah@leadboat.com> wrote:
On Fri, Feb 23, 2024 at 04:27:34PM +0200, Heikki Linnakangas wrote:
Committed this. Thanks everyone!
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2024-02-24%2015%3A13%3A14 got:
TRAP: failed Assert("(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)"), File: "md.c", Line: 472, PID: 43188608
with this stack trace:
#5 0x10005cf0 in ExceptionalCondition (conditionName=0x1015d790 <XLogBeginInsert+80> "`", fileName=0x0, lineNumber=16780064) at assert.c:66
#6 0x102daba8 in mdextend (reln=0x1042628c <PageSetChecksumInplace+44>, forknum=812540744, blocknum=33, buffer=0x306e6000, skipFsync=812539904) at md.c:472
#7 0x102d6760 in smgrextend (reln=0x306e6670, forknum=812540744, blocknum=33, buffer=0x306e6000, skipFsync=812539904) at smgr.c:541
#8 0x104c8dac in smgr_bulk_flush (bulkstate=0x306e6000) at bulk_write.c:245
So that's:
static const PGIOAlignedBlock zero_buffer = {{0}}; /* worth BLCKSZ */
...
smgrextend(bulkstate->smgr, bulkstate->forknum,
bulkstate->pages_written++,
&zero_buffer,
true);
... where PGIOAlignedBlock is:
typedef union PGIOAlignedBlock
{
#ifdef pg_attribute_aligned
pg_attribute_aligned(PG_IO_ALIGN_SIZE)
#endif
char data[BLCKSZ];
...
We see this happen with both xlc and gcc (new enough to know how to do
this). One idea would be that the AIX *linker* is unable to align it,
as that is the common tool-chain component here (and unlike stack and
heap objects, this scope is the linker's job). There is a
pre-existing example of a zero-buffer that is at file scope like that:
pg_prewarm.c. Perhaps it doesn't get tested?
Hmm.
On Sun, Feb 25, 2024 at 07:52:16AM +1300, Thomas Munro wrote:
GCC docs do say "For some linkers, the maximum supported alignment may be very
very small.", but AIX "man LD" says "data sections are aligned on a boundary
so as to satisfy the alignment of all CSECTs in the sections". It also has -H
and -K flags to force some particular higher alignment.
On GNU/Linux x64, gcc correctly records alignment=2**12 for the associated
section (.rodata for bulk_write.o zero_buffer, .bss for pg_prewarm.o
blockbuffer). If I'm reading this right, neither AIX gcc nor xlc is marking
the section with sufficient alignment, in bulk_write.o or pg_prewarm.o:
$ /opt/cfarm/binutils-latest/bin/objdump --section-headers ~/farm/*/HEAD/pgsqlkeep.*/src/backend/storage/smgr/bulk_write.o
/home/nm/farm/gcc64/HEAD/pgsqlkeep.2024-02-24_00-03-22/src/backend/storage/smgr/bulk_write.o: file format aix5coff64-rs6000
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 0000277c 0000000000000000 0000000000000000 000000f0 2**2
CONTENTS, ALLOC, LOAD, RELOC, CODE
1 .data 000000e4 000000000000277c 000000000000277c 0000286c 2**3
CONTENTS, ALLOC, LOAD, RELOC, DATA
2 .debug 0001f7ea 0000000000000000 0000000000000000 00002950 2**3
CONTENTS
/home/nm/farm/xlc32/HEAD/pgsqlkeep.2024-02-24_15-12-23/src/backend/storage/smgr/bulk_write.o: file format aixcoff-rs6000
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000880 00000000 00000000 00000180 2**2
CONTENTS, ALLOC, LOAD, RELOC, CODE
1 .data 0000410c 00000880 00000880 00000a00 2**3
CONTENTS, ALLOC, LOAD, RELOC, DATA
2 .bss 00000000 0000498c 0000498c 00000000 2**3
ALLOC
3 .debug 00008448 00000000 00000000 00004b24 2**3
CONTENTS
4 .except 00000018 00000000 00000000 00004b0c 2**3
CONTENTS, LOAD
$ /opt/cfarm/binutils-latest/bin/objdump --section-headers ~/farm/*/HEAD/pgsqlkeep.*/contrib/pg_prewarm/pg_prewarm.o
/home/nm/farm/gcc32/HEAD/pgsqlkeep.2024-01-21_03-16-12/contrib/pg_prewarm/pg_prewarm.o: file format aixcoff-rs6000
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000a6c 00000000 00000000 000000b4 2**2
CONTENTS, ALLOC, LOAD, RELOC, CODE
1 .data 00000044 00000a6c 00000a6c 00000b20 2**3
CONTENTS, ALLOC, LOAD, RELOC, DATA
2 .bss 00002550 00000ab0 00000ab0 00000000 2**3
ALLOC
3 .debug 0001c50e 00000000 00000000 00000b64 2**3
CONTENTS
/home/nm/farm/gcc64/HEAD/pgsqlkeep.2024-02-15_17-13-04/contrib/pg_prewarm/pg_prewarm.o: file format aix5coff64-rs6000
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000948 0000000000000000 0000000000000000 00000138 2**2
CONTENTS, ALLOC, LOAD, RELOC, CODE
1 .data 00000078 0000000000000948 0000000000000948 00000a80 2**3
CONTENTS, ALLOC, LOAD, RELOC, DATA
2 .bss 00002640 00000000000009c0 00000000000009c0 00000000 2**3
ALLOC
3 .debug 0001d887 0000000000000000 0000000000000000 00000af8 2**3
CONTENTS
On Sun, Feb 25, 2024 at 8:50 AM Noah Misch <noah@leadboat.com> wrote:
On GNU/Linux x64, gcc correctly records alignment=2**12 for the associated
section (.rodata for bulk_write.o zero_buffer, .bss for pg_prewarm.o
blockbuffer). If I'm reading this right, neither AIX gcc nor xlc is marking
the section with sufficient alignment, in bulk_write.o or pg_prewarm.o:
Ah, that is a bit of a hazard that we should probably document.
I guess the ideas to fix this would be: use smgrzeroextend() instead
of this coding, and/or perhaps look at the coding of pg_pwrite_zeros()
(function-local static) for any other place that needs such a thing,
if it would be satisfied by function-local scope?
On Sun, Feb 25, 2024 at 9:12 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Erm, wait, how does that function-local static object work differently?