Extensible Rmgr for Table AMs

Started by Jeff Davisover 4 years ago30 messageshackers
Jump to latest
#1Jeff Davis
pgsql@j-davis.com

Motivation:

I'm working on a columnar compression AM[0]/messages/by-id/20ee0b0ae6958804a88fe9580157587720faf664.camel@j-davis.com. Currently, it uses generic
xlog, which works for crash recovery and physical replication, but not
logical decoding/replication.

Extensible rmgr would enable the table AM to support its own
redo/decode hooks and WAL format, so that it could support crash
recovery, physical replication, and logical replication.

Background:

I submitted another patch[0]/messages/by-id/20ee0b0ae6958804a88fe9580157587720faf664.camel@j-davis.com to add new logical records, which could be
used to support logical decoding directly, without the need for
extensible rmgr and without any assumptions about the table AM. This
was designed to be easy to use, but inefficient. Amit raised
concerns[1]/messages/by-id/CAA4eK1JVDnbQ80ULdZuhzQkzr_yYhfON-tg=d1U7aWjK_R1ixQ@mail.gmail.com about whether it could meet the needs of zheap. Andres
suggested (off-list) that it would be better to just tackle the
extensible rmgr problem.

The idea for extensible rmgr has been proposed before[3]/messages/by-id/1229541840.4793.79.camel@ebony.2ndQuadrant. The biggest
argument against it seemed to be that there was no complete use
case[4]/messages/by-id/20992.1232667957@sss.pgh.pa.us, so the worry was that something would be left out. Columnar is
complete enough that I think it qualifies as a good use case.

A subsequent proposal[5]/messages/by-id/1266774840.7341.29872.camel@ebony was shot down because of a (potential?) need
for catalog access[6]/messages/by-id/26134.1266776040@sss.pgh.pa.us. The attached patch does not use the catalog;
instead, it relies on table AM authors choosing IDs that don't conflict
with each other. This seems like a reasonable answer, considering that
there will likely be very few table AMs that go far enough to fully
support WAL including decoding.

Are there any other major arguments/objections that I missed?

Proposal:

The attached patch (against v14, so it's easier to test columnar) is
somewhat like a simplified version of [3]/messages/by-id/1229541840.4793.79.camel@ebony.2ndQuadrant combined with refactoring to
make decoding a part of the rmgr.

* adds a new RmgrData method rm_decode
* refactors decode.c to use
the new method
* add a layer of indirection GetRmgr to find an rmgr

* fast path to find builtin rmgr in RmgrTable
* to find a custom
rmgr, traverses list of custom rmgrs that
are currently loaded
(unlikely to ever be more than a few)
* rmgr IDs from 0-127 are
reserved for builtin rmgrs
* rmgr IDs from 128-255 are reserved for
custom rmgrs
* table AM authors need to avoid collisions between
rmgr IDs

I have tested with columnar using a simple WAL format for logical
decoding only, and I'm still using generic xlog for recovery and
physical replication. I haven't tested the redo path, or how easy it
might be to do something like generic xlog.

Questions:

0. Do we want to go this route, or something simpler like my other
proposal, which introduces new logical record types[0]/messages/by-id/20ee0b0ae6958804a88fe9580157587720faf664.camel@j-davis.com?

1. I am allocating the custom rmgr list in TopMemoryContext, and it
only works when loading as a part of shared_preload_libraries. This
avoids the need for shared memory in Simon's patch[3]/messages/by-id/1229541840.4793.79.camel@ebony.2ndQuadrant. Is that the
right thing to do?

2. If we go this route, what do we do with generic xlog? It seems like
a half feature, since it doesn't work with logical decoding.

3. If the custom rmgr throws an error during redo, the server won't
start. Should we have a GUC to turn non-builtin redo into a no-op to
reduce the impact of bugs in the implementation of a custom rmgr?

4. Do we want to encourage index AMs to use this mechanism as well? I
didn't really look into how suitable it is, but at a high level it
seems reasonable.

Regards,
Jeff Davis

[0]: /messages/by-id/20ee0b0ae6958804a88fe9580157587720faf664.camel@j-davis.com
/messages/by-id/20ee0b0ae6958804a88fe9580157587720faf664.camel@j-davis.com
[1]: /messages/by-id/CAA4eK1JVDnbQ80ULdZuhzQkzr_yYhfON-tg=d1U7aWjK_R1ixQ@mail.gmail.com
/messages/by-id/CAA4eK1JVDnbQ80ULdZuhzQkzr_yYhfON-tg=d1U7aWjK_R1ixQ@mail.gmail.com
[2]: https://github.com/citusdata/citus/tree/master/src/backend/columnar
[3]: /messages/by-id/1229541840.4793.79.camel@ebony.2ndQuadrant
[4]: /messages/by-id/20992.1232667957@sss.pgh.pa.us
[5]: /messages/by-id/1266774840.7341.29872.camel@ebony
[6]: /messages/by-id/26134.1266776040@sss.pgh.pa.us

Attachments:

extensible-rmgr.difftext/x-patch; charset=UTF-8; name=extensible-rmgr.diffDownload+171-127
#2Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#1)
Re: Extensible Rmgr for Table AMs

On Mon, 2021-11-08 at 15:36 -0800, Jeff Davis wrote:

The attached patch (against v14, so it's easier to test columnar) is
somewhat like a simplified version of [3] combined with refactoring
to
make decoding a part of the rmgr.

I created a wiki page here:
https://wiki.postgresql.org/wiki/ExtensibleRmgr

To coordinate reservation of RmgrIds, to avoid conflicts. I don't
expect it to be a practical problem given how much work it takes to
create a new table AM that needs full WAL support, but might as well
have some transparency on how to choose a new RmgrId.

I also updated the patch to point to the wiki page in the comments, and
added in a new RM_EXPERIMENTAL_ID that can be used while an extension
is still in development. Hopefully this will prevent people reserving
lots of RmgrIds for extensions that never get released.

Regards,
Jeff Davis

Attachments:

extensible-rmgr-pg14-v2.difftext/x-patch; charset=UTF-8; name=extensible-rmgr-pg14-v2.diffDownload+186-127
#3Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#1)
Re: Extensible Rmgr for Table AMs

On Mon, Nov 8, 2021 at 6:36 PM Jeff Davis <pgsql@j-davis.com> wrote:

Extensible rmgr would enable the table AM to support its own
redo/decode hooks and WAL format, so that it could support crash
recovery, physical replication, and logical replication.

Without taking a position on your implementation, which I have not
studied, I like this idea in concept and I think it's an important
goal.

Are there any other major arguments/objections that I missed?

ISTR some discussion of the fact that uninstalling the extension that
uses this facility, or failing to install it on your standby, will
lead to an unusable database. Personally, I don't see that as a big
problem: we should just document that if you choose to use an
extension like this, then (1) it needs to be installed on all standbys
and (2) if you ever want to get rid of it, you need to stop using it,
drop all the objects created with it, and then wait until all the WAL
previously generated by that extension is gone not only from pg_wal
but from any archived WAL files or backups that you intend to use with
that cluster before you actually nuke it. Users who don't want to
abide by those restrictions need not install such extensions. Users
who don't read the documentation might end up sad, but it doesn't seem
particularly likely, and it's hardly the only part of the
documentation that users shouldn't ignore.

--
Robert Haas
EDB: http://www.enterprisedb.com

#4Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#1)
Re: Extensible Rmgr for Table AMs

On Mon, 2021-11-08 at 15:36 -0800, Jeff Davis wrote:

The attached patch (against v14, so it's easier to test columnar) is
somewhat like a simplified version of [3] combined with refactoring
to
make decoding a part of the rmgr.

New patches attached (v3). Essentially the content as v2, but split
into 3 patches and rebased.

Review on patch 0001 would be helpful. It makes decoding just another
method of rmgr, which makes a lot of sense to me from a code
organization standpoint regardless of the other patches. Is there any
reason not to do that?

The other patches then make rmgr extensible, which in turn makes
decoding extensible and solves the logical replication problem for
custom table AMs. The most obvious way to make rmgr extensible would be
to expand the rmgr table, but I was concerned about making that dynamic
(right now the structure is entirely constant and perhaps that's
important for some optimizations?). So, at the cost of complexity I
made a separate, dynamic rmgr table to hold the custom entries.

The custom rmgr API would probably be marked "experimental" for a
while, and I don't expect lots of people to use it given the challenges
of a production-quality table AM. But it's still important, because
without it a table AM has no chance to participate in logical
replication.

Regards,
Jeff Davis

Attachments:

0001-Make-logical-decoding-a-part-of-the-rmgr.patchtext/x-patch; charset=UTF-8; name=0001-Make-logical-decoding-a-part-of-the-rmgr.patchDownload+69-113
0002-Add-macro-GetRmgr-in-preparation-for-extensible-rmgr.patchtext/x-patch; charset=UTF-8; name=0002-Add-macro-GetRmgr-in-preparation-for-extensible-rmgr.patchDownload+19-18
0003-Custom-Rmgr.patchtext/x-patch; charset=UTF-8; name=0003-Custom-Rmgr.patchDownload+101-2
#5Julien Rouhaud
rjuju123@gmail.com
In reply to: Jeff Davis (#4)
Re: Extensible Rmgr for Table AMs

Hi,

On Mon, Jan 17, 2022 at 12:42:06AM -0800, Jeff Davis wrote:

On Mon, 2021-11-08 at 15:36 -0800, Jeff Davis wrote:

The attached patch (against v14, so it's easier to test columnar) is
somewhat like a simplified version of [3] combined with refactoring
to
make decoding a part of the rmgr.

New patches attached (v3). Essentially the content as v2, but split
into 3 patches and rebased.

Review on patch 0001 would be helpful. It makes decoding just another
method of rmgr, which makes a lot of sense to me from a code
organization standpoint regardless of the other patches. Is there any
reason not to do that?

I think that it's a clean and sensible approach, so +1 for me.

There's a bit of 0003 present in 002:

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..612b1b3723 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -738,7 +738,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                              (uint32) SizeOfXLogRecord, record->xl_tot_len);
        return false;
    }
-   if (record->xl_rmid > RM_MAX_ID)
+   if (record->xl_rmid > RM_MAX_ID && record->xl_rmid < RM_CUSTOM_MIN_ID)
    {

The other patches then make rmgr extensible, which in turn makes
decoding extensible and solves the logical replication problem for
custom table AMs. The most obvious way to make rmgr extensible would be
to expand the rmgr table, but I was concerned about making that dynamic
(right now the structure is entirely constant and perhaps that's
important for some optimizations?). So, at the cost of complexity I
made a separate, dynamic rmgr table to hold the custom entries.

I'm not sure about performance overhead, but allowing extension to access the
main table directly seems a bit scary. We definitely accepted some overhead
with the various extensible support, and this new GetRmgr() doesn't look like
it's going to cost a lot for the builtin case, especially compared to the cost
of any of the code associated with the rmgr.

Also, to answer your initial email I think it's a better way to go compared to
your previous approach, given the limitations and performance of the generic
xlog infrastructure, and hopefully index AMs should be able to rely on this
too.

A few random comments on 0003:

 #define RM_MAX_ID              (RM_NEXT_ID - 1)
+#define RM_CUSTOM_MIN_ID       128
+#define RM_CUSTOM_MAX_ID       UINT8_MAX

It would be a good idea to add a StaticAssertStmt here to make sure that
there's no overlap in the ranges.

+
+/*
+ * RmgrId to use for extensions that require an RmgrId, but are still in
+ * development and have not reserved their own unique RmgrId yet. See:
+ * https://wiki.postgresql.org/wiki/ExtensibleRmgr
+ */
+#define RM_EXPERIMENTAL_ID     128

I'm a bit dubious about the whole "register your ID in the Wiki" approach. It
might not be a problem since there probably won't be hundreds of users, and I
don't have any better suggestion since it has to be consistent across nodes.

+   elog(LOG, "registering customer rmgr \"%s\" with ID %d",
+        rmgr->rm_name, rmid);

Should it be a DEBUG message instead? Also s/customer/custom/

+RmgrData
+GetCustomRmgr(RmgrId rmid)
+{
+   if (rmid < RM_CUSTOM_MIN_ID || rmid > RM_CUSTOM_MAX_ID)
+       elog(PANIC, "custom rmgr id %d out of range", rmid);

Should this be an assert?

+#define GetBuiltinRmgr(rmid) RmgrTable[(rmid)]
+#define GetRmgr(rmid) ((rmid < RM_CUSTOM_MIN_ID) ? \
+                      GetBuiltinRmgr(rmid) : GetCustomRmgr(rmid))

rmid should be protected in the macro.

The custom rmgr API would probably be marked "experimental" for a
while, and I don't expect lots of people to use it given the challenges
of a production-quality table AM. But it's still important, because
without it a table AM has no chance to participate in logical
replication.

How you plan to mark it experimental?

#6Jeff Davis
pgsql@j-davis.com
In reply to: Julien Rouhaud (#5)
Re: Extensible Rmgr for Table AMs

On Tue, 2022-01-18 at 17:53 +0800, Julien Rouhaud wrote:

Review on patch 0001 would be helpful. It makes decoding just
another
method of rmgr, which makes a lot of sense to me from a code
organization standpoint regardless of the other patches. Is there
any
reason not to do that?

I think that it's a clean and sensible approach, so +1 for me.

Thank you, committed 0001. Other patches not committed yet.

There's a bit of 0003 present in 002:

I just squashed 0002 and 0003 together. Not large enough to keep
separate.

A few random comments on 0003:

#define RM_MAX_ID              (RM_NEXT_ID - 1)
+#define RM_CUSTOM_MIN_ID       128
+#define RM_CUSTOM_MAX_ID       UINT8_MAX

It would be a good idea to add a StaticAssertStmt here to make sure
that
there's no overlap in the ranges.

Done.

+
+/*
+ * RmgrId to use for extensions that require an RmgrId, but are
still in
+ * development and have not reserved their own unique RmgrId yet.
See:
+ * https://wiki.postgresql.org/wiki/ExtensibleRmgr
+ */
+#define RM_EXPERIMENTAL_ID     128

I'm a bit dubious about the whole "register your ID in the Wiki"
approach. It
might not be a problem since there probably won't be hundreds of
users, and I
don't have any better suggestion since it has to be consistent across
nodes.

Agree, but I don't see a better approach, either.

I do some sanity checking, which should catch collisions when they
happen.

+   elog(LOG, "registering customer rmgr \"%s\" with ID %d",
+        rmgr->rm_name, rmid);

Should it be a DEBUG message instead? Also s/customer/custom/

It seems like a fairly important thing to have in the log. Only a
couple extensions will ever encounter this message, and only at server
start.

Typo fixed.

+RmgrData
+GetCustomRmgr(RmgrId rmid)
+{
+   if (rmid < RM_CUSTOM_MIN_ID || rmid > RM_CUSTOM_MAX_ID)
+       elog(PANIC, "custom rmgr id %d out of range", rmid);

Should this be an assert?

If we make it an Assert, then it won't be caught in production builds.

+#define GetBuiltinRmgr(rmid) RmgrTable[(rmid)]
+#define GetRmgr(rmid) ((rmid < RM_CUSTOM_MIN_ID) ? \
+                      GetBuiltinRmgr(rmid) : GetCustomRmgr(rmid))

rmid should be protected in the macro.

Done.

How you plan to mark it experimental?

I suppose it doesn't need to be marked explicitly -- there are other
APIs that change. For instance, the ProcessUtility_hook changed, and
that's used much more widely.

As long as we generally agree that some kind of custom rmgrs are the
way to go, if we change the API or implementation around from version
to version I can easily update my table access method.

Regards,
Jeff Davis

Attachments:

v4-0001-Extensible-rmgr.patchtext/x-patch; charset=UTF-8; name=v4-0001-Extensible-rmgr.patchDownload+129-18
#7Julien Rouhaud
rjuju123@gmail.com
In reply to: Jeff Davis (#6)
Re: Extensible Rmgr for Table AMs

Hi,

On Mon, Jan 31, 2022 at 06:36:59PM -0800, Jeff Davis wrote:

On Tue, 2022-01-18 at 17:53 +0800, Julien Rouhaud wrote:

There's a bit of 0003 present in 002:

I just squashed 0002 and 0003 together. Not large enough to keep
separate.

Agreed.

+   elog(LOG, "registering customer rmgr \"%s\" with ID %d",
+        rmgr->rm_name, rmid);

Should it be a DEBUG message instead? Also s/customer/custom/

It seems like a fairly important thing to have in the log. Only a
couple extensions will ever encounter this message, and only at server
start.

Ok.

+RmgrData
+GetCustomRmgr(RmgrId rmid)
+{
+   if (rmid < RM_CUSTOM_MIN_ID || rmid > RM_CUSTOM_MAX_ID)
+       elog(PANIC, "custom rmgr id %d out of range", rmid);

Should this be an assert?

If we make it an Assert, then it won't be caught in production builds.

Sure, but it seems something so fundamental that it should get done right
during the development phase rather than spending cycles in optimized builds to
check for it.
But even if that happened it would get caught by the final "not found" PANIC
anyway, and for end users I don't think that getting this error instead would
make much difference.

How you plan to mark it experimental?

I suppose it doesn't need to be marked explicitly -- there are other
APIs that change. For instance, the ProcessUtility_hook changed, and
that's used much more widely.
As long as we generally agree that some kind of custom rmgrs are the
way to go, if we change the API or implementation around from version
to version I can easily update my table access method.

Agreed, API breaks happen often and that's not really a problem for third-party
extensions, especially if that comes with hard compile failure.

Other than that the patch looks good to me, as you said we just need a decision
on whether custom rmgrs are wanted or not.

#8Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#7)
Re: Extensible Rmgr for Table AMs

On Tue, Feb 01, 2022 at 12:39:38PM +0800, Julien Rouhaud wrote:

Other than that the patch looks good to me, as you said we just need a decision
on whether custom rmgrs are wanted or not.

One last thing, did you do some benchmark with a couple custom rmgr to see how
much the O(n) access is showing up in profiles?

#9Jeff Davis
pgsql@j-davis.com
In reply to: Julien Rouhaud (#8)
Re: Extensible Rmgr for Table AMs

On Tue, 2022-02-01 at 20:45 +0800, Julien Rouhaud wrote:

On Tue, Feb 01, 2022 at 12:39:38PM +0800, Julien Rouhaud wrote:

Other than that the patch looks good to me, as you said we just
need a decision
on whether custom rmgrs are wanted or not.

One last thing, did you do some benchmark with a couple custom rmgr
to see how
much the O(n) access is showing up in profiles?

What kind of a test case would be reasonable there? You mean having a
lot of custom rmgrs?

I was expecting that few people would have more than one custom rmgr
loaded anyway, so a sparse array or hashtable seemed wasteful. If
custom rmgrs become popular we probably need to have a larger ID space
anyway, but it seems like overengineering to do so now.

Regards,
Jeff Davis

#10Julien Rouhaud
rjuju123@gmail.com
In reply to: Jeff Davis (#9)
Re: Extensible Rmgr for Table AMs

Hi,

On Tue, Feb 01, 2022 at 03:38:32PM -0800, Jeff Davis wrote:

On Tue, 2022-02-01 at 20:45 +0800, Julien Rouhaud wrote:

One last thing, did you do some benchmark with a couple custom rmgr
to see how
much the O(n) access is showing up in profiles?

What kind of a test case would be reasonable there? You mean having a
lot of custom rmgrs?

I was expecting that few people would have more than one custom rmgr
loaded anyway, so a sparse array or hashtable seemed wasteful. If
custom rmgrs become popular we probably need to have a larger ID space
anyway, but it seems like overengineering to do so now.

I agree that having dozen of custom rmgrs doesn't seem likely, but I also have
no idea of how much overhead you get by not doing a direct array access. I
think it would be informative to benchmark something like simple OLTP write
workload on a fast storage (or a ramdisk, or with fsync off...), with the used
rmgr being the 1st and the 2nd custom rmgr. Both scenario still seems
plausible and shouldn't degenerate on good hardware.

#11Robert Haas
robertmhaas@gmail.com
In reply to: Julien Rouhaud (#10)
Re: Extensible Rmgr for Table AMs

On Thu, Feb 3, 2022 at 12:34 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

I agree that having dozen of custom rmgrs doesn't seem likely, but I also have
no idea of how much overhead you get by not doing a direct array access. I
think it would be informative to benchmark something like simple OLTP write
workload on a fast storage (or a ramdisk, or with fsync off...), with the used
rmgr being the 1st and the 2nd custom rmgr. Both scenario still seems
plausible and shouldn't degenerate on good hardware.

I think it would be hard to measure the overhead of this approach on a
macrobenchmark. That having been said, I find this a surprising
implementation choice. I think that the approaches that are most worth
considering are:

(1) reallocate the array if needed so that we can continue to just do
RmgrTable[rmid]
(2) have one array for builtins and a second array for extensions and
do rmid < RM_CUSTOM_MIN_ID ? BuiltinRmgrTable[rmid] :
ExtensionRmgrTable[rmid]
(3) change RmgrTable to be an array of pointers to structs rather than
an an array of structs. then the structs don't move around and can be
const, but the pointers can be moved into a larger array if required

I'm not really sure which is best. My intuition for what will be
cheapest on modern hardware is pretty shaky. However, I can't see how
it can be the thing the patch is doing now; a linear search seems like
it has to be the slowest option.

--
Robert Haas
EDB: http://www.enterprisedb.com

#12Julien Rouhaud
rjuju123@gmail.com
In reply to: Robert Haas (#11)
Re: Extensible Rmgr for Table AMs

Hi,

On Fri, Feb 04, 2022 at 09:10:42AM -0500, Robert Haas wrote:

On Thu, Feb 3, 2022 at 12:34 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

I agree that having dozen of custom rmgrs doesn't seem likely, but I also have
no idea of how much overhead you get by not doing a direct array access. I
think it would be informative to benchmark something like simple OLTP write
workload on a fast storage (or a ramdisk, or with fsync off...), with the used
rmgr being the 1st and the 2nd custom rmgr. Both scenario still seems
plausible and shouldn't degenerate on good hardware.

I think it would be hard to measure the overhead of this approach on a
macrobenchmark.

Yeah that's also my initial thought, but I wouldn't be terribly surprised to be
wrong.

That having been said, I find this a surprising
implementation choice. I think that the approaches that are most worth
considering are:

(1) reallocate the array if needed so that we can continue to just do
RmgrTable[rmid]
(2) have one array for builtins and a second array for extensions and
do rmid < RM_CUSTOM_MIN_ID ? BuiltinRmgrTable[rmid] :
ExtensionRmgrTable[rmid]
(3) change RmgrTable to be an array of pointers to structs rather than
an an array of structs. then the structs don't move around and can be
const, but the pointers can be moved into a larger array if required

I'm not really sure which is best. My intuition for what will be
cheapest on modern hardware is pretty shaky. However, I can't see how
it can be the thing the patch is doing now; a linear search seems like
it has to be the slowest option.

I guess the idea was to have a compromise between letting rmgr authors choose
arbitrary ids to avoid any conflicts, especially with private implementations,
without wasting too much memory. But those approaches would be pretty much
incompatible with the current definition:

+#define RM_CUSTOM_MIN_ID       128
+#define RM_CUSTOM_MAX_ID       UINT8_MAX

even if you only allocate up to the max id found, nothing guarantees that you
won't get a quite high id.

#13Robert Haas
robertmhaas@gmail.com
In reply to: Julien Rouhaud (#12)
Re: Extensible Rmgr for Table AMs

On Fri, Feb 4, 2022 at 9:48 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

I guess the idea was to have a compromise between letting rmgr authors choose
arbitrary ids to avoid any conflicts, especially with private implementations,
without wasting too much memory. But those approaches would be pretty much
incompatible with the current definition:

+#define RM_CUSTOM_MIN_ID       128
+#define RM_CUSTOM_MAX_ID       UINT8_MAX

even if you only allocate up to the max id found, nothing guarantees that you
won't get a quite high id.

Right, which I guess raises another question: if the maximum is
UINT8_MAX, which BTW I find perfectly reasonable, why are we not just
defining this as an array of size 256? There's no point in adding code
complexity to save a few kB of memory.

--
Robert Haas
EDB: http://www.enterprisedb.com

#14Julien Rouhaud
rjuju123@gmail.com
In reply to: Robert Haas (#13)
Re: Extensible Rmgr for Table AMs

On Fri, Feb 04, 2022 at 09:53:09AM -0500, Robert Haas wrote:

On Fri, Feb 4, 2022 at 9:48 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

I guess the idea was to have a compromise between letting rmgr authors choose
arbitrary ids to avoid any conflicts, especially with private implementations,
without wasting too much memory. But those approaches would be pretty much
incompatible with the current definition:

+#define RM_CUSTOM_MIN_ID       128
+#define RM_CUSTOM_MAX_ID       UINT8_MAX

even if you only allocate up to the max id found, nothing guarantees that you
won't get a quite high id.

Right, which I guess raises another question: if the maximum is
UINT8_MAX, which BTW I find perfectly reasonable, why are we not just
defining this as an array of size 256? There's no point in adding code
complexity to save a few kB of memory.

Agreed, especially if combined with your suggested approach 3 (array of
pointers).

#15Jeff Davis
pgsql@j-davis.com
In reply to: Julien Rouhaud (#14)
Re: Extensible Rmgr for Table AMs

On Fri, 2022-02-04 at 22:56 +0800, Julien Rouhaud wrote:

Right, which I guess raises another question: if the maximum is
UINT8_MAX, which BTW I find perfectly reasonable, why are we not
just
defining this as an array of size 256? There's no point in adding
code
complexity to save a few kB of memory.

Agreed, especially if combined with your suggested approach 3 (array
of
pointers).

Implemented and attached. I also updated pg_waldump and pg_rewind to do
something reasonable.

Additionally, I now have a reasonably complete implementation of a
custom resource manager now:

https://github.com/citusdata/citus/tree/custom-rmgr-15

(Not committed or intended to actually be used right now -- just a
POC.)

Offline, Andres mentioned that I should test recovery performance if we
take your approach, because making the RmgrTable non-const could impact
optimizations. Not sure if that would be a practical concern compared
to all the other work done at REDO time.

Regards,
Jeff Davis

Attachments:

0001-Extensible-rmgr.patchtext/x-patch; charset=UTF-8; name=0001-Extensible-rmgr.patchDownload+168-49
#16Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#15)
Re: Extensible Rmgr for Table AMs

Hi,

On 2022-03-23 21:43:08 -0700, Jeff Davis wrote:

/* must be kept in sync with RmgrData definition in xlog_internal.h */
#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
-	{ name, redo, desc, identify, startup, cleanup, mask, decode },
+	&(struct RmgrData){ name, redo, desc, identify, startup, cleanup, mask, decode },
-const RmgrData RmgrTable[RM_MAX_ID + 1] = {
+static RmgrData *RmgrTable[RM_MAX_ID + 1] = {
#include "access/rmgrlist.h"
};

I think this has been discussed before, but to me it's not obvious that it's a
good idea to change RmgrTable from RmgrData to RmgrData *. That adds an
indirection, without obvious benefit.

+
+/*
+ * Start up all resource managers.
+ */
+void
+StartupResourceManagers()

(void)

+void
+RegisterCustomRmgr(RmgrId rmid, RmgrData *rmgr)
+{
+	if (rmid < RM_MIN_CUSTOM_ID)
+		ereport(PANIC, errmsg("custom rmgr id %d is out of range", rmid));
+
+	if (!process_shared_preload_libraries_in_progress)
+		ereport(ERROR,
+				(errmsg("custom rmgr must be registered while initializing modules in shared_preload_libraries")));
+
+	ereport(LOG,
+			(errmsg("registering custom rmgr \"%s\" with ID %d",
+					rmgr->rm_name, rmid)));
+
+	if (RmgrTable[rmid] != NULL)
+		ereport(PANIC,
+				(errmsg("custom rmgr ID %d already registered with name \"%s\"",
+						rmid, RmgrTable[rmid]->rm_name)));
+
+	/* check for existing rmgr with the same name */
+	for (int i = 0; i <= RM_MAX_ID; i++)
+	{
+		const RmgrData *existing_rmgr = RmgrTable[i];
+
+		if (existing_rmgr == NULL)
+			continue;
+
+		if (!strcmp(existing_rmgr->rm_name, rmgr->rm_name))
+			ereport(PANIC,
+					(errmsg("custom rmgr \"%s\" has the same name as builtin rmgr",
+							existing_rmgr->rm_name)));
+	}
+
+	/* register it */
+	RmgrTable[rmid] = rmgr;
+}

Random idea: Might be worth emitting the id->name mapping just after a redo
location is determined, to make it easier to debug things.

+RmgrData *
+GetRmgr(RmgrId rmid)
+{
+	return RmgrTable[rmid];
+}

Given this is so simple, why incur the cost of a function call? Rather than
continuing to expose RmgrTable and move GetRmgr() into the header, as a static
inline? As-is this also prevent the compiler from optimizing across repeated
GetRmgr() calls (which often won't be possible anyway, but still)..

-	if (record->xl_rmid > RM_MAX_ID)
+	if (record->xl_rmid > RM_MAX_BUILTIN_ID && record->xl_rmid < RM_MIN_CUSTOM_ID)
{
report_invalid_record(state,
"invalid resource manager ID %u at %X/%X",

Shouldn't this continue to enforce RM_MAX_ID as well?

@@ -1604,12 +1603,7 @@ PerformWalRecovery(void)

InRedo = true;

-		/* Initialize resource managers */
-		for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
-		{
-			if (RmgrTable[rmid].rm_startup != NULL)
-				RmgrTable[rmid].rm_startup();
-		}
+		StartupResourceManagers();

Personally I'd rather name it ResourceManagersStartup() or RmgrStartup().

@@ -1871,7 +1860,7 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
xlogrecovery_redo(xlogreader, *replayTLI);

/* Now apply the WAL record itself */
-	RmgrTable[record->xl_rmid].rm_redo(xlogreader);
+	GetRmgr(record->xl_rmid)->rm_redo(xlogreader);

/*
* After redo, check whether the backup pages associated with the WAL

So we have just added one indirect call and one pointer indirection
(previously RmgrTable could be resolved by the linker, now it needs to be
dereferenced again), that's not too bad. I was afraid there'd be multiple
calls.

@@ -2101,16 +2090,16 @@ xlog_outdesc(StringInfo buf, XLogReaderState *record)
uint8 info = XLogRecGetInfo(record);
const char *id;

-	appendStringInfoString(buf, RmgrTable[rmid].rm_name);
+	appendStringInfoString(buf, GetRmgr(rmid)->rm_name);
appendStringInfoChar(buf, '/');
-	id = RmgrTable[rmid].rm_identify(info);
+	id = GetRmgr(rmid)->rm_identify(info);
if (id == NULL)
appendStringInfo(buf, "UNKNOWN (%X): ", info & ~XLR_INFO_MASK);
else
appendStringInfo(buf, "%s: ", id);
-	RmgrTable[rmid].rm_desc(buf, record);
+	GetRmgr(rmid)->rm_desc(buf, record);
}

Like here. It's obviously not as performance critical as replay. But it's
still a shame to add 3 calls to GetRmgr, that each then need to dereference
RmgrTable. The compiler won't be able to optimize any of that away.

@@ -117,8 +117,8 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor

rmid = XLogRecGetRmid(record);

-	if (RmgrTable[rmid].rm_decode != NULL)
-		RmgrTable[rmid].rm_decode(ctx, &buf);
+	if (GetRmgr(rmid)->rm_decode != NULL)
+		GetRmgr(rmid)->rm_decode(ctx, &buf);
else
{
/* just deal with xid, and done */

This one might actually matter, overhead wise.

@@ -473,7 +473,7 @@ static void
XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
{
const char *id;
-	const RmgrDescData *desc = &RmgrDescTable[XLogRecGetRmid(record)];
+	const RmgrDescData *desc = GetRmgrDesc(XLogRecGetRmid(record));
uint32		rec_len;
uint32		fpi_len;
RelFileNode rnode;
@@ -658,7 +658,7 @@ XLogDumpDisplayStats(XLogDumpConfig *config, XLogDumpStats *stats)
* calculate column totals.
*/

- for (ri = 0; ri < RM_NEXT_ID; ri++)
+ for (ri = 0; ri < RM_MAX_ID; ri++)
{
total_count += stats->rmgr_stats[ri].count;
total_rec_len += stats->rmgr_stats[ri].rec_len;

@@ -694,6 +694,9 @@ XLogDumpDisplayStats(XLogDumpConfig *config, XLogDumpStats *stats)
fpi_len = stats->rmgr_stats[ri].fpi_len;
tot_len = rec_len + fpi_len;

+			if (ri > RM_MAX_BUILTIN_ID && count == 0)
+				continue;
+

Ah, I see. I had written a concerned comment about the previous hunk...

XLogDumpStatsRow(desc->rm_name,
count, total_count, rec_len, total_rec_len,
fpi_len, total_fpi_len, tot_len, total_len);
@@ -913,16 +916,16 @@ main(int argc, char **argv)
exit(EXIT_SUCCESS);
}

-					for (i = 0; i <= RM_MAX_ID; i++)
+					for (i = 0; i <= RM_MAX_BUILTIN_ID; i++)
{
-						if (pg_strcasecmp(optarg, RmgrDescTable[i].rm_name) == 0)
+						if (pg_strcasecmp(optarg, GetRmgrDesc(i)->rm_name) == 0)
{
config.filter_by_rmgr[i] = true;
config.filter_by_rmgr_enabled = true;
break;
}
}
-					if (i > RM_MAX_ID)
+					if (i > RM_MAX_BUILTIN_ID)
{
pg_log_error("resource manager \"%s\" does not exist",
optarg);

So we can't filter by rmgr id for non-builtin rmgr's? That sucks. Maybe add a
custom:<i> syntax? Or generally allow numerical identifiers in addition to the
names?

Greetings,

Andres Freund

#17Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#16)
Re: Extensible Rmgr for Table AMs

On Mon, 2022-03-28 at 11:00 -0700, Andres Freund wrote:

I think this has been discussed before, but to me it's not obvious
that it's a
good idea to change RmgrTable from RmgrData to RmgrData *. That adds
an
indirection, without obvious benefit.

I did some performance tests. I created a narrow table, took a base
backup, loaded 100M records, finished the base backup. Then I recovered
using the different build combinations (patched/unpatched, clang/gcc).

compiler run1 run2
unpatched: gcc 102s 106s
patched: gcc 107s 105s
unpatched: clang 109s 110s
patched: clang 109s 111s

I don't see a clear signal that this patch worsens performance. The
102s run was the very first run, so I suspect it was just due to the
processor starting out cold. Let me know if you think the test is
testing the right paths.

Perhaps I should address your other perf concerns around GetRmgr (make
it static inline, reduce number of calls), and then leave the
indirection for the sake of cleanliness?

If you are still concerned, I can switch back to separate tables to
eliminate the indirection for built-in rmgrs. Separate rmgr tables
still require a branch (to decide which table to access), but it should
be a highly predictable one.

Regards,
Jeff Davis

#18Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#16)
Re: Extensible Rmgr for Table AMs

On Mon, 2022-03-28 at 11:00 -0700, Andres Freund wrote:

+/*
+ * Start up all resource managers.
+ */
+void
+StartupResourceManagers()

(void)

Fixed.

Random idea: Might be worth emitting the id->name mapping just after
a redo
location is determined, to make it easier to debug things.

Not quite sure I understood this idea, do you mean dump all rmgrs and
the IDs when performing recovery?

+RmgrData *
+GetRmgr(RmgrId rmid)
+{
+	return RmgrTable[rmid];
+}

Given this is so simple, why incur the cost of a function call?
Rather than
continuing to expose RmgrTable and move GetRmgr() into the header, as
a static
inline?

Made it static inline.

Shouldn't this continue to enforce RM_MAX_ID as well?

Done.

Personally I'd rather name it ResourceManagersStartup() or
RmgrStartup().

Done.

Like here. It's obviously not as performance critical as replay. But
it's
still a shame to add 3 calls to GetRmgr, that each then need to
dereference
RmgrTable. The compiler won't be able to optimize any of that away.

Changed to only call it once and save it in a variable for each call
site where it makes sense.

So we can't filter by rmgr id for non-builtin rmgr's? That sucks.
Maybe add a
custom:<i> syntax? Or generally allow numerical identifiers in
addition to the
names?

Good idea. I changed it to allow "custom###" to mean the custom rmgr
with ID ### (3 digits).

I still may change it to go back to two RmgrTables (one for builtin and
one for custom) to remove the lingering performance doubts. Other than
that, and some cleanup, this is pretty close to the version I intend to
commit.

Regards,
Jeff Davis

Attachments:

v6-0001-Extensible-rmgr.patchtext/x-patch; charset=UTF-8; name=v6-0001-Extensible-rmgr.patchDownload+239-59
#19Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#17)
Re: Extensible Rmgr for Table AMs

Hi,

On 2022-03-31 14:20:51 -0700, Jeff Davis wrote:

If you are still concerned, I can switch back to separate tables to
eliminate the indirection for built-in rmgrs. Separate rmgr tables
still require a branch (to decide which table to access), but it should
be a highly predictable one.

I still think the easiest and fastest would be to just make RmgrTable longer,
and not const. When registering, copy the caller provided struct into the
respective RmgrData element. Yes, we'd waste a bit of space at the end of the
array, but it's typically not going to be touched and thus not be backed by
"actual" memory.

Greetings,

Andres Freund

#20Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Jeff Davis (#18)
Re: Extensible Rmgr for Table AMs

On Mon, Apr 4, 2022 at 8:33 AM Jeff Davis <pgsql@j-davis.com> wrote:

I still may change it to go back to two RmgrTables (one for builtin and
one for custom) to remove the lingering performance doubts. Other than
that, and some cleanup, this is pretty close to the version I intend to
commit.

I just had a quick look over the v6 patch, few comments:

1) Why can't rmid be chosen by the extensions in sequential order from
(129 till 255), say, to start with a columnar extension can choose
129, another extension can register 130 and so on right? This way, the
RmgrStartup, RmgrCleanup, and XLogDumpDisplayRecord can avoid looping
over all the 256 entries, with a global variable like
current_max_rmid(representing the current number of rmgers)? Is there
any specific reason to allow them to choose rmgrid randomly between
129 till 255? Am I missing some discussion here?
2) RM_MAX_ID is now UINT8_MAX - do we need to make it configurable? or
do we think that only a few (127) custom rmgrs can exist?
3) Do we need to talk about extensible rmgrs and
https://wiki.postgresql.org/wiki/ExtensibleRmgr in documentation
somewhere? This will help extension developers to refer when needed.
4) Do we need to change this description?
       <para>
        The default value of this setting is the empty string, which disables
        the feature.  It can be set to <literal>all</literal> to check all
        records, or to a comma-separated list of resource managers to check
        only records originating from those resource managers.  Currently,
        the supported resource managers are <literal>heap</literal>,
        <literal>heap2</literal>, <literal>btree</literal>,
<literal>hash</literal>,
        <literal>gin</literal>, <literal>gist</literal>,
<literal>sequence</literal>,
        <literal>spgist</literal>, <literal>brin</literal>, and
<literal>generic</literal>. Onl
        superusers can change this setting.
5) Do we need to add a PGDLLIMPORT qualifier to RegisterCustomRmgr()
(possibly for RmgrStartup, RmgrTable, RmgrCleanup as well?)  so that
the Windows versions of extensions can use this feature?
6) Should the below strcmp pg_strcmp? The custom rmgrs can still use
rm_name with different cases and below code fail to catch it?
+ if (!strcmp(existing_rmgr->rm_name, rmgr->rm_name))
+ ereport(PANIC,
+ (errmsg("custom rmgr \"%s\" has the same name as builtin rmgr",
+ existing_rmgr->rm_name)));
7) What's the intention of the below code? It seems like below it
checks if there's already a rmgr with the given name (RM_MAX_ID). Had
it been RM_MAX_BUILTIN_ID instead of  RM_MAX_ID, the error message
does make sense. Is the intention here to not have duplicate rmgrs in
the entire RM_MAX_ID(both builtin and custom)?
+ /* check for existing rmgr with the same name */
+ for (int i = 0; i <= RM_MAX_ID; i++)
+ {
+ const RmgrData *existing_rmgr = RmgrTable[i];
+
+ if (existing_rmgr == NULL)
+ continue;
+
+ if (!strcmp(existing_rmgr->rm_name, rmgr->rm_name))
+ ereport(PANIC,
+ (errmsg("custom rmgr \"%s\" has the same name as builtin rmgr",
+ existing_rmgr->rm_name)));
8) Per https://wiki.postgresql.org/wiki/ExtensibleRmgr, 128 is
reserved, can we have it as a macro? Or something like (RM_MAX_ID/2+1)
+#define RM_MIN_CUSTOM_ID 128
9) Thinking if there's a way to test the code, maybe exposing
RegisterCustomRmgr as a function? I think we need to have a dummy
extension that will be used for testing this kind of patches/features
but that's a separate discussion IMO.

Regards,
Bharath Rupireddy.

#21Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#19)
#22Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#20)
#23Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Jeff Davis (#22)
#24Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#23)
#25Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#24)
#26Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#25)
#27Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#26)
#28Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#27)
#29Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Jeff Davis (#24)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#24)