pg_subscription.subslotname is wrongly marked NOT NULL
In all branches back to v10, initdb marks pg_subscription.subslotname
as NOT NULL:
# \d pg_subscription
Table "pg_catalog.pg_subscription"
Column | Type | Collation | Nullable | Default
-----------------+---------+-----------+----------+---------
oid | oid | | not null |
subdbid | oid | | not null |
subname | name | | not null |
subowner | oid | | not null |
subenabled | boolean | | not null |
subbinary | boolean | | not null |
subconninfo | text | C | not null |
subslotname | name | | not null |
subsynccommit | text | C | not null |
subpublications | text[] | C | not null |
Nonetheless, CREATE/ALTER SUBSCRIPTION blithely set it to null
when slot_name = NONE is specified.
This apparently causes few ill effects, unless somebody decides
to JIT-compile deconstruction of pg_subscription tuples. Which
is why all of Andres' JIT-enabled buildfarm animals are unhappy
with 9de77b545 --- quite unintentionally, that commit added a
test case that exposed the problem.
What would we like to do about this? Removing the NOT NULL
marking wouldn't be too hard in HEAD, but telling users to
fix it manually in the back branches seems like a mess.
On the whole it seems like changing the code to use some other
representation of slot_name = NONE, like say an empty string,
would be less of a mess.
It's also a bit annoying that we have no mechanized checks that
would catch this inconsistency. If JIT is going to be absolutely
dependent on NOT NULL markings being accurate, we can't really
have such a laissez-faire attitude to C code getting it wrong.
Thoughts?
regards, tom lane
I wrote:
In all branches back to v10, initdb marks pg_subscription.subslotname
as NOT NULL: ...
Nonetheless, CREATE/ALTER SUBSCRIPTION blithely set it to null
when slot_name = NONE is specified.
What would we like to do about this? Removing the NOT NULL
marking wouldn't be too hard in HEAD, but telling users to
fix it manually in the back branches seems like a mess.
After further thought, it seems like changing the definition that
subslotname is null for "NONE" is unworkable, because client-side
code might be depending on that. (pg_dump certainly is; we could
change that, but other code might have the same expectation.)
What I propose we do is
(1) Fix the NOT NULL marking in HEAD and v13. We could perhaps
alter it in older branches as well, but we cannot force initdb
so such a change would only affect newly-initdb'd installations.
(2) In pre-v13 branches, hack the JIT tuple deconstruction code
to be specifically aware that it can't trust attnotnull for
pg_subscription.subslotname. Yeah, it's ugly, but at least it's
not ugly going forwards.
I haven't looked to see where or how we might do (2), but I assume
it's possible.
It's also a bit annoying that we have no mechanized checks that
would catch this inconsistency. If JIT is going to be absolutely
dependent on NOT NULL markings being accurate, we can't really
have such a laissez-faire attitude to C code getting it wrong.
It seems like at least in assert-enabled builds, we'd better have
a cross-check for that. I'm not sure where's the best place.
regards, tom lane
I wrote:
(2) In pre-v13 branches, hack the JIT tuple deconstruction code
to be specifically aware that it can't trust attnotnull for
pg_subscription.subslotname. Yeah, it's ugly, but at least it's
not ugly going forwards.
Concretely, as attached for v12.
regards, tom lane
Attachments:
work-around-bogus-attnotnull-v12.patchtext/x-diff; charset=us-ascii; name=work-around-bogus-attnotnull-v12.patchDownload+27-2
I wrote:
It's also a bit annoying that we have no mechanized checks that
would catch this inconsistency. If JIT is going to be absolutely
dependent on NOT NULL markings being accurate, we can't really
have such a laissez-faire attitude to C code getting it wrong.
It seems like at least in assert-enabled builds, we'd better have
a cross-check for that. I'm not sure where's the best place.
I concluded that we should put this into CatalogTupleInsert and
CatalogTupleUpdate. The bootstrap data path already has a check
(see InsertOneNull()), and so does the executor, so we only need
to worry about tuples that're built manually by catalog manipulation
code. I think all of that goes through these functions. Hence,
as attached.
... and apparently, I should have done this task first, because
damn if it didn't immediately expose another bug of the same ilk.
pg_subscription_rel.srsublsn also needs to be marked nullable.
regards, tom lane
Attachments:
add-attnotnull-assertions.patchtext/x-diff; charset=us-ascii; name=add-attnotnull-assertions.patchDownload+45-0
I wrote:
pg_subscription_rel.srsublsn also needs to be marked nullable.
Not only is it wrongly marked attnotnull, but two of the three places
that read it are doing so unsafely (ie, as though it *were*
non-nullable). So I think we'd better fix it as attached.
regards, tom lane
Attachments:
fix-srsublsn-usage-for-nullness.patchtext/x-diff; charset=us-ascii; name=fix-srsublsn-usage-for-nullness.patchDownload+31-5
Mopping this up ... the attached patch against v12 shows the portions
of 72eab84a5 and 0fa0b487b that I'm thinking of putting into v10-v12.
The doc changes, which just clarify that subslotname and srsublsn can
be null, should be uncontroversial. The changes in pg_subscription.c
prevent it from accessing data that might not be there. 99.999% of
the time, that doesn't matter; we'd copy garbage into
SubscriptionRelState.lsn, but the callers shouldn't look at that field
in states where it's not valid. However, it's possible that the code
could access data off the end of the heap page, and at least in theory
that could lead to a SIGSEGV.
What I'm not quite sure about is whether to add the BKI_FORCE_NULL
annotations to the headers or not. There are some pros and cons:
* While Catalog.pm has had support for BKI_FORCE_NULL for quite some
time, we never used it in anger before yesterday. It's easy to
check that it works, but I wonder whether anybody has third-party
analysis tools that look at the catalog headers and would get broken
because they didn't cover this.
* If we change these markings, then our own testing in the buildfarm
etc. will not reflect the state of affairs seen in many/most actual
v10-v12 installations. The scope of code where it'd matter seems
pretty tiny, so I don't think there's a big risk, but there's more
than zero risk. (In any case, I would not push this part until all
the buildfarm JIT critters have reported happiness with 798b4faef,
as that's the one specific spot where it surely does matter.)
* On the other side of the ledger, if we don't fix these markings
we cannot back-patch the additional assertions I proposed at [1]/messages/by-id/298837.1595196283@sss.pgh.pa.us.
I'm kind of leaning to committing this as shown and back-patching
the patch at [1]/messages/by-id/298837.1595196283@sss.pgh.pa.us, but certainly a case could be made in the other
direction. Thoughts?
regards, tom lane
Attachments:
backpatchable-not-null-fixes.patchtext/x-diff; charset=us-ascii; name=backpatchable-not-null-fixes.patchDownload+35-8
I wrote:
* On the other side of the ledger, if we don't fix these markings
we cannot back-patch the additional assertions I proposed at [1].
I'm kind of leaning to committing this as shown and back-patching
the patch at [1], but certainly a case could be made in the other
direction. Thoughts?
After further thought about that I realized that the assertion patch
could be kluged in the same way as we did in llvmjit_deform.c, and
that that would really be the only safe way to do it pre-v13.
Otherwise the assertions would trip in pre-existing databases,
which would not be nice.
So what I've done is to back-patch the assertions that way, and
*not* apply BKI_FORCE_NULL in the back branches. The possible
downsides of doing that seem to outweigh the upside of making
the catalog state cleaner in new installations.
regards, tom lane