BUG #16644: null value for defaults in OLD variable for trigger
The following bug has been logged on the website:
Bug reference: 16644
Logged by: Fedor Erastov
Email address: fedor_erastov@mail.ru
PostgreSQL version: 13.0
Operating system: CentOS, MacOS
Description:
Start history:
https://postgresteam.slack.com/archives/C0FS3UTAP/p1601206489174900
Found weird postgres behavior (seems to work for >11 versions):
1. There is a table with data, and trigger before update for each row
2. Add a new column with not null default value
3. When trying to update the value in the old column, raise `ERROR: null
value in column violates not-null constraint`
Most likely this is because the default values in >11 versions are not
really put into the table when adding a column. And an important feature is
that if the trigger returns NEW, then there are no problems, and if OLD,
then an error appears. Although if you check these two variables, they will
be absolutely equal.
Full PoC:
create table test(a integer);
create or replace function set_updated_at_column() returns trigger
language plpgsql
as
$$
BEGIN
RAISE NOTICE 'OLD: %, NEW: %, COMPARE: %', OLD, NEW, OLD = NEW;
RETURN OLD;
END;
$$;
create trigger update_test
before update
on test
for each row
execute procedure set_updated_at_column();
insert into test values(1);
-- adds new column
alter table test add column b integer not null default 1;
-- fails with a not null constraint violation, which is not the case, since
the tuple is (1,1) not (1,null)
update test set a=1 where a=1;
Interesting observation: if you reassign the value of old.b old.b := old.b;
the error is gone.
With the help of the slack user @easteregg, it turned out to be possible to
find the first bad commit in which this error occurs, that would be:
https://github.com/postgres/postgres/commit/ff11e7f4b9ae017585c3ba146db7ba39c31f209a
In addition, I have a suspicion that it has something to do with work "lazy"
defaults
https://dataegret.com/2018/03/waiting-for-postgresql-11-pain-free-add-column-with-non-null-defaults/
PG Bug reporting form <noreply@postgresql.org> writes:
Found weird postgres behavior (seems to work for >11 versions):
1. There is a table with data, and trigger before update for each row
2. Add a new column with not null default value
3. When trying to update the value in the old column, raise `ERROR: null
value in column violates not-null constraint`
I confirm this bug in v12 and up. There's no visible bug in v11,
so the attrmissing feature (which was added in 11) is not solely
to blame. It's a component of the problem, though.
After tracing through it, what I find is that:
1. Where ExecBRUpdateTriggers fetches the "trigtuple" (line 2695
of trigger.c, in HEAD) what it gets is a raw copy of the old
on-disk tuple, with natts = 1. That gets passed to the trigger
function as OLD, but it reads correctly because it's being decoded
with the relation's tupdesc, which has the needed info to fill in
attrmissing columns.
2. When the trigger does "return OLD", trigtuple is what gets
passed back, and it gets jammed into the slot that ExecUpdate
passed to ExecBRUpdateTriggers. *That slot does not have any
attrmissing info*. Thus, subsequent examination of the slot,
in particular by ExecConstraints, concludes that the recently-
added column is NULL.
I've not tried to find where is the difference between 11 and
12 that makes it fail or not fail. It's likely some innocent
seeming aspect of the slot management hacking that Andres
was doing around that time. It's a bit odd though, because
AFAICS the slot in question is going to be the result slot
of the JunkFilter used by ExecModifyTable, and I'd rather have
expected that that slot would never have had any constraints,
in any version.
If you ask me, the proximate cause of this bug was the decision
to stick attrmissing info into the "constr" field of tupledescs.
That seems pretty damfool, because there are large parts of the
system that consider that the constr info is optional and can
be discarded, or never built to begin with. Perhaps it's too late
to revisit that, but we need to take a very very hard look at an
awful lot of places if we're to stick with that design.
I think we can band-aid this immediate problem by forcing
trigger.c to materialize the "old" tuples it fetches from disk
(or whatever needs to be done to substitute missing values into
them). I've got about zero faith that there aren't dozens of
similar bugs lurking, however.
regards, tom lane
On 9/29/20 10:21 PM, Tom Lane wrote:
If you ask me, the proximate cause of this bug was the decision
to stick attrmissing info into the "constr" field of tupledescs.
That seems pretty damfool, because there are large parts of the
system that consider that the constr info is optional and can
be discarded, or never built to begin with. Perhaps it's too late
to revisit that, but we need to take a very very hard look at an
awful lot of places if we're to stick with that design.
:-(
At nearly three years remove my memory is a bit hazy, but IIRC I looked
at all the places (or all I could identify) where it might have mattered
and it looked OK. But I'm not going to bet my life I caught everything,
and maybe I unintentionally provided for a future footgun, which from
your description this appears to be.
I'm not sure what the implications of changing it would be.
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Sep 30, 2020 at 11:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
After tracing through it, what I find is that:
1. Where ExecBRUpdateTriggers fetches the "trigtuple" (line 2695
of trigger.c, in HEAD) what it gets is a raw copy of the old
on-disk tuple, with natts = 1. That gets passed to the trigger
function as OLD, but it reads correctly because it's being decoded
with the relation's tupdesc, which has the needed info to fill in
attrmissing columns.2. When the trigger does "return OLD", trigtuple is what gets
passed back, and it gets jammed into the slot that ExecUpdate
passed to ExecBRUpdateTriggers. *That slot does not have any
attrmissing info*. Thus, subsequent examination of the slot,
in particular by ExecConstraints, concludes that the recently-
added column is NULL.I've not tried to find where is the difference between 11 and
12 that makes it fail or not fail. It's likely some innocent
seeming aspect of the slot management hacking that Andres
was doing around that time. It's a bit odd though, because
AFAICS the slot in question is going to be the result slot
of the JunkFilter used by ExecModifyTable, and I'd rather have
expected that that slot would never have had any constraints,
in any version.
In v11, GetTupleForTrigger() expands the HeapTuple, with this:
/*
* While this is not necessary anymore after 297d627e, as a defense
* against C code that has not recompiled for minor releases after the
* fix, continue to expand the tuple.
*/
if (HeapTupleHeaderGetNatts(tuple.t_data) < relation->rd_att->natts)
result = heap_expand_tuple(&tuple, relation->rd_att);
else
result = heap_copytuple(&tuple);
ReleaseBuffer(buffer);
So the tuple that gets slammed into JunkFilter's result slot is an
expanded copy of the on-disk tuple, so it doesn't matter that its
TupleDesc doesn't have constraints (constr) initialized.
In v12 and further, I see that the tuple fetched by
GetTupleForTrigger() is slammed into JunkFilter's result slot
unexpanded and it remains so through ExecConstraint().
I think we can band-aid this immediate problem by forcing
trigger.c to materialize the "old" tuples it fetches from disk
(or whatever needs to be done to substitute missing values into
them).
Maybe something like the attached?
--
Amit Langote
EDB: http://www.enterprisedb.com
Attachments:
expand-trigger-tuple.patchapplication/octet-stream; name=expand-trigger-tuple.patchDownload+3-0
Amit Langote <amitlangote09@gmail.com> writes:
On Wed, Sep 30, 2020 at 11:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I've not tried to find where is the difference between 11 and
12 that makes it fail or not fail.
In v11, GetTupleForTrigger() expands the HeapTuple, with this:
/*
* While this is not necessary anymore after 297d627e, as a defense
* against C code that has not recompiled for minor releases after the
* fix, continue to expand the tuple.
*/
if (HeapTupleHeaderGetNatts(tuple.t_data) < relation->rd_att->natts)
result = heap_expand_tuple(&tuple, relation->rd_att);
else
result = heap_copytuple(&tuple);
ReleaseBuffer(buffer);
Ah, good sleuthing.
I think we can band-aid this immediate problem by forcing
trigger.c to materialize the "old" tuples it fetches from disk
(or whatever needs to be done to substitute missing values into
them).
Maybe something like the attached?
Probably needs some attention to memory management (e.g.,
should_free_trig) but I'm okay with doing this as a short-term
fix. I remain worried about similar instances elsewhere, but
I have no good ideas about how to find them.
regards, tom lane
I wrote:
Amit Langote <amitlangote09@gmail.com> writes:
Maybe something like the attached?
Probably needs some attention to memory management (e.g.,
should_free_trig) but I'm okay with doing this as a short-term
fix.
I fixed the should_free business, and spent a fair amount of time
convincing myself that no other code paths in trigger.c need this,
and pushed it.
regards, tom lane
I wrote:
I fixed the should_free business, and spent a fair amount of time
convincing myself that no other code paths in trigger.c need this,
and pushed it.
No sooner had I pushed that than I thought of a potentially better
answer: why is it that the executor's slot hasn't got the right
descriptor, anyway? The reason is that ExecInitModifyTable is relying
on ExecInitJunkFilter, and thence ExecCleanTypeFromTL, to build that
descriptor. But we have the relation's *actual* descriptor right at
hand, and could use that instead. This saves a few cycles ---
ExecCleanTypeFromTL isn't enormously expensive, but it's not free
either. Moreover, it's more correct, even disregarding the problem
at hand, because the tlist isn't a perfectly accurate depiction of
the relation rowtype: ExecCleanTypeFromTL will not derive the correct
info for dropped columns.
We do have to refactor ExecInitJunkFilter a little to make this
possible, but it's not a big change. (I initially tried to use the
existing ExecInitJunkFilterConversion function, but that does the
wrong thing for attisdropped columns.) Otherwise, this reverts the
prior patch's code changes in triggers.c, but keeps the test case.
Thoughts? I'm inclined to leave the previous patch alone in the
back branches, because that has fewer potential side-effects,
but I like this better for HEAD.
regards, tom lane
Attachments:
use-the-right-descriptor-during-UPDATE-1.patchtext/x-diff; charset=us-ascii; name=use-the-right-descriptor-during-UPDATE-1.patchDownload+55-50
On Mon, Oct 26, 2020 at 5:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
I fixed the should_free business, and spent a fair amount of time
convincing myself that no other code paths in trigger.c need this,
and pushed it.
Thanks for that.
No sooner had I pushed that than I thought of a potentially better
answer: why is it that the executor's slot hasn't got the right
descriptor, anyway? The reason is that ExecInitModifyTable is relying
on ExecInitJunkFilter, and thence ExecCleanTypeFromTL, to build that
descriptor. But we have the relation's *actual* descriptor right at
hand, and could use that instead. This saves a few cycles ---
ExecCleanTypeFromTL isn't enormously expensive, but it's not free
either.
Agreed.
Moreover, it's more correct, even disregarding the problem
at hand, because the tlist isn't a perfectly accurate depiction of
the relation rowtype: ExecCleanTypeFromTL will not derive the correct
info for dropped columns.
Hmm, I don't understand. Isn't it the planner's job to make the
targetlist correctly account for dropped columns; what
expand_targetlist() does?
We do have to refactor ExecInitJunkFilter a little to make this
possible, but it's not a big change. (I initially tried to use the
existing ExecInitJunkFilterConversion function, but that does the
wrong thing for attisdropped columns.)
So, with the map generated by ExecInitJunkFilterConversion(), dropped
attributes of the target composite type are effectively mapped to a
NULL datum. That to me seems to be more or less the same thing as
ExecInitJunkFilterInsertion() mapping dropped attributes of the target
table to NULL datums in the source plan's targetlist. What am I
missing?
Otherwise, this reverts the
prior patch's code changes in triggers.c, but keeps the test case.Thoughts? I'm inclined to leave the previous patch alone in the
back branches, because that has fewer potential side-effects,
but I like this better for HEAD.
I do agree that this is a better fix for HEAD.
--
Amit Langote
EDB: http://www.enterprisedb.com
Amit Langote <amitlangote09@gmail.com> writes:
On Mon, Oct 26, 2020 at 5:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Moreover, it's more correct, even disregarding the problem
at hand, because the tlist isn't a perfectly accurate depiction of
the relation rowtype: ExecCleanTypeFromTL will not derive the correct
info for dropped columns.
Hmm, I don't understand. Isn't it the planner's job to make the
targetlist correctly account for dropped columns; what
expand_targetlist() does?
Yes, there are columns in the tlist to match them, but ExecCleanTypeFromTL
cannot mark those columns as "attisdropped". The column data type
likely won't be right either. The latter shouldn't matter, if the
column is being filled with a null ... but I'm a bit surprised that
we've gotten away this long with not being honest about attisdropped.
regards, tom lane
On Mon, Oct 26, 2020 at 12:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Langote <amitlangote09@gmail.com> writes:
On Mon, Oct 26, 2020 at 5:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Moreover, it's more correct, even disregarding the problem
at hand, because the tlist isn't a perfectly accurate depiction of
the relation rowtype: ExecCleanTypeFromTL will not derive the correct
info for dropped columns.Hmm, I don't understand. Isn't it the planner's job to make the
targetlist correctly account for dropped columns; what
expand_targetlist() does?Yes, there are columns in the tlist to match them, but ExecCleanTypeFromTL
cannot mark those columns as "attisdropped".
Ah, okay.
The column data type
likely won't be right either. The latter shouldn't matter, if the
column is being filled with a null ... but I'm a bit surprised that
we've gotten away this long with not being honest about attisdropped.
Yeah, I guess most places downstream of ExecModifyTable() seem to rely
on getting that information indirectly via the isnull flag of the
tuple itself.
--
Amit Langote
EDB: http://www.enterprisedb.com