BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

Started by PG Bug reporting form · about 3 years ago · 37 messages · bugs
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 17798
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 15.2
Operating system: Ubuntu 22.04
Description:

When executing the following queries under valgrind:
cat << "EOF" | psql
CREATE TABLE bruttest(cnt int DEFAULT 0, t text);
CREATE FUNCTION noop_tfunc() RETURNS trigger LANGUAGE plpgsql AS $$ BEGIN
RETURN NEW; END$$;
CREATE TRIGGER brupdate_trigger BEFORE UPDATE ON bruttest FOR EACH ROW
EXECUTE FUNCTION noop_tfunc();
INSERT INTO bruttest(t) VALUES (repeat('x', 1000));
EOF

for ((i=1;i<=4;i++)); do
echo "iteration $i"
psql -c "UPDATE bruttest SET cnt = cnt + 1" &
psql -c "UPDATE bruttest SET cnt = cnt + 1" &
wait
done

I get a failure:
...
iteration 4
UPDATE 1
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost

server.log contains:
==00:00:00:08.877 3381345== Invalid read of size 1
==00:00:00:08.877 3381345== at 0x1E4721: heap_compute_data_size
(heaptuple.c:138)
==00:00:00:08.877 3381345== by 0x1E5674: heap_form_tuple
(heaptuple.c:1061)
==00:00:00:08.877 3381345== by 0x413AF6: tts_buffer_heap_materialize
(execTuples.c:748)
==00:00:00:08.877 3381345== by 0x414CB9: ExecFetchSlotHeapTuple
(execTuples.c:1654)
==00:00:00:08.877 3381345== by 0x3DF10B: ExecBRUpdateTriggers
(trigger.c:3074)
==00:00:00:08.877 3381345== by 0x438E2C: ExecUpdatePrologue
(nodeModifyTable.c:1912)
==00:00:00:08.877 3381345== by 0x43A623: ExecUpdate
(nodeModifyTable.c:2269)
==00:00:00:08.877 3381345== by 0x43D8BD: ExecModifyTable
(nodeModifyTable.c:3856)
==00:00:00:08.877 3381345== by 0x40E7F1: ExecProcNodeFirst
(execProcnode.c:464)
==00:00:00:08.877 3381345== by 0x406E19: ExecProcNode (executor.h:259)
==00:00:00:08.877 3381345== by 0x406E19: ExecutePlan (execMain.c:1636)
==00:00:00:08.877 3381345== by 0x406FF9: standard_ExecutorRun
(execMain.c:363)
==00:00:00:08.877 3381345== by 0x4070C5: ExecutorRun (execMain.c:307)
==00:00:00:08.877 3381345== Address 0x957d114 is in a rw- anonymous
segment
==00:00:00:08.877 3381345==
{
<insert_a_suppression_name_here>
...
==00:00:00:08.877 3381345==
==00:00:00:08.877 3381345== Exit program on first error
(--exit-on-first-error=yes)
2023-02-17 11:27:17.663 MSK [3381268] LOG: server process (PID 3381345)
exited with exit code 1
2023-02-17 11:27:17.663 MSK [3381268] DETAIL: Failed process was running:
UPDATE bruttest SET cnt = cnt + 1

I've also seen a related failure detected by asan (on master):
==1912964==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x7f223dd52c78 at pc 0x55c9e846146a bp 0x7fff0c63e560 sp 0x7fff0c63dd30
WRITE of size 42205122 at 0x7f223dd52c78 thread T0
...
#8 0x000055c9e846953f in __asan::ReportGenericError(unsigned long, unsigned
long, unsigned long, unsigned long, bool, unsigned long, unsigned int, bool)
()
#9 0x000055c9e846148c in __asan_memcpy ()
#10 0x000055c9e84d09c5 in fill_val (att=<optimized out>, bit=<optimized
out>, bitmask=bitmask@entry=0x7fff0c63e620,
dataP=dataP@entry=0x7fff0c63e5e0,
infomask=infomask@entry=0x7f223dd09864, datum=<optimized out>,
isnull=<optimized out>) at heaptuple.c:270
#11 0x000055c9e84d0080 in heap_fill_tuple (tupleDesc=<optimized out>,
values=<optimized out>, isnull=<optimized out>,
data=<optimized out>, data_size=<optimized out>, infomask=<optimized
out>, bit=<optimized out>) at heaptuple.c:336
#12 0x000055c9e84d530e in heap_form_tuple (tupleDescriptor=0x7f22898f76e8,
values=0x1d3084, isnull=0x6250000c85d8)
at heaptuple.c:1090
#13 0x000055c9e8bc8d0e in tts_buffer_heap_materialize (slot=0x6250000c8548)
at execTuples.c:748
#14 0x000055c9e8bcd02a in ExecFetchSlotHeapTuple
(slot=slot@entry=0x6250000c8548, materialize=<optimized out>,
shouldFree=shouldFree@entry=0x7fff0c63e810) at execTuples.c:1654
#15 0x000055c9e8afbd9a in ExecBRUpdateTriggers (estate=<optimized out>,
epqstate=<optimized out>,
relinfo=0x6250000a2e80, tupleid=<optimized out>,
fdw_trigtuple=<optimized out>, newslot=<optimized out>,
tmfd=<optimized out>) at trigger.c:3022
...

According to git bisect, first bad commit for this anomaly is 86dc90056.

That commit changed just one thing in trigger.c:
-            epqslot_clean = ExecFilterJunk(relinfo->ri_junkFilter, epqslot_candidate);
+            epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate,
+                                                  oldslot);

I've found the following explanation for the failure:
1) After the ExecGetUpdateNewTuple() call, the newslot and the oldslot are
linked together (their slot->tts_values[1] in this case point to the same
memory address (inside the oldslot's buffer)).
2) Previously, GetTupleForTrigger() could get a tuple with a new buffer,
so the oldslot would be the only user of the buffer at that moment.
(The newslot doesn't become an official user of the buffer.)
3) Then, trigtuple = ExecFetchSlotHeapTuple(oldslot, ...) invokes
tts_buffer_heap_materialize(), where the oldslot->buffer is released.
4) Finally, newtuple = ExecFetchSlotHeapTuple(newslot, ...) invokes
tts_buffer_heap_materialize(), where an incorrect access to memory
that has become anonymous occurs, which is detected by valgrind.
If not detected, different consequences are possible (in the asan case
it was a memcpy with an incorrectly read, extra-large data_len).

#2Alexander Lakhin
exclusion@gmail.com
In reply to: PG Bug reporting form (#1)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

17.02.2023 13:00, PG Bug reporting form wrote:

The following bug has been logged on the website:
...

I've found the following explanation for the failure:
1) After the ExecGetUpdateNewTuple() call the newslot and the oldslot are
linked together (their slot->tts_values[1] in this case point to the same
memory address (inside the oldslot's buffer)).
2) Previously, GetTupleForTrigger() could get a tuple with a new buffer,
so the oldslot would be the only user of the buffer at that moment.
(The newslot doesn't become an official user of the buffer.)
3) Then, trigtuple = ExecFetchSlotHeapTuple(oldslot, ...) invokes
tts_buffer_heap_materialize() where the oldslot->buffer is released.
4) Finally, newtuple = ExecFetchSlotHeapTuple(newslot, ...) invokes
tts_buffer_heap_materialize() where an incorrect access to memory
that became anonymous occurs, and that is detected by valgrind.
If not detected, different consequences are possible (in the asan case
it was memcpy with an incorrectly read extra large data_len).

I've tried to materialize newslot before the oldslot materialization
(in ExecFetchSlotHeapTuple(), where their common memory is released),
and it looks like it fixes the issue.
A similar thing was done in 75e03eabe.
But I don't know the code well enough, so maybe I'm missing something.

Best regards,
Alexander

Attachments:

v1-01-Fix-memory-access-in-BRU-trigger-DRAFT.patch (text/x-patch; +2-0)
#3Alexander Lakhin
exclusion@gmail.com
In reply to: Alexander Lakhin (#2)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

07.03.2023 09:00, Alexander Lakhin wrote:

I've tried to materialize newslot before the oldslot materialization
(in ExecFetchSlotHeapTuple(), where their common memory is released),
and it looks like it fixes the issue.

I've made a simple isolation test to illustrate the bug, which I'd consider
serious. On master it shows (under valgrind):
# using temp instance on port 61696 with PID 614130
not ok 1     - bru-trigger                              2147 ms
# (test process exited with exit code 1)

src/test/isolation/output_iso/log/postmaster.log contains:
...
==00:00:00:05.840 615284== Invalid read of size 1
==00:00:00:05.840 615284==    at 0x1E376C: heap_compute_data_size (heaptuple.c:147)
==00:00:00:05.840 615284==    by 0x1E4458: heap_form_tuple (heaptuple.c:1061)
==00:00:00:05.840 615284==    by 0x3DB74A: tts_buffer_heap_materialize (execTuples.c:749)
==00:00:00:05.840 615284==    by 0x3DC5EB: ExecFetchSlotHeapTuple (execTuples.c:1655)
==00:00:00:05.840 615284==    by 0x3A6BA7: ExecBRUpdateTriggers (trigger.c:3032)
==00:00:00:05.840 615284==    by 0x3FE207: ExecUpdatePrologue (nodeModifyTable.c:1916)
==00:00:00:05.840 615284==    by 0x3FF838: ExecUpdate (nodeModifyTable.c:2289)
==00:00:00:05.840 615284==    by 0x401BD4: ExecModifyTable (nodeModifyTable.c:3795)
==00:00:00:05.840 615284==    by 0x3D65FF: ExecProcNodeFirst (execProcnode.c:464)
==00:00:00:05.840 615284==    by 0x3CE4F5: ExecProcNode (executor.h:272)
==00:00:00:05.840 615284==    by 0x3CE585: ExecutePlan (execMain.c:1633)
==00:00:00:05.840 615284==    by 0x3CF220: standard_ExecutorRun (execMain.c:364)
...
2023-04-01 14:26:31.543 MSK postmaster[615243] LOG:  server process (PID 615284) exited with exit code 1
2023-04-01 14:26:31.543 MSK postmaster[615243] DETAIL:  Failed process was running: UPDATE bruttest SET cnt = cnt + 1;

Maybe the test could supplement a fix (I'm still unsure how to fix the issue
the right way).

Best regards,
Alexander

Attachments:

BRU-trigger-bug-demo.patch (text/x-patch; +40-0)
#4Richard Guo
guofenglinux@gmail.com
In reply to: Alexander Lakhin (#2)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

On Tue, Mar 7, 2023 at 2:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:

I've found the following explanation for the failure:
1) After the ExecGetUpdateNewTuple() call the newslot and the oldslot are
linked together (their slot->tts_values[1] in this case point to the same
memory address (inside the oldslot's buffer)).
2) Previously, GetTupleForTrigger() could get a tuple with a new buffer,
so the oldslot would be the only user of the buffer at that moment.
(The newslot doesn't become an official user of the buffer.)
3) Then, trigtuple = ExecFetchSlotHeapTuple(oldslot, ...) invokes
tts_buffer_heap_materialize() where the oldslot->buffer is released.
4) Finally, newtuple = ExecFetchSlotHeapTuple(newslot, ...) invokes
tts_buffer_heap_materialize() where an incorrect access to memory
that became anonymous occurs, and that is detected by valgrind.
If not detected, different consequences are possible (in the asan case
it was memcpy with an incorrectly read extra large data_len).

I've tried to materialize newslot before the oldslot materialization
(in ExecFetchSlotHeapTuple(), where their common memory is released),
and it looks like it fixes the issue.

Reproduced this issue on master with your queries. I looked into this
issue and I agree with your analysis. I think this is exactly what
happened.

I also agree that we should materialize the newslot before we fetch
trigtuple from the oldslot, which would materialize the oldslot and
release all buffer pins. But I'm not too familiar with the surrounding
code, so I need someone else to have a look.

Thanks
Richard

#5Richard Guo
guofenglinux@gmail.com
In reply to: Richard Guo (#4)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

On Mon, Apr 3, 2023 at 2:29 PM Richard Guo <guofenglinux@gmail.com> wrote:

Reproduced this issue on master with your queries. I looked into this
issue and I agree with your analysis. I think this is exactly what
happened.

I also agree that we should materialize the newslot before we fetch
trigtuple from the oldslot, which would materialize the oldslot and
release all buffer pins. But I'm not too familiar with the surrounding
code, so I need someone else to have a look.

I've had a second look at this issue, and now I think the fix in the v1
patch is correct. I think the comment needs to be updated for this change,
maybe something like

  * (Typically, newslot was also generated by ExecGetUpdateNewTuple, so
  * that epqslot_clean will be that same slot and the copy step below
- * is not needed.)
+ * is not needed.  And we need to materialize newslot in this case,
+ * since its tuple might be dependent on oldslot's storage, which
+ * might not be a local copy and be freed before we fetch newslot's
+ * tuple.)

Thanks
Richard

#6Alexander Lakhin
exclusion@gmail.com
In reply to: Richard Guo (#5)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

Hello Richard,

19.04.2023 11:29, Richard Guo wrote:

I've had a second look at this issue, and now I think the fix in the v1
patch is correct.  I think the comment needs to be updated for this change,
maybe something like

  * (Typically, newslot was also generated by ExecGetUpdateNewTuple, so
  * that epqslot_clean will be that same slot and the copy step below
- * is not needed.)
+ * is not needed.  And we need to materialize newslot in this case,
+ * since its tuple might be dependent on oldslot's storage, which
+ * might not be a local copy and be freed before we fetch newslot's
+ * tuple.)

Thanks for looking into this!

I've updated the comment based on your suggestion. Please look at the patch
attached.

I hesitate to add the isolation test to the patch, as adding a dedicated
test just for this case seems too wasteful to me. Maybe that scenario could
be added to one of the existing trigger-related tests, but I couldn't find a
test generic enough for that.

Best regards,
Alexander

Attachments:

v2-01-Fix-memory-access-in-BRU-trigger.patch (text/x-patch; +5-1)
#7Alexander Lakhin
exclusion@gmail.com
In reply to: Alexander Lakhin (#6)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

Hello Tom,

20.04.2023 13:00, Alexander Lakhin wrote:

19.04.2023 11:29, Richard Guo wrote:

I've had a second look at this issue, and now I think the fix in the v1
patch is correct.  I think the comment needs to be updated for this change,
maybe something like

I've updated the comment based on your suggestion. Please look at the patch
attached.

Could you please confirm the issue, Richard's and my conclusions, and the
correctness of the patch, in light of the upcoming May releases?
Maybe I'm wrong, but I think that this bug can lead to data corruption in
databases where BRU triggers are used.

If the defect is real, but the patch proposed is not good enough, I would
register it on the commitfest to continue working on it.

Best regards,
Alexander

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Lakhin (#7)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

Alexander Lakhin <exclusion@gmail.com> writes:

Could you please confirm the issue, Richard's and my conclusions, and the
correctness of the patch, in light of the upcoming May releases?
Maybe I'm wrong, but I think that this bug can lead to data corruption in
databases where BRU triggers are used.

Clearly it could, though the probability seems low since the just-released
buffer would have to get recycled very quickly.

I am not entirely comfortable with blaming this on 86dc90056 though,
even though "git bisect" seems to finger that. The previous coding
there with ExecFilterJunk() also produced a virtual tuple, as you
can see by examining ExecFilterJunk, so why didn't the problem
manifest before? I think the answer may be that before 86dc90056,
we were effectively relying on the underlying SeqScan node to keep
the buffer pinned, but now we're re-fetching the tuple and no pin
is held by lower plan nodes. I'm not entirely sure about that though;
ISTM a SeqScan ought to hold pin on its current tuple in any case.
However, if the UPDATE plan were more complicated (involving a join)
then it's quite believable that we'd get here with no relevant pin
being held.

The practical impact of that concern is that I'm not convinced that
the proposed patch fixes all variants of the issue. Maybe we need
to materialize even if newslot != epqslot_clean. Maybe we need to
do it even if there wasn't a concurrent update. Maybe we need to
do it in pre-v14 branches, too. It's certainly not obvious that this
function ought to assume anything about which slots are pointing at
what.

Also, if we do materialize here, does this code a little further down
become redundant?

if (should_free_trig && newtuple == trigtuple)
ExecMaterializeSlot(newslot);

A completely different way of looking at it is that we should not
treat this as ExecBRUpdateTriggers's bug at all. The slot mechanisms
are supposed to protect the data referenced by a slot, so why is that
failing to happen in this example? The correct fix might involve
newslot acquiring a buffer pin, for example.

Bottom line is that I'm afraid the present patch is band-aiding a
specific problem without solving the general case.

regards, tom lane

#9Alexander Lakhin
exclusion@gmail.com
In reply to: Tom Lane (#8)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

30.04.2023 01:24, Tom Lane wrote:

Alexander Lakhin <exclusion@gmail.com> writes:

Could you please confirm the issue, Richard's and my conclusions, and the
correctness of the patch, in light of the upcoming May releases?
Maybe I'm wrong, but I think that this bug can lead to data corruption in
databases where BRU triggers are used.

Clearly it could, though the probability seems low since the just-released
buffer would have to get recycled very quickly.

So I've made several mistakes while trying to reach a solution walking on
thin ice. I'd initially observed the issue with asan (perhaps under some
specific conditions, such as memory pressure; I can't say now), but when I
produced and simplified the repro for the bug report, I was focused on
valgrind, so the final repro doesn't demonstrate real memory corruption,
and it's not clear now which conditions are needed to cause it.
So I really can't estimate the probability of the corruption.

I am not entirely comfortable with blaming this on 86dc90056 though,
even though "git bisect" seems to finger that. The previous coding
there with ExecFilterJunk() also produced a virtual tuple, as you
can see by examining ExecFilterJunk, so why didn't the problem
manifest before? I think the answer may be that before 86dc90056,
we were effectively relying on the underlying SeqScan node to keep
the buffer pinned, but now we're re-fetching the tuple and no pin
is held by lower plan nodes. I'm not entirely sure about that though;
ISTM a SeqScan ought to hold pin on its current tuple in any case.
However, if the UPDATE plan were more complicated (involving a join)
then it's quite believable that we'd get here with no relevant pin
being held.

Yeah, also the environment changed since 86dc90056, so I couldn't just revert
that commit to test the ExecFilterJunk() behavior on master.

All the questions you raised require a more thorough investigation.

The practical impact of that concern is that I'm not convinced that
the proposed patch fixes all variants of the issue. Maybe we need
to materialize even if newslot != epqslot_clean. Maybe we need to
do it even if there wasn't a concurrent update. Maybe we need to
do it in pre-v14 branches, too. It's certainly not obvious that this
function ought to assume anything about which slots are pointing at
what.

Also, if we do materialize here, does this code a little further down
become redundant?

if (should_free_trig && newtuple == trigtuple)
ExecMaterializeSlot(newslot);

A completely different way of looking at it is that we should not
treat this as ExecBRUpdateTriggers's bug at all. The slot mechanisms
are supposed to protect the data referenced by a slot, so why is that
failing to happen in this example? The correct fix might involve
newslot acquiring a buffer pin, for example.

Yes, it was tough to decide how to fix it (and correct buffer pinning
seemed more plausible), but 75e03eabe gave me hope that it could be fixed
by a simple one-line change.

Bottom line is that I'm afraid the present patch is band-aiding a
specific problem without solving the general case.

I agree with your analysis and conclusions, so I'm going to turn the alarm
mode off and research the issue deeper.

Thank you for the detailed answer!

Best regards,
Alexander

#10Peter Geoghegan
In reply to: Alexander Lakhin (#9)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

On Sun, Apr 30, 2023 at 5:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

Yeah, also the environment changed since 86dc90056, so I couldn't just revert
that commit to test the ExecFilterJunk() behavior on master.

All the questions you raised require a more thorough investigation.

Are you aware of the fact that Valgrind has custom instrumentation
that makes it directly capable of detecting access to no-longer-pinned
buffers? See commits 7b7ed046 and 1e0dfd16.

I think that that may be a factor here. If it is, then it's a little
surprising that you ever found the problem with ASAN, since of course
we don't have custom ASAN instrumentation that tells ASAN things like
"until I say otherwise, it is not okay to read from this range of
memory, which is this backend's memory mapping for an individual
shared buffer that was just unpinned".

--
Peter Geoghegan

#11Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#8)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

Hi,

On 2023-04-29 18:24:47 -0400, Tom Lane wrote:

A completely different way of looking at it is that we should not
treat this as ExecBRUpdateTriggers's bug at all. The slot mechanisms
are supposed to protect the data referenced by a slot, so why is that
failing to happen in this example? The correct fix might involve
newslot acquiring a buffer pin, for example.

I think the slot mechanism today (and historically) doesn't protect against
quite a few such scenarios - and it's not clear how to change that. Most of
the scenarios where we need to materialize are because the underlying memory
is going to be released / reset, i.e. we don't hold a buffer pin in the first
place. I guess we could try to protect against that by registering a memory
context hook, but that'd be a very heavyweight mechanism.

Greetings,

Andres Freund

#12Alexander Lakhin
exclusion@gmail.com
In reply to: Peter Geoghegan (#10)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

Hello Peter,

30.04.2023 23:39, Peter Geoghegan wrote:

On Sun, Apr 30, 2023 at 5:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

Yeah, also the environment changed since 86dc90056, so I couldn't just revert
that commit to test the ExecFilterJunk() behavior on master.

All the questions you raised require a more thorough investigation.

Are you aware of the fact that Valgrind has custom instrumentation
that makes it directly capable of detecting access to no-longer-pinned
buffers? See commits 7b7ed046 and 1e0dfd16.

Yes, I am. That's why I can't demonstrate real risks now — my repro
triggers a Valgrind warning only, and no other consequences are visible.

I think that that may be a factor here. If it is, then it's a little
surprising that you ever found the problem with ASAN, since of course
we don't have custom ASAN instrumentation that tells ASAN things like
"until I say otherwise, it is not okay to read from this range of
memory, which is this backend's memory mapping for an individual
shared buffer that was just unpinned".

ASAN detected a consequence of the invalid memory read:
==1912964==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x7f223dd52c78 at pc 0x55c9e846146a bp 0x7fff0c63e560 sp 0x7fff0c63dd30
WRITE of size 42205122 at 0x7f223dd52c78 thread T0
...
#10 0x000055c9e84d09c5 in fill_val (...) at heaptuple.c:270
...
Note the abnormal size 42205122 (the target table contained much shorter
values at the moment). When exploring the coredump and the source code,
I found that the invalid size was obtained as follows:
ExecBRUpdateTriggers(): newtuple = ExecFetchSlotHeapTuple(newslot, true, &should_free_new);
ExecFetchSlotHeapTuple(): slot->tts_ops->materialize(slot)
tts_buffer_heap_materialize(): bslot->base.tuple = heap_form_tuple(...)
heap_form_tuple(): data_len = heap_compute_data_size(tupleDescriptor, values, isnull);
heap_compute_data_size(): here garbage data were read.

So I'm going to start over and find a repro that shows something
interesting without Valgrind (and lets us estimate the probability of a
problem in the field).

Best regards,
Alexander

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Lakhin (#12)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

Alexander Lakhin <exclusion@gmail.com> writes:

So I'm going to start over and find a repro that shows something
interesting without Valgrind (and lets us estimate the probability of a
problem in the field).

I think a probability estimate is not all that exciting: "it's low
but not zero" is sufficient information to act on. What I am curious
about is how come 86dc90056's changes triggered a visible problem.
Understanding that might give us a better feel for what is the real
structural issue here --- which I'm worried about because I have
little confidence that there aren't similar bugs lurking in other
code paths.

regards, tom lane

#14Peter Geoghegan
In reply to: Alexander Lakhin (#12)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

On Sun, Apr 30, 2023 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:

Yes, I am. That's why I can't demonstrate real risks now — my repro
triggers a Valgrind warning only, and no other consequences are visible.

It would be nice to be able to say something about the severity with
high confidence, but this seems like the kind of thing that rarely
goes that way. Does it really matter? A bug is a bug.

--
Peter Geoghegan

#15Alexander Lakhin
exclusion@gmail.com
In reply to: Tom Lane (#13)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

01.05.2023 07:04, Tom Lane wrote:

Alexander Lakhin <exclusion@gmail.com> writes:

So I'm going to start over and find a repro that shows something
interesting without Valgrind (and lets us estimate the probability of a
problem in the field).

I think a probability estimate is not all that exciting: "it's low
but not zero" is sufficient information to act on. What I am curious
about is how come 86dc90056's changes triggered a visible problem.
Understanding that might give us a better feel for what is the real
structural issue here --- which I'm worried about because I have
little confidence that there aren't similar bugs lurking in other
code paths.

I've found one significant difference between the old and new code paths.
On 86dc90056 I see the following call chain:
ExecGetUpdateNewTuple(planSlot = epqslot_candidate, oldSlot = oldslot):
econtext->ecxt_scantuple = oldSlot;
-> internalGetUpdateNewTuple() -> ExecProject(projInfo=relinfo->ri_projectNew)
-> ExecEvalExprSwitchContext(state=projInfo->pi_state, econtext=projInfo->pi_exprContext)
-> state->evalfunc(state, econtext, isNull);
-> ExecInterpExpr():
scanslot = econtext->ecxt_scantuple;
...
EEOP_SCAN_FETCHSOME: slot_getsomeattrs(scanslot, op->d.fetch.last_var);

So someattrs loaded into oldslot, which is released after
        trigtuple = ExecFetchSlotHeapTuple(oldslot, true, &should_free_trig)

But before 86dc90056 we had:
ExecFilterJunk(junkfilter=relinfo->ri_junkFilter, slot=epqslot_candidate):
-> slot_getallattrs(slot);
-> slot_getsomeattrs(slot, slot->tts_tupleDescriptor->natts);

I.e., the attributes were loaded into the epqslot_candidate slot,
which wasn't touched by ExecFetchSlotHeapTuple(oldslot, ...), so
the newslot's contents survived the call.

It's not a final conclusion, but maybe it could be helpful for someone who
investigates it too.

Best regards,
Alexander

#16Alexander Lakhin
exclusion@gmail.com
In reply to: Tom Lane (#8)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

Hello Tom,

I've found enough time this week to investigate the issue more deeply,
and now, hopefully, I can answer your questions.

30.04.2023 01:24, Tom Lane wrote:

I am not entirely comfortable with blaming this on 86dc90056 though,
even though "git bisect" seems to finger that. The previous coding
there with ExecFilterJunk() also produced a virtual tuple, as you
can see by examining ExecFilterJunk, so why didn't the problem
manifest before? I think the answer may be that before 86dc90056,
we were effectively relying on the underlying SeqScan node to keep
the buffer pinned, but now we're re-fetching the tuple and no pin
is held by lower plan nodes. I'm not entirely sure about that though;
ISTM a SeqScan ought to hold pin on its current tuple in any case.
However, if the UPDATE plan were more complicated (involving a join)
then it's quite believable that we'd get here with no relevant pin
being held.

I couldn't confirm that the issue is specific to a SeqScan plan; it is
reproduced (and fixed by the proposed patch) with a more complex query:
cat << "EOF" | psql
CREATE TABLE bruttest(id int, cnt int DEFAULT 0, t text);
CREATE FUNCTION noop_tfunc() RETURNS trigger LANGUAGE plpgsql AS $$ BEGIN
RETURN NEW; END$$;
CREATE TRIGGER brupdate_trigger BEFORE UPDATE ON bruttest FOR EACH ROW
EXECUTE FUNCTION noop_tfunc();
INSERT INTO bruttest(id, t) VALUES (1, repeat('x', 1000));

CREATE TABLE t2(id int, delta int);
INSERT INTO t2 VALUES(1, 1);
EOF

for ((i=1;i<=4;i++)); do
  echo "iteration $i"
  psql -c "UPDATE bruttest SET cnt = cnt + delta FROM t2 WHERE bruttest.id = t2.id" &
  psql -c "UPDATE bruttest SET cnt = cnt + delta FROM t2 WHERE bruttest.id = t2.id" &
  wait
done

---
psql -c "EXPLAIN UPDATE bruttest SET cnt = cnt + delta FROM t2 WHERE bruttest.id = t2.id"
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Update on bruttest  (cost=241.88..485.18 rows=0 width=0)
   ->  Merge Join  (cost=241.88..485.18 rows=13560 width=16)
         Merge Cond: (bruttest.id = t2.id)
         ->  Sort  (cost=83.37..86.37 rows=1200 width=14)
               Sort Key: bruttest.id
               ->  Seq Scan on bruttest  (cost=0.00..22.00 rows=1200 width=14)
         ->  Sort  (cost=158.51..164.16 rows=2260 width=14)
               Sort Key: t2.id
               ->  Seq Scan on t2  (cost=0.00..32.60 rows=2260 width=14)
(9 rows)

IIUC, the buffer is pinned by oldslot only, and newslot uses it without
explicit pinning because of the following call chain:
epqslot_clean = ExecGetUpdateNewTuple(relinfo=relinfo, planSlot=epqslot_candidate, oldSlot=oldslot):
    ProjectionInfo *newProj = relinfo->ri_projectNew;
    ...
    econtext->ecxt_scantuple = oldSlot;
    ...
    ExecProject(projInfo=newProj):
        ExprState  *state = &projInfo->pi_state;
        ...
        ExecEvalExprSwitchContext(state=state, econtext=projInfo->pi_exprContext):
            state->evalfunc(state, econtext, isNull) -> ExecInterpExpr(state, econtext, isNull):
                resultslot = state->resultslot;
                ...
                scanslot = econtext->ecxt_scantuple;
                ...
                EEOP_SCAN_FETCHSOME:
                    slot_getsomeattrs(scanslot, op->d.fetch.last_var);
                ...
                EEOP_ASSIGN_SCAN_VAR:
                    resultslot->tts_values[resultnum] = scanslot->tts_values[attnum];

Note that ExecGetUpdateNewTuple() returns relinfo->ri_projectNew->pi_state.resultslot,
so the only place where the buffer could be pinned for newslot correctly is
EEOP_ASSIGN_SCAN_VAR, but I think that changing something there is not an option.

> The practical impact of that concern is that I'm not convinced that
> the proposed patch fixes all variants of the issue. Maybe we need
> to materialize even if newslot != epqslot_clean.

I investigated all code paths and came to the conclusion that the case
newslot != epqslot_clean is impossible.

1) When ExecBRUpdateTriggers() is called via ExecUpdatePrologue(), ...,
ExecModifyTable(), we have:
ExecModifyTable():
    /* Initialize projection info if first time for this table */
    if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
        ExecInitUpdateProjection(node, resultRelInfo);

    slot = ExecGetUpdateNewTuple(resultRelInfo, context.planSlot, oldSlot);
...
    ExecUpdate(..., resultRelInfo=resultRelInfo, ..., slot=slot, ...):
        ExecUpdatePrologue(..., resultRelInfo=resultRelInfo, ... , ..., slot=slot, ...):
            ExecMaterializeSlot(slot);
            ...
            ExecBRUpdateTriggers(..., relinfo=resultRelInfo, ..., newslot=slot, ...)
                epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate, oldslot);
(so epqslot_clean and newslot are both equal to
relinfo->ri_projectNew->pi_state.resultslot)

2) When ExecBRUpdateTriggers() is called via ExecMergeMatched(), ...,
ExecMerge(), epqslot_candidate is always NULL, because of
                    /*
                     * Recheck the tuple using EPQ. For MERGE, we leave this
                     * to the caller (it must do additional rechecking, and
                     * might end up executing a different action entirely).
                     */
inside GetTupleForTrigger().

3) When ExecBRUpdateTriggers() is called via ExecSimpleRelationUpdate(), ...,
apply_handle_tuple_routing():
FindReplTupleInLocalRel()
...
ExecSimpleRelationUpdate()
A concurrent update of the target tuple is prevented by table_tuple_lock() in
RelationFindReplTupleSeq()/RelationFindReplTupleByIndex(), called from
FindReplTupleInLocalRel().

Moreover, if epqslot_candidate != NULL were true when ExecBRUpdateTriggers()
is called from ExecSimpleRelationUpdate()/ExecMergeMatched(), then we would
get a crash inside
epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate, oldslot);
because of
ExecGetUpdateNewTuple(relinfo, ...)
        ProjectionInfo *newProj = relinfo->ri_projectNew;
...
        Assert(relinfo->ri_projectNewInfoValid);
...
        econtext = newProj->pi_exprContext;

as ExecInitUpdateProjection(), which initializes ri_projectNew, is called only
from ExecModifyTable().

So I think it's safe to assume that newslot == epqslot_clean when
epqslot_candidate != NULL, and thus the ExecCopySlot() call may be removed.

> Maybe we need to do it even if there wasn't a concurrent update.

Without a concurrent update, i.e. when epqslot_candidate == NULL,
newslot is filled earlier, in ExecModifyTable():
    oldSlot = resultRelInfo->ri_oldTupleSlot;
...
    slot = ExecGetUpdateNewTuple(resultRelInfo, context.planSlot, oldSlot);
so newslot uses the oldSlot existing in ExecModifyTable(), not the oldslot
created in ExecBRUpdateTriggers().

> Maybe we need to do it in pre-v14 branches, too.

As I noted in my previous message, before 86dc90056 a different slot was
used when getting "some attrs", so I see no need to fix that state.

> It's certainly not obvious that this function ought to assume anything
> about which slots are pointing at what.

I think the problem is in the sequence of calls:

        epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate, oldslot);
        trigtuple = ExecFetchSlotHeapTuple(oldslot, true, &should_free_trig);

As ExecGetUpdateNewTuple() can fill attributes in oldslot and
ExecFetchSlotHeapTuple() can release the oldslot's underlying storage (as
the comment for that function says), ExecMaterializeSlot() should be called
in between to save the new tuple contents.

> Also, if we do materialize here, does this code a little further down
> become redundant?
>
>     if (should_free_trig && newtuple == trigtuple)
>         ExecMaterializeSlot(newslot);

That is another interesting case, and I think it's a separate one. I
could not reproduce that issue (commit 75e03eabea from 2019 mentions
zheap), but I suppose it was real at the time of that commit, which was
two years before 86dc90056. So I think that an ExecMaterializeSlot() inside
the condition "if (epqslot_candidate != NULL)" would not cover that case.

As an outcome, I propose an improved patch, which:
1) Removes the unreachable ExecCopySlot() call;
2) Adds one ExecMaterializeSlot() call, which covers both cases;
3) Removes the separate ExecMaterializeSlot() call, which becomes redundant.

It simplifies ExecBRUpdateTriggers() a little, but if it might affect
performance significantly (I don't see how), maybe leave the previous
ExecMaterializeSlot() call intact and add another one just after
epqslot_clean = ExecGetUpdateNewTuple(...);
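For illustration, the resulting control flow could look roughly like the sketch below. This is my abbreviation based on the discussion above, not the actual patch text; surrounding code and error handling are omitted, and it is not compilable on its own:

```
if (epqslot_candidate != NULL)
{
    TupleTableSlot *epqslot_clean;

    epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate, oldslot);

    /* per the path analysis above, the copy branch is unreachable */
    Assert(epqslot_clean == newslot);
}

/*
 * Materialize newslot so it no longer depends on oldslot's buffer
 * (which ExecFetchSlotHeapTuple() below may release) or on trigtuple
 * storage freed at the end of this function.
 */
ExecMaterializeSlot(newslot);

trigtuple = ExecFetchSlotHeapTuple(oldslot, true, &should_free_trig);
```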

> A completely different way of looking at it is that we should not
> treat this as ExecBRUpdateTriggers's bug at all. The slot mechanisms
> are supposed to protect the data referenced by a slot, so why is that
> failing to happen in this example? The correct fix might involve
> newslot acquiring a buffer pin, for example.

Unfortunately, I could not find a place where such a pin could be acquired
(except for EEOP_ASSIGN_SCAN_VAR at the lower level).

> Bottom line is that I'm afraid the present patch is band-aiding a
> specific problem without solving the general case.

The next one is more generic, but still limited to the bounds of
ExecBRUpdateTriggers(), as I still see no problem outside them.

Best regards,
Alexander

Attachments:

v3-01-Fix-memory-access-in-BRU-trigger.patch (text/x-patch, +12/-21)
#17Alexander Lakhin
exclusion@gmail.com
In reply to: Alexander Lakhin (#16)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

Hello Tom,

09.07.2023 18:00, Alexander Lakhin wrote:

> 30.04.2023 01:24, Tom Lane wrote:
>> Bottom line is that I'm afraid the present patch is band-aiding a
>> specific problem without solving the general case.
>
> The next one is more generic, but still limited to the ExecBRUpdateTriggers()
> bounds, as I still see no problem outside.

If this issue is not approaching the top of your todo list yet and you don't
mind, I'll queue this patch on the commitfest.

Best regards,
Alexander

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Lakhin (#17)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

Alexander Lakhin <exclusion@gmail.com> writes:

> If this issue is not approaching the top of your todo list yet and you don't
> mind, I'll queue this patch on the commitfest.

Please do, along with anything else that you're worried will get lost.

regards, tom lane

#19Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Lakhin (#16)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

On Sun, Jul 9, 2023 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

> I've found enough time this week to investigate the issue deeper and now
> I can answer your questions, hopefully.

This patch lacks a commit message, which I think would be a good thing
to add. Even if somebody doesn't use the message you write, it would
help with understanding the patch itself. Deleting the call to
ExecMaterializeSlot with no explanation in the patch file of why
that's OK to do is not ideal. But the real heart of this small patch
seems to be this:

+               /*
+                * We need to materialize newslot because 1) it might share
+                * oldslot's buffer, which might be released on fetching
+                * trigtuple from oldslot if oldslot is the only owner of
+                * buffer, and 2) if some trigger returns/stores the old
+                * trigtuple, and the heap tuple passed to the trigger was
+                * located locally, newslot might reference that tuple
+                * storage, which is freed at the end of this function.
+                */

Reading this, I found (1) to be very surprising. I would expect that
advancing the scan would potentially release the buffer pin, but
fetching the tuple shouldn't advance the scan. I guess this is
referring to the call, just below, to ExecFetchSlotHeapTuple. I'm
guessing that can end up calling tts_buffer_heap_materialize which,
after materializing (copying) the tuple, thinks that it's OK to release
the buffer pin. But that makes me wonder ... how exactly do oldslot
and newslot end up sharing the same buffer without holding two pins on
it? tts_heap_copyslot() looks like it basically forces the tuple to be
materialized first and then copies that, so you'd end up with, I
guess, no buffer pins at all, rather than 1 or 2.

I'm obviously missing something important here...

--
Robert Haas
EDB: http://www.enterprisedb.com

#20Alexander Lakhin
exclusion@gmail.com
In reply to: Robert Haas (#19)
Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

Hello Robert,

06.01.2024 00:18, Robert Haas wrote:

> On Sun, Jul 9, 2023 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>> I've found enough time this week to investigate the issue deeper and now
>> I can answer your questions, hopefully.
>
> This patch lacks a commit message, which I think would be a good thing
> to add. Even if somebody doesn't use the message you write, it would
> help with understanding the patch itself. Deleting the call to
> ExecMaterializeSlot with no explanation in the patch file of why
> that's OK to do is not ideal.

Thanks for looking into this!

I've added a commit message, please look at the new patch version.

> But the real heart of this small patch
> seems to be this:
>
> +               /*
> +                * We need to materialize newslot because 1) it might share
> +                * oldslot's buffer, which might be released on fetching
> +                * trigtuple from oldslot if oldslot is the only owner of
> +                * buffer, and 2) if some trigger returns/stores the old
> +                * trigtuple, and the heap tuple passed to the trigger was
> +                * located locally, newslot might reference that tuple
> +                * storage, which is freed at the end of this function.
> +                */

> Reading this, I found (1) to be very surprising. I would expect that
> advancing the scan would potentially release the buffer pin, but
> fetching the tuple shouldn't advance the scan. I guess this is
> referring to the call, just below, to ExecFetchSlotHeapTuple. I'm
> guessing that can end up calling tts_buffer_heap_materialize which,
> after materializing (copying) the tuple, thinks that it's OK to release
> the buffer pin.

Yes, this is what is happening there, as I diagnosed in [1].

> But that makes me wonder ... how exactly do oldslot
> and newslot end up sharing the same buffer without holding two pins on
> it? tts_heap_copyslot() looks like it basically forces the tuple to be
> materialized first and then copies that, so you'd end up with, I
> guess, no buffer pins at all, rather than 1 or 2.
>
> I'm obviously missing something important here...

As my analysis [2] showed, epqslot_clean is always equal to newslot, so
ExecCopySlot() isn't called at all.

[1]: /messages/by-id/17798-0907404928dcf0dd@postgresql.org
[2]: /messages/by-id/e989408c-1838-61cd-c968-1fcf47c6fbba@gmail.com

Best regards,
Alexander

Attachments:

v4-0001-Fix-potential-use-after-free-for-BEFORE-ROW-UPDATE-t.patch (text/x-patch, +12/-22)
#21Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Lakhin (#20)
#22Alexander Lakhin
exclusion@gmail.com
In reply to: Robert Haas (#21)
#23Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Lakhin (#22)
#24Alexander Lakhin
exclusion@gmail.com
In reply to: Robert Haas (#23)
#25Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Lakhin (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#25)
#27Alexander Lakhin
exclusion@gmail.com
In reply to: Tom Lane (#26)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Lakhin (#27)
#29Alexander Lakhin
exclusion@gmail.com
In reply to: Robert Haas (#28)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Lakhin (#29)
#31Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#30)
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#31)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#32)
#34Alexander Lakhin
exclusion@gmail.com
In reply to: Tom Lane (#33)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Lakhin (#34)
#36Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#33)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#36)