pgsql: Allow using the updated tuple while moving it to a different par

Started by Amit Kapila · 8 messages · pgsql-hackers
#1 Amit Kapila
amit.kapila16@gmail.com

Allow using the updated tuple while moving it to a different partition.

An update that causes the tuple to be moved to a different partition was
missing out on re-constructing the to-be-updated tuple, based on the latest
tuple in the update chain. Instead, it's simply deleting the latest tuple
and inserting a new tuple in the new partition based on the old tuple.
Commit 2f17844104 didn't consider this case, so some of the updates were
getting lost.

In passing, change the argument order for output parameter in ExecDelete
and add some commentary about it.

Reported-by: Pavan Deolasee
Author: Amit Khandekar, with minor changes by me
Reviewed-by: Dilip Kumar, Amit Kapila and Alvaro Herrera
Backpatch-through: 11
Discussion: /messages/by-id/CAJ3gD9fRbEzDqdeDq1jxqZUb47kJn+tQ7=Bcgjc8quqKsDViKQ@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/40ca70ebcc9d0bc1c02937b27c44b2766617e790

Modified Files
--------------
src/backend/commands/trigger.c | 22 ++++++-
src/backend/executor/execReplication.c | 2 +-
src/backend/executor/nodeModifyTable.c | 73 +++++++++++++++------
src/include/commands/trigger.h | 3 +-
.../isolation/expected/partition-key-update-4.out | 60 +++++++++++++++++
src/test/isolation/isolation_schedule | 1 +
.../isolation/specs/partition-key-update-4.spec | 76 ++++++++++++++++++++++
7 files changed, 214 insertions(+), 23 deletions(-)

#2 Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#1)
Re: pgsql: Allow using the updated tuple while moving it to a different par

On Thu, Jul 12, 2018 at 07:33:54AM +0000, Amit Kapila wrote:

Allow using the updated tuple while moving it to a different partition.

An update that causes the tuple to be moved to a different partition was
missing out on re-constructing the to-be-updated tuple, based on the latest
tuple in the update chain. Instead, it's simply deleting the latest tuple
and inserting a new tuple in the new partition based on the old tuple.
Commit 2f17844104 didn't consider this case, so some of the updates were
getting lost.

In passing, change the argument order for output parameter in ExecDelete
and add some commentary about it.

Looks unrelated, but flaviventris (a snake?) has just complained:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2018-07-12%2007%3A36%3A01
--
Michael

#3 Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#2)
Re: pgsql: Allow using the updated tuple while moving it to a different par

On Thu, Jul 12, 2018 at 1:29 PM, Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Jul 12, 2018 at 07:33:54AM +0000, Amit Kapila wrote:

Allow using the updated tuple while moving it to a different partition.

An update that causes the tuple to be moved to a different partition was
missing out on re-constructing the to-be-updated tuple, based on the latest
tuple in the update chain. Instead, it's simply deleting the latest tuple
and inserting a new tuple in the new partition based on the old tuple.
Commit 2f17844104 didn't consider this case, so some of the updates were
getting lost.

In passing, change the argument order for output parameter in ExecDelete
and add some commentary about it.

Looks unrelated, but flaviventris (a snake?) has just complained:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2018-07-12%2007%3A36%3A01

Yeah, I am confused as to how this commit can lead to that failure.
Anyway, just to clarify my understanding: it seems that the master node
executed an INSERT statement (INSERT INTO replayed(val) SELECT
coalesce(max(val),0) + 1 AS newval FROM replayed RETURNING val) and was
waiting for the standby to sync up. In the meantime, some background
process (probably the walsender?) crashed. Are you seeing anything
different? Any other clue?

By the way, I could see the same (what appears to be the same) failure
in the logs of other buildfarm members (skink, serinus, etc.) from some
previous runs:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-11-30%2008%3A47%3A34
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2018-05-18%2020%3A14%3A01

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#3)
Re: pgsql: Allow using the updated tuple while moving it to a different par

Amit Kapila <amit.kapila16@gmail.com> writes:

On Thu, Jul 12, 2018 at 1:29 PM, Michael Paquier <michael@paquier.xyz> wrote:

Looks unrelated, but flaviventris (a snake?) has just complained:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2018-07-12%2007%3A36%3A01

Yeah, I am confused as to how this commit can lead to that failure.

It didn't; looks like just a random failure to me. The test is timing
out:

ok 26 - catalog xmin of cascaded slot still null with hs_feedback reset
# re-enabling hot_standby_feedback and disabling while stopped
### Reloading node "standby_2"
# Running: pg_ctl -D /home/bf/build/buildfarm-flaviventris/HEAD/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdata reload
server signaled
Waiting for replication conn standby_1's replay_lsn to pass '0/30A1720' on master
done
Waiting for replication conn standby_2's replay_lsn to pass '0/30A1720' on standby_1
done
### Stopping node "standby_2" using mode fast
# Running: pg_ctl -D /home/bf/build/buildfarm-flaviventris/HEAD/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdata -m fast stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "standby_2"
#
Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 157.

This looks to me like it's probably a low-probability timing problem in
the test script itself. The relevant part of the script is

$node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES (11000);]);
replay_check();

$node_standby_2->safe_psql('postgres',
'ALTER SYSTEM SET hot_standby_feedback = off;');
$node_standby_2->stop;

($xmin, $catalog_xmin) =
get_slot_xmins($node_standby_1, $slotname_2, "xmin IS NOT NULL");
isnt($xmin, '', 'xmin of cascaded slot non-null with postgres shut down');

where the failure is occurring because get_slot_xmins() never sees the
expected state. I'm suspicious that this is happening because the script
is making no attempt to ensure the relative timing of events on the three
servers --- notably, issuing a "reload" does nothing to guarantee that
the server has actually seen and responded to the parameter change.

As an extreme case, I can reproduce the failure exactly with this:

diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index a0d3e8f..0824184 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -286,7 +286,7 @@ is($catalog_xmin, '',
 note "re-enabling hot_standby_feedback and disabling while stopped";
 $node_standby_2->safe_psql('postgres',
    'ALTER SYSTEM SET hot_standby_feedback = on;');
-$node_standby_2->reload;
+# $node_standby_2->reload;

$node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES (11000);]);
replay_check();

so I'm thinking that what we're seeing is an effect of the second standby
sometimes being too slow to turn on hot_standby_feedback.

We could maybe hack around that with a short "sleep" in this script, but
a less cruddy solution would be to change PostgresNode::reload so that
it somehow waits for the server to process the reload signal.
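A reload-that-waits would follow the usual poll-until-condition shape the TAP tests use elsewhere. A minimal Python sketch of that idea (the helper names and parameters here are illustrative, not the actual PostgresNode API):

```python
import time

def reload_and_wait(send_sighup, setting_is_applied,
                    timeout=30.0, interval=0.1):
    """Signal a reload, then poll until the server actually reports the
    new setting, instead of assuming the SIGHUP was processed instantly."""
    send_sighup()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if setting_is_applied():
            return True
        time.sleep(interval)
    return False  # timed out; the caller decides whether to fail the test
```

In the real script, `setting_is_applied` would be something like running `SHOW hot_standby_feedback` against the standby and comparing the result to the expected value.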

regards, tom lane

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#4)
YA race condition in 001_stream_rep.pl (was Re: pgsql: Allow using the updated tuple while moving it to a different par)

I wrote:

This looks to me like it's probably a low-probability timing problem in
the test script itself.
...
We could maybe hack around that with a short "sleep" in this script, but
a less cruddy solution would be to change PostgresNode::reload so that
it somehow waits for the server to process the reload signal.

On further reflection, I doubt that would help. The issue isn't whether
the standby_2 postmaster has processed the SIGHUP, but whether its
walreceiver process has sent a feedback message. The loop in
WalReceiverMain does things in this order:

1. Check for SIGTERM, exit if seen.

2. Check for SIGHUP, process that (and possibly send a feedback message)
if seen.

3. Read and process streaming data till no more is available.
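That ordering is exactly what allows a pending reload to be lost: if the walreceiver sits in step 3 until shutdown is requested, then on the next trip around the loop the SIGTERM check (step 1) fires before the SIGHUP check (step 2). A toy Python model of the loop ordering, with illustrative event names rather than the actual C symbols:

```python
def run_walreceiver(steps):
    """Toy model of WalReceiverMain's outer loop. Each entry in `steps`
    is the set of signals pending when the loop comes back around; the
    data-processing phase in between is where signals can pile up."""
    feedback_sent = False
    for pending in steps:
        if 'SIGTERM' in pending:   # step 1: checked first, exit at once
            return feedback_sent
        if 'SIGHUP' in pending:    # step 2: reload, possibly send feedback
            feedback_sent = True
        # step 3: read and process streaming data until none is available
    return feedback_sent
```

If the SIGHUP is seen on an iteration of its own, feedback goes out; if the walreceiver only re-enters the loop after SIGTERM has also arrived, it exits without ever sending one.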

We've seen just a tiny number of failures at this point in the test
--- scraping the buildfarm database, I find these:

sysname | snapshot | stage | l
--------------+---------------------+---------------+------------------------------------------------------------------------------
longfin | 2017-09-26 14:39:05 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 155.
flaviventris | 2018-04-07 17:02:01 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 155.
serinus | 2018-05-18 20:14:01 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 155.
idiacanthus | 2018-06-23 12:46:01 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 157.
francolin | 2018-06-29 13:30:01 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 157.
flaviventris | 2018-07-12 07:36:01 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 157.
(6 rows)

So it's clearly a very low-probability race condition. My hypothesis
is that the SIGHUP arrives, but the walreceiver process is already in
the read-and-process-data part of its loop, and the timing is such that
it manages to stay there until we issue the standby2 shutdown command.
When it finally notices that it has no more data available, it iterates
the outermost loop, and then shuts down without ever having sent a
feedback message. The "replay_check()" call in the test script doesn't
ensure any synchronization here, since it only verifies that WAL data
has been applied, not that any feedback has happened.

I can reproduce the failure pretty reliably with a hack like the one
attached, which makes it unlikely that the walreceiver will send a
feedback message instantly when it gets the SIGHUP.

So the issue boils down to this: the test script is, effectively,
assuming that it's guaranteed that the walreceiver will send a feedback
message before it shuts down; but there is no such guarantee. Is this
a bug in the test script, or a bug in the walreceiver logic? I can
see the value of having such a guarantee, but I also think it would be
nigh impossible to make it a bulletproof guarantee. We could perhaps
add "XLogWalRcvSendHSFeedback(true)" somewhere closer to process exit,
but that might add more failure modes than it removes.
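That first alternative amounts to a best-effort send in the shutdown path. A Python sketch of the shape (function names are hypothetical; this is only an analogy to calling XLogWalRcvSendHSFeedback(true) near process exit, not the actual walreceiver code):

```python
def walreceiver_main(recv_data, send_feedback, shutdown_requested):
    """Sketch: guarantee a best-effort final feedback message on the
    way out, whatever path led to the exit."""
    try:
        while not shutdown_requested():
            recv_data()
    finally:
        try:
            send_feedback()   # best effort: the link may already be gone
        except ConnectionError:
            pass              # exactly the kind of new failure mode feared
```

The swallowed-exception branch is the worry expressed above: the connection may already be broken at exit, so the "guarantee" can still silently fail.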

Or we could change the test script to wait for feedback before it
issues the shutdown, but then I think the test is a bit pointless.

Comments?

regards, tom lane

Attachments:

delay-hs-feedback.patch (text/x-diff, +2/-2)
#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#5)
Re: YA race condition in 001_stream_rep.pl (was Re: pgsql: Allow using the updated tuple while moving it to a different par)

On Sat, Jul 14, 2018 at 11:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

I can reproduce the failure pretty reliably with a hack like the one
attached, which makes it unlikely that the walreceiver will send a
feedback message instantly when it gets the SIGHUP.

So the issue boils down to this: the test script is, effectively,
assuming that it's guaranteed that the walreceiver will send a feedback
message before it shuts down; but there is no such guarantee. Is this
a bug in the test script, or a bug in the walreceiver logic? I can
see the value of having such a guarantee, but I also think it would be
nigh impossible to make it a bulletproof guarantee. We could perhaps
add "XLogWalRcvSendHSFeedback(true)" somewhere closer to process exit,
but that might add more failure modes than it removes.

Or we could change the test script to wait for feedback before it
issues the shutdown, but then I think the test is a bit pointless.

Currently, neither the code nor our documentation suggests that we can
expect HSFeedback before the shutdown, so it is better to adjust the
test. If the sole purpose of the test is to check feedback after
shutdown, then it is worth considering removing the test altogether.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#6)
Re: YA race condition in 001_stream_rep.pl (was Re: pgsql: Allow using the updated tuple while moving it to a different par)

Amit Kapila <amit.kapila16@gmail.com> writes:

On Sat, Jul 14, 2018 at 11:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So the issue boils down to this: the test script is, effectively,
assuming that it's guaranteed that the walreceiver will send a feedback
message before it shuts down; but there is no such guarantee. Is this
a bug in the test script, or a bug in the walreceiver logic? I can
see the value of having such a guarantee, but I also think it would be
nigh impossible to make it a bulletproof guarantee. We could perhaps
add "XLogWalRcvSendHSFeedback(true)" somewhere closer to process exit,
but that might add more failure modes than it removes.

Or we could change the test script to wait for feedback before it
issues the shutdown, but then I think the test is a bit pointless.

We've seen two more instances of this failure in the buildfarm since
this exchange (dragonet and flaviventris --- seems like Andres' menagerie
is particularly prone to it).

Currently, neither the code nor our documentation suggests that we can
expect HSFeedback before the shutdown, so it is better to adjust the
test. If the sole purpose of the test is to check feedback after
shutdown, then it is worth considering removing the test altogether.

Hearing nobody speaking in favor of the other alternatives,
I've removed the test.

regards, tom lane

#8 Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#7)
Re: YA race condition in 001_stream_rep.pl (was Re: pgsql: Allow using the updated tuple while moving it to a different par)

On Wed, Jul 18, 2018 at 05:42:40PM -0400, Tom Lane wrote:

Hearing nobody speaking in favor of the other alternatives,
I've removed the test.

I had this problem running in the background for a couple of days, but I
could not come up with a solution cleaner than just removing the test.
Thanks for doing so.
--
Michael