Timing-sensitive case in src/test/recovery TAP tests
I've been experimenting with a change to pg_ctl, which I'll post
separately, to reduce its reaction time so that it reports success
more quickly after a wait for postmaster start/stop. I found one
case in "make check-world" that got a failure when I reduced the
reaction time to ~1ms. That's the very last test in 001_stream_rep.pl,
"cascaded slot xmin reset after startup with hs feedback reset", and
the cause appears to be that it's not allowing any delay time for a
replication slot's state to update after a postmaster restart.
This seems worth fixing independently of any possible code changes,
because it shows that this test could fail on a slow or overloaded
machine. I couldn't find any instances of such a failure in the
buildfarm archives, but that may have a lot to do with the fact that
owners of slow buildfarm animals are (mostly?) not running this test.
Some experimentation says that the minimum delay needed to make it
work reliably on my workstation is about 100ms, so a simple patch
along the lines of the attached might be good enough. I find this
approach conceptually dissatisfying, though, since it's still
potentially vulnerable to the failure under sufficient load.
I wonder if there is an easy way to improve that ... maybe convert
to something involving poll_query_until?
regards, tom lane
Attachments:
add-delay-in-time-sensitive-test.patchtext/x-diff; charset=us-ascii; name=add-delay-in-time-sensitive-test.patchDownload+2-0
On 26 June 2017 at 05:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I've been experimenting with a change to pg_ctl, which I'll post
separately, to reduce its reaction time so that it reports success
more quickly after a wait for postmaster start/stop. I found one
case in "make check-world" that got a failure when I reduced the
reaction time to ~1ms. That's the very last test in 001_stream_rep.pl,
"cascaded slot xmin reset after startup with hs feedback reset", and
the cause appears to be that it's not allowing any delay time for a
replication slot's state to update after a postmaster restart.This seems worth fixing independently of any possible code changes,
because it shows that this test could fail on a slow or overloaded
machine. I couldn't find any instances of such a failure in the
buildfarm archives, but that may have a lot to do with the fact that
owners of slow buildfarm animals are (mostly?) not running this test.Some experimentation says that the minimum delay needed to make it
work reliably on my workstation is about 100ms, so a simple patch
along the lines of the attached might be good enough. I find this
approach conceptually dissatisfying, though, since it's still
potentially vulnerable to the failure under sufficient load.
I wonder if there is an easy way to improve that ... maybe convert
to something involving poll_query_until?
This should do the trick:
$node_standby_1->poll_query_until('postgres', q[SELECT xmin IS NULL
from pg_replication_slots WHERE slot_name = '] . $slotname_2 . q[']);
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Jun 26, 2017 at 10:48 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
$node_standby_1->poll_query_until('postgres', q[SELECT xmin IS NULL
from pg_replication_slots WHERE slot_name = '] . $slotname_2 . q[']);
+1 for avoiding a sleep call if it is not necessary. Fast platforms
would always pay a cost on that, and slow platforms would wait 1s (or
more!) when polling for the result.
Could it be possible to remove as well the second sleep(2) call in
this test please? This could be replaced by a similar poll using the
query Craig has just given as this needs to wait until the xmon is
cleared. Other tests expect a value so this poll cannot happen in
get_slot_xmins.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Michael Paquier <michael.paquier@gmail.com> writes:
On Mon, Jun 26, 2017 at 10:48 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
$node_standby_1->poll_query_until('postgres', q[SELECT xmin IS NULL
from pg_replication_slots WHERE slot_name = '] . $slotname_2 . q[']);
+1 for avoiding a sleep call if it is not necessary. Fast platforms
would always pay a cost on that, and slow platforms would wait 1s (or
more!) when polling for the result.
Could it be possible to remove as well the second sleep(2) call in
this test please?
Yes, I'd like to see those fixed sleeps go away too. Want to work
on a concrete patch?
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 26 June 2017 at 10:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
On Mon, Jun 26, 2017 at 10:48 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
$node_standby_1->poll_query_until('postgres', q[SELECT xmin IS NULL
from pg_replication_slots WHERE slot_name = '] . $slotname_2 . q[']);+1 for avoiding a sleep call if it is not necessary. Fast platforms
would always pay a cost on that, and slow platforms would wait 1s (or
more!) when polling for the result.Could it be possible to remove as well the second sleep(2) call in
this test please?Yes, I'd like to see those fixed sleeps go away too. Want to work
on a concrete patch?
Attached.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
fix-timing-in-tap-001-stream-rep.plapplication/x-perl; name=fix-timing-in-tap-001-stream-rep.plDownload+4-2
On Mon, Jun 26, 2017 at 11:44 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 26 June 2017 at 10:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
On Mon, Jun 26, 2017 at 10:48 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
$node_standby_1->poll_query_until('postgres', q[SELECT xmin IS NULL
from pg_replication_slots WHERE slot_name = '] . $slotname_2 . q[']);+1 for avoiding a sleep call if it is not necessary. Fast platforms
would always pay a cost on that, and slow platforms would wait 1s (or
more!) when polling for the result.Could it be possible to remove as well the second sleep(2) call in
this test please?Yes, I'd like to see those fixed sleeps go away too. Want to work
on a concrete patch?Attached.
Thanks for the patch.
As long as we are on it, there is this code block in the test:
my ($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
is($xmin, '', 'non-cascaded slot xmin null with no hs_feedback');
is($catalog_xmin, '', 'non-cascaded slot xmin null with no hs_feedback');
($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
is($xmin, '', 'cascaded slot xmin null with no hs_feedback');
is($catalog_xmin, '', 'cascaded slot xmin null with no hs_feedback');
This should be more verbose as the 2nd and 4th test should say
"catalog xmin" instead of xmin.
Also, wouldn't it be better to poll as well node_standby_1's
pg_replication_slot on slotname_2? It would really seem better to make
the nullness check conditional in get_slot_xmins instead. Sorry for
changing opinion here.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 26 June 2017 at 11:06, Michael Paquier <michael.paquier@gmail.com> wrote:
As long as we are on it, there is this code block in the test:
my ($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
is($xmin, '', 'non-cascaded slot xmin null with no hs_feedback');
is($catalog_xmin, '', 'non-cascaded slot xmin null with no hs_feedback');($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
is($xmin, '', 'cascaded slot xmin null with no hs_feedback');
is($catalog_xmin, '', 'cascaded slot xmin null with no hs_feedback');This should be more verbose as the 2nd and 4th test should say
"catalog xmin" instead of xmin.
Agree, should. Mind posting a revision?
Also, wouldn't it be better to poll as well node_standby_1's
pg_replication_slot on slotname_2? It would really seem better to make
the nullness check conditional in get_slot_xmins instead. Sorry for
changing opinion here.
I'm not sure I understand this.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Jun 26, 2017 at 12:12 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 26 June 2017 at 11:06, Michael Paquier <michael.paquier@gmail.com> wrote:
As long as we are on it, there is this code block in the test:
my ($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
is($xmin, '', 'non-cascaded slot xmin null with no hs_feedback');
is($catalog_xmin, '', 'non-cascaded slot xmin null with no hs_feedback');($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
is($xmin, '', 'cascaded slot xmin null with no hs_feedback');
is($catalog_xmin, '', 'cascaded slot xmin null with no hs_feedback');This should be more verbose as the 2nd and 4th test should say
"catalog xmin" instead of xmin.Agree, should. Mind posting a revision?
Sure.
Also, wouldn't it be better to poll as well node_standby_1's
pg_replication_slot on slotname_2? It would really seem better to make
the nullness check conditional in get_slot_xmins instead. Sorry for
changing opinion here.I'm not sure I understand this.
The patch attached may explain that better. Your patch added 3 poll
phases. I think that a 4th is needed. At the same time I have found
cleaner to put the poll calls into a small wrapper.
--
Michael
Attachments:
fix-timing-in-tap-001-stream-rep-v2.pltext/x-perl-script; charset=US-ASCII; name=fix-timing-in-tap-001-stream-rep-v2.plDownload+39-20
Michael Paquier <michael.paquier@gmail.com> writes:
On Mon, Jun 26, 2017 at 12:12 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
I'm not sure I understand this.
The patch attached may explain that better. Your patch added 3 poll
phases. I think that a 4th is needed. At the same time I have found
cleaner to put the poll calls into a small wrapper.
Looks good as far as it goes, but I wonder whether any of the other
get_slot_xmins calls need polling too. Don't feel a need to add such
calls until someone exhibits a failure there, but I won't be very
surprised if someone does.
Anyway, pushed this patch with minor editing. Thanks!
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Jun 27, 2017 at 3:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Looks good as far as it goes, but I wonder whether any of the other
get_slot_xmins calls need polling too. Don't feel a need to add such
calls until someone exhibits a failure there, but I won't be very
surprised if someone does.
I got the same thought, wondering as well if get_slot_xmins should be
renamed check_slot_xmins with the is() tests moved inside it as well.
Not sure if that's worth the API ugliness though.
Anyway, pushed this patch with minor editing. Thanks!
Thanks.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Michael Paquier <michael.paquier@gmail.com> writes:
On Tue, Jun 27, 2017 at 3:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Looks good as far as it goes, but I wonder whether any of the other
get_slot_xmins calls need polling too. Don't feel a need to add such
calls until someone exhibits a failure there, but I won't be very
surprised if someone does.
And behold, we have here
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-08-08%2020%3A54%3A09
# Failed test 'xmin of cascaded slot non-null with hs feedback'
# at t/001_stream_rep.pl line 224.
# got: ''
# expected: anything else
That's one of only four calls lacking a preceding wait_slot_xmins call.
The ones at lines 173 and 177 seem safe because nothing has happened yet.
I think the one at line 300 should be safe because the standby_2 server is
shut down at that point, but is there any way that the standby_1's view
hasn't updated by the time that happens?
I got the same thought, wondering as well if get_slot_xmins should be
renamed check_slot_xmins with the is() tests moved inside it as well.
Not sure if that's worth the API ugliness though.
Mmm, doesn't seem like that's worth doing, but I'm half tempted to merge
wait_slot_xmins into get_slot_xmins so you can't skip it ...
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Aug 8, 2017 at 11:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
I got the same thought, wondering as well if get_slot_xmins should be
renamed check_slot_xmins with the is() tests moved inside it as well.
Not sure if that's worth the API ugliness though.Mmm, doesn't seem like that's worth doing, but I'm half tempted to merge
wait_slot_xmins into get_slot_xmins so you can't skip it ...
Let's do that please. Merging both was my first feeling when
refactoring this test upthread. Should I send a patch?
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Michael Paquier <michael.paquier@gmail.com> writes:
On Tue, Aug 8, 2017 at 11:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
I got the same thought, wondering as well if get_slot_xmins should be
renamed check_slot_xmins with the is() tests moved inside it as well.
Not sure if that's worth the API ugliness though.
Mmm, doesn't seem like that's worth doing, but I'm half tempted to merge
wait_slot_xmins into get_slot_xmins so you can't skip it ...
Let's do that please. Merging both was my first feeling when
refactoring this test upthread. Should I send a patch?
Sure, have at it.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Aug 9, 2017 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
On Tue, Aug 8, 2017 at 11:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
I got the same thought, wondering as well if get_slot_xmins should be
renamed check_slot_xmins with the is() tests moved inside it as well.
Not sure if that's worth the API ugliness though.Mmm, doesn't seem like that's worth doing, but I'm half tempted to merge
wait_slot_xmins into get_slot_xmins so you can't skip it ...Let's do that please. Merging both was my first feeling when
refactoring this test upthread. Should I send a patch?Sure, have at it.
And here you go.
--
Michael
Attachments:
tap-simplify-repslot.patchapplication/octet-stream; name=tap-simplify-repslot.patchDownload+27-40
Michael Paquier <michael.paquier@gmail.com> writes:
On Wed, Aug 9, 2017 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
Let's do that please. Merging both was my first feeling when
refactoring this test upthread. Should I send a patch?
Sure, have at it.
And here you go.
Pushed with a bit of work on the comments.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Aug 13, 2017 at 1:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
On Wed, Aug 9, 2017 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
Let's do that please. Merging both was my first feeling when
refactoring this test upthread. Should I send a patch?Sure, have at it.
And here you go.
Pushed with a bit of work on the comments.
Thanks.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers