Add isolation test template in injection_points for wait/wakeup/detach
Hi all,
This was on my TODO bucket for some time. The isolation test
"inplace" in the isolation test suite for injection_points relies on a
couple of behaviors implemented in the backend. One of them is that a
point detach should not affect a wait. Rather than having to guess
this stuff from the sole isolation test we have in this module, I
would like to add a simple test to document all these assumptions.
This also works as a simple template for a newcomer willing to
implement an isolation test with injection points. That's less
intimidating than the inplace test with its VACUUM and GRANT
interactions, based on an injection point run when updating the stats
of a relation in VACUUM.
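To give an idea, the skeleton of such a test could look roughly like
this (a sketch only; the injection point name and step names are
illustrative, not necessarily what the attached patch uses):

    setup
    {
        CREATE EXTENSION injection_points;
    }
    teardown
    {
        DROP EXTENSION injection_points;
    }

    # s1 attaches a point with the "wait" action, then waits on it.
    session s1
    setup { SELECT injection_points_attach('injection-points-wait', 'wait'); }
    step wait1 { SELECT injection_points_run('injection-points-wait'); }

    # s2 wakes up the waiter, or detaches the point.
    session s2
    step wakeup2 { SELECT injection_points_wakeup('injection-points-wait'); }
    step detach2 { SELECT injection_points_detach('injection-points-wait'); }

    permutation wait1 detach2 wakeup2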
Thoughts?
--
Michael
Attachments:
0001-injection_points-Add-basic-isolation-test.patch (text/x-diff, +111/-2)
Hi,
On Fri, Oct 18, 2024 at 07:44:08AM +0900, Michael Paquier wrote:
Hi all,
This was on my TODO bucket for some time. The isolation test
"inplace" in the isolation test suite for injection_points relies on a
couple of behaviors implemented in the backend. One of them is that a
point detach should not affect a wait. Rather than having to guess
this stuff from the sole isolation test we have in this module, I
would like to add a simple test to document all these assumptions.
This also works as a simple template for a newcomer willing to
implement an isolation test with injection points. That's less
intimidating than the inplace test with its VACUUM and GRANT
interactions, based on an injection point run when updating the stats
of a relation in VACUUM.
Thoughts?
I think that it makes sense to "expose" the basics in a dedicated test.
I thought that template.spec could have been another naming option, but
basic.spec is better, I think.
The test is pretty straightforward, so I don't have much to say, just one nit:
1 ===
+# Detach before wait does not cause a wait.
What about?
Detach before wait does not cause a wait (and so wakeup produces an error)
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Oct 18, 2024 at 07:58:26AM +0000, Bertrand Drouvot wrote:
I think that it makes sense to "expose" the basics in a dedicated test.
I thought that template.spec could have been another "naming" option but
basic.spec is better I think.
I am not wedded to the name I have proposed. No comment on my naming
sense..
+# Detach before wait does not cause a wait.
What about?
Detach before wait does not cause a wait (and so wakeup produces an error)
Sure, why not.
--
Michael
On Fri, Oct 18, 2024 at 06:16:50PM +0900, Michael Paquier wrote:
Sure, why not.
Used "basic" as test name at the end, tweaked the comment close to
what you have suggested, and applied the result.
Thanks for the review.
--
Michael
On Mon, Oct 21, 2024 at 11:13:59AM +0900, Michael Paquier wrote:
Used "basic" as test name at the end, tweaked the comment close to
what you have suggested, and applied the result.
I have spotted a race condition with this test in some CF bot runs.
Here are some of them:
https://github.com/michaelpq/postgres/runs/31984161747
https://cirrus-ci.com/task/6741385749987328
https://cirrus-ci.com/task/6056918689513472
https://cirrus-ci.com/task/5863397261049856
https://cirrus-ci.com/task/5565235765968896
https://cirrus-ci.com/task/6238594598174720
All of them show up as follows in regression.diffs:
diff -U3
/tmp/cirrus-ci-build/src/test/modules/injection_points/expected/basic.out
/tmp/cirrus-ci-build/build/testrun/injection_points/isolation/results/basic.out
---
/tmp/cirrus-ci-build/src/test/modules/injection_points/expected/basic.out
2024-10-24 02:36:24.068590000 +0000
+++
/tmp/cirrus-ci-build/build/testrun/injection_points/isolation/results/basic.out
2024-10-24 02:40:09.179695000 +0000
@@ -13,16 +13,16 @@
(1 row)
-step wait1: <... completed>
-injection_points_run
---------------------
-
-(1 row)
-
step detach2: SELECT injection_points_detach('injection-points-wait');
injection_points_detach
-----------------------
+(1 row)
+
+step wait1: <... completed>
+injection_points_run
+--------------------
+
(1 row)
This is in the first permutation of the test done with "wait1 wakeup2
detach2", and the diff means that the backend running the "wait"
callback is reported as finished after the detach is done
(injection_points_run being only used for the waits). Hence the wait is
so slow to finish that the detach has time to complete and report first,
breaking the output.
And here comes the puzzling part: all of the failures involve FreeBSD 13
in the CI. I first thought that reproducing this failure would not be
difficult: we can add a hardcoded pg_usleep() to delay the end of
injection_wait() so as to make sure that the backend doing the wait
reports later than the detach. Or just do the same at the end of
injection_points_run() once the callback exits. I've certainly done
that, placing some strategic pg_usleep() calls on Linux to make the
paths that matter in the wait slower, but the test remains stable. The
CI on Linux is stable as well: the 3rd and 4th columns of the cfbot are
green, and I did not spot any failures related to this isolation test in
injection_points. Only the second column, about FreeBSD, is going rogue
on a periodic basis.
One fix would be to remove this first permutation test, but that's
hiding the problem rather than solving it. Could something be wrong
with condition variables, specifically on FreeBSD? Hence, I'm puzzled
by that.
I am going to remove the first permutation anyway to stabilize the CI
reports. But it feels like we have a different problem here, and I am
not sure what.
Any thoughts or comments?
--
Michael
On Fri, Oct 25, 2024 at 09:33:12AM +0900, Michael Paquier wrote:
I am going to remove the first permutation anyway to stabilize the CI
reports. But it feels like we have a different problem here, and I am
not sure what.
A few days later, I have looked at the full set of CF bot jobs and
9f00edc22888 has stabilized the CI. We went down from 20-ish failures
in the whole set of patches to 0.
I still don't have a clear idea about why this is limited to FreeBSD
13 and only for this permutation. The buildfarm did not report
anything, except if I've missed something, of course.
--
Michael
Hi,
On Fri, Oct 25, 2024 at 09:33:12AM +0900, Michael Paquier wrote:
This is in the first permutation of the test done with "wait1 wakeup2
detach2", and the diff means that the backend running the "wait"
callback is reported as finished after the detach is done
(injection_points_run being only used for the waits). Hence the wait is
so slow to finish that the detach has time to complete and report first,
breaking the output.

Yeah, I agree with your analysis.

And here comes the puzzling part: all of the failures involve FreeBSD 13
in the CI. I first thought that reproducing this failure would not be
difficult: we can add a hardcoded pg_usleep() to delay the end of
injection_wait() so as to make sure that the backend doing the wait
reports later than the detach. Or just do the same at the end of
injection_points_run() once the callback exits. I've certainly done
that, placing some strategic pg_usleep() calls on Linux to make the
paths that matter in the wait slower, but the test remains stable.
Right, I did the same and observed the exact same behavior.
Still, it's possible to observe the s1 wait finishing after the s2
detach by running this test manually (meaning: create two sessions and
run the test commands by hand) and:
1. attach a debugger to the first session (say with a breakpoint in injection_wait()).
or
2. add a hardcoded pg_usleep() in injection_wait().
So I think that the s1 wait finishing after the s2 detach is possible if
session 1 is "frozen" (gdb case) or slow enough (pg_usleep() case).
The CI on Linux is stable as well: the 3rd and 4th columns of the cfbot
are green, and I did not spot any failures related to this isolation
test in injection_points. Only the second column, about FreeBSD, is
going rogue on a periodic basis.

One fix would be to remove this first permutation test, but that's
hiding the problem rather than solving it. Could something be wrong
with condition variables, specifically on FreeBSD?
Hum, we would probably observe other failures in other tests then, no?
Any thoughts or comments?
I think that we cannot be 100% sure that the s1 wait will finish before
the s2 detach (easily reproducible with gdb attached to s1 or a
hardcoded sleep) and that other OSes could also report the test as
failing for the same reason.
It's not ideal, but instead of removing this first permutation test,
what about adding a "sleep2" step (doing, say, SELECT pg_sleep(1);) and
calling this new step before the detach2 one?
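In spec terms, something like this is what I have in mind (a sketch;
the step name is illustrative):

    session s2
    step sleep2 { SELECT pg_sleep(1); }

    permutation wait1 wakeup2 sleep2 detach2

The sleep gives the s1 wait some extra time to report its completion
before the detach runs.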
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Oct 28, 2024 at 07:17:28AM +0000, Bertrand Drouvot wrote:
I think that we cannot be 100% sure that the s1 wait will finish before
the s2 detach (easily reproducible with gdb attached to s1 or a
hardcoded sleep) and that other OSes could also report the test as
failing for the same reason.
Yes, the only safe thing we can do in this test is to let the wakeup2
be last, as we are sure that the isolationtester is going to stay on s2
until the call of the wakeup function finishes before moving on to
checking the end of the wait.
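In spec terms, the stable ordering would be something like this (a
sketch, reusing the step names discussed in this thread):

    # Safe: the wakeup is issued last, so the wakeup call has already
    # returned by the time isolationtester checks wait1's completion.
    permutation wait1 detach2 wakeup2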
It's not ideal, but instead of removing this first permutation test,
what about adding a "sleep2" step (doing, say, SELECT pg_sleep(1);) and
calling this new step before the detach2 one?
There is no real guarantee of stability. Under a wait of N seconds,
we could still have environments where the wait() could remain stuck
more than N seconds between the moment the condition variable is woken
up and the result of the wait() is reported back to the client. And
hardcoded sleeps make the test slower even on fast machines.
What we have here seems like just contention of Cirrus with the
FreeBSD hosts while there is much more stability with the linux hosts.
--
Michael
Hi,
On Tue, Oct 29, 2024 at 01:12:25PM +0900, Michael Paquier wrote:
On Mon, Oct 28, 2024 at 07:17:28AM +0000, Bertrand Drouvot wrote:
I think that we cannot be 100% sure that the s1 wait will finish before
the s2 detach (easily reproducible with gdb attached to s1 or a
hardcoded sleep) and that other OSes could also report the test as
failing for the same reason.

Yes, the only safe thing we can do in this test is to let the wakeup2
be last, as we are sure that the isolationtester is going to stay on s2
until the call of the wakeup function finishes before moving on to
checking the end of the wait.
Agree.
It's not ideal, but instead of removing this first permutation test,
what about adding a "sleep2" step (doing, say, SELECT pg_sleep(1);) and
calling this new step before the detach2 one?

There is no real guarantee of stability.
Right.
Under a wait of N seconds,
we could still have environments where the wait() could remain stuck
more than N seconds between the moment the condition variable is woken
up and the result of the wait() is reported back to the client. And
hardcoded sleeps make the test slower even on fast machines.
Yeah, not ideal so better to keep the test as it is currently (after your
refactoring).
What we have here seems like just contention of Cirrus with the
FreeBSD hosts while there is much more stability with the linux hosts.
That sounds like it, agree.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Oct 29, 2024 at 07:59:31AM +0000, Bertrand Drouvot wrote:
On Tue, Oct 29, 2024 at 01:12:25PM +0900, Michael Paquier wrote:
Under a wait of N seconds,
we could still have environments where the wait() could remain stuck
more than N seconds between the moment the condition variable is woken
up and the result of the wait() is reported back to the client. And
hardcoded sleeps make the test slower even on fast machines.

Yeah, not ideal, so better to keep the test as it is currently (after your
refactoring).
Backpedalling a bit on that... Perhaps it would be useful to keep the
permutation as disabled in the test and to update the comment in
basic.spec to say that this had better be avoided when implementing a
test if the timing of the return value of a SQL wait() cannot be
strictly controlled? And also to say that the wakeup() should be put
last in simple permutations? The point of this spec is also to
document some good and bad practices.
--
Michael
Hi,
On Tue, Oct 29, 2024 at 05:20:42PM +0900, Michael Paquier wrote:
On Tue, Oct 29, 2024 at 07:59:31AM +0000, Bertrand Drouvot wrote:
On Tue, Oct 29, 2024 at 01:12:25PM +0900, Michael Paquier wrote:
Under a wait of N seconds,
we could still have environments where the wait() could remain stuck
more than N seconds between the moment the condition variable is woken
up and the result of the wait() is reported back to the client. And
hardcoded sleeps make the test slower even on fast machines.

Yeah, not ideal, so better to keep the test as it is currently (after your
refactoring).

Backpedalling a bit on that... Perhaps it would be useful to keep the
permutation as disabled in the test and to update the comment in
basic.spec to say that this had better be avoided when implementing a
test if the timing of the return value of a SQL wait() cannot be
strictly controlled? And also to say that the wakeup() should be put
last in simple permutations? The point of this spec is also to
document some good and bad practices.
Yeah, agree that it would make sense to document in the test what has been
discovered here.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Oct 29, 2024 at 02:06:11PM +0000, Bertrand Drouvot wrote:
Yeah, agree that it would make sense to document in the test what has been
discovered here.
How about something like the attached?
--
Michael
Attachments:
inj-permutation-comment.patch (text/x-diff, +5/-1)
Hi,
On Wed, Oct 30, 2024 at 12:40:05PM +0900, Michael Paquier wrote:
On Tue, Oct 29, 2024 at 02:06:11PM +0000, Bertrand Drouvot wrote:
Yeah, agree that it would make sense to document in the test what has been
discovered here.

How about something like the attached?
Thanks!
A few comments:
1 === Nit
s/Permutations like this one/Permutations like the following commented out one/ ?
2 ===
s/back its result/back its result once woken up/ ?
3 ===
s/faster than the wait/before the SQL function doing the wait returns its result/ ?
With 2 === and 3 === I think it's easier to "link" both sentences that way.
Thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Oct 30, 2024 at 07:26:07AM +0000, Bertrand Drouvot wrote:
1 === Nit
s/Permutations like this one/Permutations like the following commented out one/ ?
2 ===
s/back its result/back its result once woken up/ ?
3 ===
s/faster than the wait/before the SQL function doing the wait returns its result/ ?
With 2 === and 3 === I think it's easier to "link" both sentences that way.
What do you think about the updated version attached?
--
Michael
Attachments:
inj-permutation-comment-v2.patch (text/x-diff, +4/-1)
Hi,
On Wed, Oct 30, 2024 at 06:21:04PM +0900, Michael Paquier wrote:
On Wed, Oct 30, 2024 at 07:26:07AM +0000, Bertrand Drouvot wrote:
1 === Nit
s/Permutations like this one/Permutations like the following commented out one/ ?
2 ===
s/back its result/back its result once woken up/ ?
3 ===
s/faster than the wait/before the SQL function doing the wait returns its result/ ?
With 2 === and 3 === I think it's easier to "link" both sentences that way.
What do you think about the updated version attached?
Thanks! LGTM.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Oct 30, 2024 at 09:27:13AM +0000, Bertrand Drouvot wrote:
Thanks! LGTM.
Thanks for the review. Applied that, then.
--
Michael
On Thu, Oct 31, 2024 at 08:47:04AM +0900, Michael Paquier wrote:
Thanks for the review. Applied that, then.
I was looking at src/test/isolation/README, and based on what is
described for parenthesized markers (which I didn't know about, so we
learn a new thing every day), it is possible to force a strict
ordering of the test. The quote that matters:
"A marker consisting solely of a step name indicates that this step
may not be reported as completing until that other step has
completed."
In our case, the problem that caused 9f00edc22888 is that the "wait"
step could be so slow that the "detach" step could report before the
"wait" step.
So I'd like to propose that we revisit 9f00edc22888, and re-enable the
permutation with a marker so that the "detach" step waits for the "wait"
step to complete before printing its own report, as per the attached.
That should make the permutation safe to use.
The output of the attached patch is the same as what has been removed
in 9f00edc22888 except that the permutation is changed from that:
permutation wait1 wakeup2 detach2
To that, making sure that detach2 report happens only after wait1
has returned its own result:
permutation wait1 wakeup2 detach2(wait1)
Thoughts?
--
Michael
Attachments:
inj-isolation-test.patch (text/x-diff, +31/-6)
Hi,
On Wed, Feb 05, 2025 at 03:12:53PM +0900, Michael Paquier wrote:
On Thu, Oct 31, 2024 at 08:47:04AM +0900, Michael Paquier wrote:
Thanks for the review. Applied that, then.
I was looking at src/test/isolation/README, and based on what is
described for parenthesized markers (which I didn't know about so we
learn a new thing every day), it is possible to force a strict
ordering of the test. Quote that matters:
"A marker consisting solely of a step name indicates that this step
may not be reported as completing until that other step has
completed."
Oh, I did not know about it either (thanks for sharing).
In our case, the problem that caused 9f00edc22888 is that the "wait"
step could be so slow that the "detach" step could report before the
"wait" step.
Yup.
So I'd like to propose that we revisit 9f00edc22888, and re-enable the
permutation with a marker so that the "detach" step waits for the "wait"
step to complete before printing its own report, as per the attached.
That should make the permutation safe to use.

The output of the attached patch is the same as what has been removed
in 9f00edc22888 except that the permutation is changed from that:
permutation wait1 wakeup2 detach2
To that, making sure that the detach2 report happens only after wait1
has returned its own result:
permutation wait1 wakeup2 detach2(wait1)

Thoughts?
I think that makes sense and the patch LGTM.
A few tests are already using this technique (including injection_points in
inplace.spec).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Feb 05, 2025 at 11:03:25AM +0000, Bertrand Drouvot wrote:
A few tests are already using this technique (including injection_points in
inplace.spec).
Right. I've just re-enabled the permutation. Now let's keep track
of the activity in the CI and the buildfarm.
--
Michael
On Wed, Feb 05, 2025 at 11:03:25AM +0000, Bertrand Drouvot wrote:
I think that makes sense and the patch LGTM.
A few tests are already using this technique (including injection_points in
inplace.spec).
Well, that's not the end of the story. I have brushed through the CF
bot activity and noticed this one:
https://cirrus-ci.com/task/6565511640645632
That seems rather hard to reach, but I also got it in one of my own
runs:
--- /tmp/cirrus-ci-build/src/test/modules/injection_points/expected/basic.out 2025-02-06 23:53:40.838077000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/injection_points/isolation/results/basic.out 2025-02-06 23:56:21.848507000 +0000
@@ -13,13 +13,14 @@
(1 row)
+step detach2: SELECT injection_points_detach('injection-points-wait'); <waiting ...>
step wait1: <... completed>
injection_points_run
--------------------
(1 row)
-step detach2: SELECT injection_points_detach('injection-points-wait');
+step detach2: <... completed>
injection_points_detach
-----------------------
This is telling us that the detach step could be seen as waiting by
the isolation tester before the wait phase reports for completion. I
didn't think it would be possible to get that, but well, we do.
A marker like detach2(wait1) is not enough to cover that, as it only
ensures the order of the step completion output. Using detach2(*),
which would cause the detach2 step to show as <waiting> immediately, is
not good either, as the wait could always complete between the detach's
<waiting> and <completed>.
There is a stronger trick mentioned at the end of the README that
should be able to solve this new problem as well as the original one:
an empty step between the wait and the detach. If we do that, the
detach will never be launched until the wait has fully completed,
bringing a stronger ordering of the events: we should never see the
detach as waiting like in this new problem, nor would we see the first
problem where the wait would report its result after the detach.
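If I read the README correctly, the spec change would look something
like this (a sketch; the name of the empty step is illustrative):

    session s1
    step wait1 { SELECT injection_points_run('injection-points-wait'); }
    # Empty step acting as a barrier: being on the same session as
    # wait1, it cannot be launched (nor can any later step) until
    # wait1 has fully completed.
    step noop1 { }

    permutation wait1 wakeup2 noop1 detach2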
I have done a total of 10 runs in the CI with the attached, without
getting a failure. HEAD was failing a bit more easily than that, with
at least one failure every 5 runs in my branches. Will go adjust that
in a bit as per the attached.
--
Michael