postgres_fdw test timeouts
I noticed that the postgres_fdw test periodically times out on Windows:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-11-10%2003%3A12%3A58
https://cirrus-ci.com/task/5504294095421440
https://cirrus-ci.com/task/4814111003901952
https://cirrus-ci.com/task/5343037535027200
https://cirrus-ci.com/task/5893655966253056
https://cirrus-ci.com/task/4568159320014848
https://cirrus-ci.com/task/5238067850641408
(and many more from cfbot)
From a quick sampling, the relevant test logs end with either
ERROR: server "unknownserver" does not exist
STATEMENT: SELECT postgres_fdw_disconnect('unknownserver');
or
ERROR: relation "public.non_existent_table" does not exist
CONTEXT: remote SQL command: SELECT a, b, c FROM public.non_existent_table
STATEMENT: SELECT * FROM async_pt;
before the test seems to hang.
AFAICT the failures began around September 10th, which leads me to wonder
if this is related to commit 04a09ee. That is little more than a wild
guess, though. I haven't been able to deduce much else from the logs I can
find, and I didn't find any previous reports about this in the archives
after lots of searching, so I thought I'd at least park these notes here in
case anyone else has ideas.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 1, 2023 at 9:38 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
AFAICT the failures began around September 10th, which leads me to wonder
if this is related to commit 04a09ee. That is little more than a wild
guess, though. I haven't been able to deduce much else from the logs I can
find, and I didn't find any previous reports about this in the archives
after lots of searching, so I thought I'd at least park these notes here in
case anyone else has ideas.
Thanks for finding this correlation. Yeah, poking around in the cfbot
history database I see about 1 failure like that per day since that
date, and there doesn't seem to be anything else as obviously likely
to be related to wakeups and timeouts. I don't understand what's
wrong with the logic, and I think it would take someone willing to
debug it locally to figure that out. Unless someone has an idea, I'm
leaning towards reverting that commit and leaving the relatively minor
problem that it was intended to fix as a TODO.
Hello Thomas,
03.12.2023 02:48, Thomas Munro wrote:
Thanks for finding this correlation. Yeah, poking around in the cfbot
history database I see about 1 failure like that per day since that
date, and there doesn't seem to be anything else as obviously likely
to be related to wakeups and timeouts. I don't understand what's
wrong with the logic, and I think it would take someone willing to
debug it locally to figure that out. Unless someone has an idea, I'm
leaning towards reverting that commit and leaving the relatively minor
problem that it was intended to fix as a TODO
I've managed to reproduce the failure locally when running postgres_fdw_x/
regress in parallel (--num-processes 10). It reproduced for me on
04a09ee94 (iterations 1, 2, 4), but not on 04a09ee94~1 (30 iterations
passed).
I'm going to investigate this case in the coming days. Maybe we can find a
better fix for the issue.
Best regards,
Alexander
On Sun, Dec 3, 2023 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
I've managed to reproduce the failure locally when running postgres_fdw_x/
regress in parallel (--num-processes 10). It reproduced for me on
04a09ee94 (iterations 1, 2, 4), but not on 04a09ee94~1 (30 iterations
passed). I'm going to investigate this case in the coming days. Maybe we can
find a better fix for the issue.
Thanks. One thing I can recommend to anyone trying to understand the
change is to view it with:
git show --ignore-all-space 04a09ee
... because it changed a lot of indentation when wrapping a bunch of
stuff in a new for loop.
Hello Thomas,
03.12.2023 09:00, Thomas Munro wrote:
On Sun, Dec 3, 2023 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
I've managed to reproduce the failure locally when running postgres_fdw_x/
regress in parallel (--num-processes 10). It reproduced for me on
04a09ee94 (iterations 1, 2, 4), but not on 04a09ee94~1 (30 iterations
passed). I'm going to investigate this case in the coming days. Maybe we can
find a better fix for the issue.
Thanks. One thing I can recommend to anyone trying to understand the
change is to view it with:
I think I've found out what's going on here.
The culprit is WSAEnumNetworkEvents() assisted by the non-trivial logic of
ExecAppendAsyncEventWait().
For the case noccurred > 1, ExecAppendAsyncEventWait() performs a loop,
where ExecAsyncNotify() is called for the first AsyncRequest, but the
second one also gets processed inside it, through a recursive call to
ExecAppendAsyncEventWait():
-> ExecAsyncNotify -> produce_tuple_asynchronously
-> ExecScan -> ExecInterpExpr -> ExecSetParamPlan -> ExecProcNodeFirst
-> ExecAgg -> agg_retrieve_direct -> ExecProcNodeInstr -> ExecAppend
-> ExecAppendAsyncEventWait.
Here we get into the first loop and call ExecAsyncConfigureWait() for the
second AsyncRequest (because we haven't reset its callback_pending yet),
which leads to creating another WaitEvent for the second socket inside
postgresForeignAsyncConfigureWait():
AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn), ...
This WaitEvent seemingly misses an event that we should get for that socket.
It's not that important to get noccurred > 1 in
ExecAppendAsyncEventWait() to see the failure; it's enough to call
WSAEnumNetworkEvents() inside WaitEventSetWaitBlock() for the second socket
(I tried to exit from WaitEventSetWaitBlock's new loop prematurely,
without touching occurred_events and returned_events on the second iteration
of the loop).
So it looks like we have the same issue with multiple event handles
associated with a single socket here.
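To show the pattern at the Winsock level, here is a standalone sketch (not
code from our tree; no real connection is set up, so it only illustrates the
sequence of calls, not the hang itself):

#include <winsock2.h>
#include <stdio.h>
#pragma comment(lib, "ws2_32.lib")

int
main(void)
{
    WSADATA     wsa;
    SOCKET      sock;
    WSAEVENT    outer_ev;   /* event used by the outer WaitEventSet */
    WSAEVENT    inner_ev;   /* event used by the recursively created one */
    WSANETWORKEVENTS ne;
    DWORD       rc;

    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        return 1;

    /* stand-in for the second connection's libpq socket */
    sock = socket(AF_INET, SOCK_STREAM, 0);
    outer_ev = WSACreateEvent();
    inner_ev = WSACreateEvent();

    /* The outer WaitEventSet associates its event object with the socket. */
    WSAEventSelect(sock, outer_ev, FD_READ);

    /*
     * The outer WaitEventSetWaitBlock() calls WSAEnumNetworkEvents(), which
     * reports and resets the socket's internal network event record.
     */
    WSAEnumNetworkEvents(sock, outer_ev, &ne);

    /*
     * The recursive ExecAppendAsyncEventWait() then registers the same
     * socket again, with a fresh event object; this cancels the previous
     * association.
     */
    WSAEventSelect(sock, inner_ev, FD_READ);

    /*
     * Per the analysis above, readiness consumed by the earlier enum is
     * apparently not re-reported for the new handle, so a wait on it can
     * hang even though data is sitting in the socket buffer.
     */
    rc = WaitForSingleObject(inner_ev, 1000);
    printf("inner wait: %s\n",
           rc == WAIT_TIMEOUT ? "timed out" : "signalled");

    WSACloseEvent(outer_ev);
    WSACloseEvent(inner_ev);
    closesocket(sock);
    WSACleanup();
    return 0;
}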
And v2-0003-Redesign-Windows-socket-event-management.patch from [1]
"surprisingly" helps in this case as well (I could not see a failure for
100 iterations of 10 tests in parallel).
[1]: /messages/by-id/CA+hUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ@mail.gmail.com
Best regards,
Alexander
On Thu, Dec 7, 2023 at 10:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
I think I've found out what's going on here.
The culprit is WSAEnumNetworkEvents() assisted by the non-trivial logic of
ExecAppendAsyncEventWait().
For the case noccurred > 1, ExecAppendAsyncEventWait() performs a loop,
where ExecAsyncNotify() is called for the first AsyncRequest, but the
second one also gets processed inside it, through a recursive call to
ExecAppendAsyncEventWait():
-> ExecAsyncNotify -> produce_tuple_asynchronously
-> ExecScan -> ExecInterpExpr -> ExecSetParamPlan -> ExecProcNodeFirst
-> ExecAgg -> agg_retrieve_direct -> ExecProcNodeInstr -> ExecAppend
-> ExecAppendAsyncEventWait.
Here we get into the first loop and call ExecAsyncConfigureWait() for the
second AsyncRequest (because we haven't reset its callback_pending yet),
which leads to creating another WaitEvent for the second socket inside
postgresForeignAsyncConfigureWait():
AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn), ...
Oh, wow. Nice detective work! Thank you for figuring that out.
So it looks like we have the same issue with multiple event handles
associated with a single socket here.
And v2-0003-Redesign-Windows-socket-event-management.patch from [1]
"surprisingly" helps in this case as well (I could not see a failure for
100 iterations of 10 tests in parallel).
Yeah, this makes perfect sense.
So, commit 04a09ee is not guilty. But as the saying goes, "no good
deed goes unpunished", and work on our Windows port seems to be
especially prone to backfiring when kludges combine...
Now we have the question of whether to go forwards (commit the "socket
table" thing), or backwards (revert 04a09ee for now to clear the CI
failures). I don't love the hidden complexity of the socket table and
am not in a hurry to commit it, but I don't currently see another
way... on the other hand we have other CI flapping due to that problem
too so reverting 04a09ee would be sweeping problems under the carpet.
I still need to process your feedback/discoveries on that other thread
and it may take a few weeks for me to get to it.
On Fri, Dec 08, 2023 at 09:55:58AM +1300, Thomas Munro wrote:
Oh, wow. Nice detective work! Thank you for figuring that out.
+1
Now we have the question of whether to go forwards (commit the "socket
table" thing), or backwards (revert 04a09ee for now to clear the CI
failures). I don't love the hidden complexity of the socket table and
am not in a hurry to commit it, but I don't currently see another
way... on the other hand we have other CI flapping due to that problem
too so reverting 04a09ee would be sweeping problems under the carpet.
I still need to process your feedback/discoveries on that other thread
and it may take a few weeks for me to get to it.
I don't think we need to revert 04a09ee provided the issue is unrelated and
a fix is in development.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
08.12.2023 02:02, Nathan Bossart wrote:
On Fri, Dec 08, 2023 at 09:55:58AM +1300, Thomas Munro wrote:
Now we have the question of whether to go forwards (commit the "socket
table" thing), or backwards (revert 04a09ee for now to clear the CI
failures). I don't love the hidden complexity of the socket table and
am not in a hurry to commit it, but I don't currently see another
way... on the other hand we have other CI flapping due to that problem
too so reverting 04a09ee would be sweeping problems under the carpet.
I still need to process your feedback/discoveries on that other thread
and it may take a few weeks for me to get to it.
I don't think we need to revert 04a09ee provided the issue is unrelated and
a fix is in development.
I've reviewed the links posted upthread and analyzed the statistics of such
failures:
yes, it happens rather frequently in Cirrus CI, but there can be dozens
of successful runs in between; for example,
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F45%2F3686
has 1 postgres_fdw failure on Windows per 32 runs.
And there is only one such failure in the last 90 days in the buildfarm.
(Perhaps the probability of the failure depends on external factors, such as
concurrent activity.)
So I would not say that it's a dominant failure for now, and given that
04a09ee lives in master only, maybe we can save two commits (Revert +
Revert of revert) while moving to a more permanent solution.
Best regards,
Alexander
On Sun, Dec 10, 2023 at 12:00:01PM +0300, Alexander Lakhin wrote:
So I would not say that it's a dominant failure for now, and given that
04a09ee lives in master only, maybe we can save two commits (Revert +
Revert of revert) while moving to a more permanent solution.
I just checked in on this one to see whether we needed to create an "open
item" for v17. While I'm not seeing the failures anymore, the patch that
Alexander claimed should fix it [0] doesn't appear to have been committed,
either. Perhaps this was fixed by something else...
[0]: /messages/by-id/CA+hUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ@mail.gmail.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hello Nathan,
08.03.2024 01:00, Nathan Bossart wrote:
On Sun, Dec 10, 2023 at 12:00:01PM +0300, Alexander Lakhin wrote:
So I would not say that it's a dominant failure for now, and given that
04a09ee lives in master only, maybe we can save two commits (Revert +
Revert of revert) while moving to a more permanent solution.
I just checked in on this one to see whether we needed to create an "open
item" for v17. While I'm not seeing the failures anymore, the patch that
Alexander claimed should fix it [0] doesn't appear to have been committed,
either. Perhaps this was fixed by something else...
[0]: /messages/by-id/CA+hUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ@mail.gmail.com
I have re-run the tests and found out that the issue was fixed by
d3c5f37dd. It changed the body of the loop "while (PQisBusy(conn))",
formerly contained in pgfdw_get_result(), as follows:
                /* Data available in socket? */
                if (wc & WL_SOCKET_READABLE)
                {
                    if (!PQconsumeInput(conn))
                        pgfdw_report_error(ERROR, NULL, conn, false, query);
                }
->
        /* Consume whatever data is available from the socket */
        if (PQconsumeInput(conn) == 0)
        {
            /* trouble; expect PQgetResult() to return NULL */
            break;
        }
That is, the unconditional PQconsumeInput() call eliminates the test
timeout.
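For context, here is a rough self-contained sketch of that loop's shape (the
function name is made up, and plain select() stands in for the backend's
latch/WaitEventSet wait, which is where the Windows readiness report can get
lost):

#include <sys/select.h>
#include <libpq-fe.h>

static PGresult *
get_result_sketch(PGconn *conn)
{
    /* Wait until a complete result has arrived, then fetch it. */
    while (PQisBusy(conn))
    {
        int     sock = PQsocket(conn);
        fd_set  rfds;

        FD_ZERO(&rfds);
        FD_SET(sock, &rfds);
        if (select(sock + 1, &rfds, NULL, NULL, NULL) < 0)
            return NULL;

        /*
         * The old coding called PQconsumeInput() only when the wait
         * reported the socket readable; after d3c5f37dd it is called
         * unconditionally on every wakeup, and with that the test no
         * longer times out here.
         */
        if (PQconsumeInput(conn) == 0)
            break;      /* trouble; expect PQgetResult() to return NULL */
    }
    return PQgetResult(conn);
}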
Best regards,
Alexander
On Sat, Mar 09, 2024 at 10:00:00AM +0300, Alexander Lakhin wrote:
I have re-run the tests and found out that the issue was fixed by
d3c5f37dd. It changed the body of the loop "while (PQisBusy(conn))",
formerly contained in pgfdw_get_result(), as follows:
                /* Data available in socket? */
                if (wc & WL_SOCKET_READABLE)
                {
                    if (!PQconsumeInput(conn))
                        pgfdw_report_error(ERROR, NULL, conn, false, query);
                }
->
        /* Consume whatever data is available from the socket */
        if (PQconsumeInput(conn) == 0)
        {
            /* trouble; expect PQgetResult() to return NULL */
            break;
        }
That is, the unconditional PQconsumeInput() call eliminates the test
timeout.
Thanks for confirming! I'm assuming this just masks the underlying
issue...
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com