Always show correct error message for statement timeouts, fixes random buildfarm failures

Started by Jelte Fennema-Nio11 months ago6 messages

me@jeltef.nl

11 months ago

1 attachment(s)

(was part of "Add non-blocking version of PQcancel" thread[1]/messages/by-id/AM5PR83MB0178D3B31CA1B6EC4A8ECC42F7529@AM5PR83MB0178.EURPRD83.prod.outlook.com)

The commitfest app notified me that this patch needed a rebase. While
doing that I'm also moving this patch out of the original thread it
was part of. It fixes a bug that causes this buildfarm failure[2]https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#timeouts.spec_failed_because_of_statement_cancelled_due_to_unexpected_reason. In
rare cases the current code would report "canceling statement due to
user request", even though it was actually canceled due to a statement
timeout, so it should have this error message: "canceling statement
due to statement timeout". A way to reproduce this problem locally was
provided by Alexander[3]/messages/by-id/4db099c8-4a52-3cc4-e970-14539a319466@gmail.com.

I took it out of the original thread because that has become
impossible to follow (at least for me, so I expect also for others)
due to discussing 3 different random failures involving cancel
requests. And this specific one doesn't even have anything to do with
the original non-blocking PQCancel work.

I have inlined the parts of emails that are relevant to this patch
below. That way others don't have to dissect that thread. (for context
this patch is often, but not always, called patch 0002 in the original
thread).

My initial email containing the patch[4]/messages/by-id/CAGECzQSBp=sCcMCy1txig7cV0VX9hDDb1UdZt3zpDpZe2X2K-Q@mail.gmail.com:

On Mon, 24 Jun 2024 at 00:59, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

On Sat, 22 Jun 2024 at 17:00, Alexander Lakhin <exclusion@gmail.com> wrote:
I also came across another failure of the test:
@@ -2774,7 +2774,7 @@
BEGIN;
SET LOCAL statement_timeout = '10ms';
select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- this takes very long
-ERROR:  canceling statement due to statement timeout
+ERROR:  canceling statement due to user request
COMMIT;
which is reproduced with a sleep added here:
@@ -1065,6 +1065,7 @@ exec_simple_query(const char *query_string)
*/
parsetree_list = pg_parse_query(query_string);
+pg_usleep(11000);
After investigating, I realized this actually exposes a bug in our
statement timeout logic. It has nothing to do with posgres_fdw and
reproduces with any regular postgres query too. Attached is a patch
that fixes this issue. This one should probably be backported.

Tom's review[5]/messages/by-id/584811.1725048717@sss.pgh.pa.us:

On Fri, 30 Aug 2024 at 22:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think 0002 is probably outright wrong, or at least the change to
disable_statement_timeout is. Once we get to that, we don't want
to throw a timeout error any more, even if an interrupt was received
just before it.

My initial response to that review[6]/messages/by-id/CAGECzQR2HqQTFUddz-Q6y4sfMtTA+LRE7WqJtpPSrJKKUXtH7w@mail.gmail.com:

On Fri, 30 Aug 2024 at 23:24, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

The disable_statement_timeout change was not the part of that patch
that was necessary for stable output, only the change in the first
branch of enable_statement_timeout was necessary. The reason being
that enable_statement_timeout is called multiple times for a query,
because start_xact_command is called multiple times in
exec_simple_query. The change to disable_statement_timeout just seemed
like the logical extension of that change, especially since there was
basically a verbatim copy of disable_statement_timeout in the second
branch of enable_statement_timeout.

To make sure I understand your suggestion correctly: Are you saying
you would want to completely remove the outstanding interrupt if it
was caused by de statement_timout when disable_statement_timeout is
called? Because I agree that would probably make sense, but that
sounds like a more impactful change. But the current behaviour seems
strictly worse than the behaviour proposed in the patch to me, because
currently the backend would still be interrupted, but the error would
indicate a reason for the interrupt that is simply incorrect i.e. it
will say it was cancelled due to a user request, which never happened.

My second response to that review[7]/messages/by-id/CAGECzQQM1N0H3e9RbSqi4=mmFgfLaFNr6R7gNpFoePb9exRryg@mail.gmail.com

On Sat, 31 Aug 2024 at 09:08, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

I also took a quick look at the code again, and completely removing
the outstanding interrupt seems hard to do. Because there's no way to
know if there were multiple causes for the interupt, i.e. someone
could have pressed ctrl+c as well and we wouldn't want to undo that.

So I think the solution in 0002, while debatable if strictly correct,
is the only fix that we can easily do. Also I personally believe the
behaviour resulting from 0002 is totally correct: The new behaviour
would be that if a timeout occurred, right before it was disabled or
reset, but the interrupt was not processed yet, then we process that
timeout as normal. That seems totally reasonable behaviour to me from
the perspective of an end user: You get a timeout error when the
timeout occurred before the timeout was disabled/reset.

[1]: /messages/by-id/AM5PR83MB0178D3B31CA1B6EC4A8ECC42F7529@AM5PR83MB0178.EURPRD83.prod.outlook.com
[2]: https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#timeouts.spec_failed_because_of_statement_cancelled_due_to_unexpected_reason
[3]: /messages/by-id/4db099c8-4a52-3cc4-e970-14539a319466@gmail.com
[4]: /messages/by-id/CAGECzQSBp=sCcMCy1txig7cV0VX9hDDb1UdZt3zpDpZe2X2K-Q@mail.gmail.com
[5]: /messages/by-id/584811.1725048717@sss.pgh.pa.us
[6]: /messages/by-id/CAGECzQR2HqQTFUddz-Q6y4sfMtTA+LRE7WqJtpPSrJKKUXtH7w@mail.gmail.com
[7]: /messages/by-id/CAGECzQQM1N0H3e9RbSqi4=mmFgfLaFNr6R7gNpFoePb9exRryg@mail.gmail.com

Attachments:

v5-0001-Do-not-reset-statement_timeout-indicator-outside-.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Do-not-reset-statement_timeout-indicator-outside-.patchDownload

From 79637b130fec7db4a88588abb6e435cbe81d34cf Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
Date: Mon, 24 Jun 2024 00:29:39 +0200
Subject: [PATCH v5] Do not reset statement_timeout indicator outside of
 ProcessInterupts

The only way that ProcessInterupts can know why QueryCancelPending is
set is by looking at the indicator bits of the various timeouts. We were
resetting the one for STATEMENT_TIMEOUT in various places, thus possibly
causing ProcessInterupts to fail with the wrong error message.
---
 src/backend/tcop/postgres.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 1149d89d7a1..69021912be6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -5134,22 +5134,33 @@ enable_statement_timeout(void)
 	if (StatementTimeout > 0
 		&& (StatementTimeout < TransactionTimeout || TransactionTimeout == 0))
 	{
-		if (!get_timeout_active(STATEMENT_TIMEOUT))
+		/*
+		 * We check both if it's active or if it's already triggered. If it's
+		 * already triggered we don't want to restart it because that clears
+		 * the indicator flag, which in turn would cause the wrong error
+		 * message to be used by ProcessInterrupts() on the next
+		 * CHECK_FOR_INTERRUPTS() call. Restarting the timer in that case
+		 * would be pointless anyway, because the statement timeout error is
+		 * going to trigger on the next CHECK_FOR_INTERRUPTS() call.
+		 */
+		if (!get_timeout_active(STATEMENT_TIMEOUT)
+			&& !get_timeout_indicator(STATEMENT_TIMEOUT, false))
 			enable_timeout_after(STATEMENT_TIMEOUT, StatementTimeout);
 	}
 	else
 	{
-		if (get_timeout_active(STATEMENT_TIMEOUT))
-			disable_timeout(STATEMENT_TIMEOUT, false);
+		disable_statement_timeout();
 	}
 }
 
 /*
- * Disable statement timeout, if active.
+ * Disable statement timeout, if active. We preserve the indicator flag
+ * though, otherwise we'd lose the knowledge in ProcessInterupts that the
+ * SIGINT came from a statement timeout.
  */
 static void
 disable_statement_timeout(void)
 {
 	if (get_timeout_active(STATEMENT_TIMEOUT))
-		disable_timeout(STATEMENT_TIMEOUT, false);
+		disable_timeout(STATEMENT_TIMEOUT, true);
 }

base-commit: 217919dd0954f54402e8d0a38cd203a740754077
-- 
2.43.0

Jelte Fennema-Nio

me@jeltef.nl

about 1 month ago

In reply to: Jelte Fennema-Nio (#1)

Re: Always show correct error message for statement timeouts, fixes random buildfarm failures

On Tue, 18 Feb 2025 at 10:29, Jelte Fennema-Nio <me@jeltef.nl> wrote:

So I think the solution in 0002, while debatable if strictly correct,
is the only fix that we can easily do. Also I personally believe the
behaviour resulting from 0002 is totally correct: The new behaviour
would be that if a timeout occurred, right before it was disabled or
reset, but the interrupt was not processed yet, then we process that
timeout as normal. That seems totally reasonable behaviour to me from
the perspective of an end user: You get a timeout error when the
timeout occurred before the timeout was disabled/reset.

I still think there's no reason for this not to be committed, as it
fixes both a bug and a flaky test.

Show quoted text

On Tue, 18 Feb 2025 at 10:29, Jelte Fennema-Nio <me@jeltef.nl> wrote:

(was part of "Add non-blocking version of PQcancel" thread[1])

The commitfest app notified me that this patch needed a rebase. While
doing that I'm also moving this patch out of the original thread it
was part of. It fixes a bug that causes this buildfarm failure[2]. In
rare cases the current code would report "canceling statement due to
user request", even though it was actually canceled due to a statement
timeout, so it should have this error message: "canceling statement
due to statement timeout". A way to reproduce this problem locally was
provided by Alexander[3].

I took it out of the original thread because that has become
impossible to follow (at least for me, so I expect also for others)
due to discussing 3 different random failures involving cancel
requests. And this specific one doesn't even have anything to do with
the original non-blocking PQCancel work.

I have inlined the parts of emails that are relevant to this patch
below. That way others don't have to dissect that thread. (for context
this patch is often, but not always, called patch 0002 in the original
thread).

My initial email containing the patch[4]:

On Mon, 24 Jun 2024 at 00:59, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
On Sat, 22 Jun 2024 at 17:00, Alexander Lakhin <exclusion@gmail.com> wrote:
I also came across another failure of the test:
@@ -2774,7 +2774,7 @@
BEGIN;
SET LOCAL statement_timeout = '10ms';
select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- this takes very long
-ERROR:  canceling statement due to statement timeout
+ERROR:  canceling statement due to user request
COMMIT;
which is reproduced with a sleep added here:
@@ -1065,6 +1065,7 @@ exec_simple_query(const char *query_string)
*/
parsetree_list = pg_parse_query(query_string);
+pg_usleep(11000);
After investigating, I realized this actually exposes a bug in our
statement timeout logic. It has nothing to do with posgres_fdw and
reproduces with any regular postgres query too. Attached is a patch
that fixes this issue. This one should probably be backported.
Tom's review[5]:

On Fri, 30 Aug 2024 at 22:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think 0002 is probably outright wrong, or at least the change to
disable_statement_timeout is. Once we get to that, we don't want
to throw a timeout error any more, even if an interrupt was received
just before it.

My initial response to that review[6]:

On Fri, 30 Aug 2024 at 23:24, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

The disable_statement_timeout change was not the part of that patch
that was necessary for stable output, only the change in the first
branch of enable_statement_timeout was necessary. The reason being
that enable_statement_timeout is called multiple times for a query,
because start_xact_command is called multiple times in
exec_simple_query. The change to disable_statement_timeout just seemed
like the logical extension of that change, especially since there was
basically a verbatim copy of disable_statement_timeout in the second
branch of enable_statement_timeout.

To make sure I understand your suggestion correctly: Are you saying
you would want to completely remove the outstanding interrupt if it
was caused by de statement_timout when disable_statement_timeout is
called? Because I agree that would probably make sense, but that
sounds like a more impactful change. But the current behaviour seems
strictly worse than the behaviour proposed in the patch to me, because
currently the backend would still be interrupted, but the error would
indicate a reason for the interrupt that is simply incorrect i.e. it
will say it was cancelled due to a user request, which never happened.

My second response to that review[7]

On Sat, 31 Aug 2024 at 09:08, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

I also took a quick look at the code again, and completely removing
the outstanding interrupt seems hard to do. Because there's no way to
know if there were multiple causes for the interupt, i.e. someone
could have pressed ctrl+c as well and we wouldn't want to undo that.

So I think the solution in 0002, while debatable if strictly correct,
is the only fix that we can easily do. Also I personally believe the
behaviour resulting from 0002 is totally correct: The new behaviour
would be that if a timeout occurred, right before it was disabled or
reset, but the interrupt was not processed yet, then we process that
timeout as normal. That seems totally reasonable behaviour to me from
the perspective of an end user: You get a timeout error when the
timeout occurred before the timeout was disabled/reset.

[1]: /messages/by-id/AM5PR83MB0178D3B31CA1B6EC4A8ECC42F7529@AM5PR83MB0178.EURPRD83.prod.outlook.com
[2]: https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#timeouts.spec_failed_because_of_statement_cancelled_due_to_unexpected_reason
[3]: /messages/by-id/4db099c8-4a52-3cc4-e970-14539a319466@gmail.com
[4]: /messages/by-id/CAGECzQSBp=sCcMCy1txig7cV0VX9hDDb1UdZt3zpDpZe2X2K-Q@mail.gmail.com
[5]: /messages/by-id/584811.1725048717@sss.pgh.pa.us
[6]: /messages/by-id/CAGECzQR2HqQTFUddz-Q6y4sfMtTA+LRE7WqJtpPSrJKKUXtH7w@mail.gmail.com
[7]: /messages/by-id/CAGECzQQM1N0H3e9RbSqi4=mmFgfLaFNr6R7gNpFoePb9exRryg@mail.gmail.com

Akshat Jaimini

destrex271@gmail.com

about 1 month ago

In reply to: Jelte Fennema-Nio (#2)

Re: Always show correct error message for statement timeouts, fixes random buildfarm failures

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

Hi,

Just reviewed the patch and it seems harmless. I think we should commit this.

Regards,
Akshat Jaimini

The new status of this patch is: Ready for Committer

Tom Lane

tgl@sss.pgh.pa.us

3 days ago

In reply to: Akshat Jaimini (#3)

Re: Always show correct error message for statement timeouts, fixes random buildfarm failures

Akshat Jaimini <destrex271@gmail.com> writes:

Just reviewed the patch and it seems harmless. I think we should commit this.

I took another look and I still don't like it. The intent of the code
is that once we're out of a query, a query cancel that arrived
meanwhile should be a no-op. I completely disagree with the idea that
a late-arriving cancel should cancel the next query. (From the user's
standpoint: if you're a bit slow to hit control-C and so your query
finishes anyway, and then you go for coffee, would you be surprised
to come back awhile later and enter a new query and have it fail?)
And I think the same should apply to statement timeouts: if one
happens slightly too late to cancel the current query, that does not
mean it should cancel the next one.

That's not to say that I don't think there's anything to do about
this. I just don't think this is what to do.

Looking around at the code, I think perhaps some blame should fall
on this bit in PostgresMain:

/*
* (5) disable async signal conditions again.
*
* Query cancel is supposed to be a no-op when there is no query in
* progress, so if a query cancel arrived while we were idle, just
* reset QueryCancelPending. ProcessInterrupts() has that effect when
* it's called when DoingCommandRead is set, so check for interrupts
* before resetting DoingCommandRead.
*/
CHECK_FOR_INTERRUPTS();
DoingCommandRead = false;

We are taking care to discard a stale QueryCancelPending flag, but
this code is ignorant of the fact that timeouts now also feed into
QueryCancelPending. Perhaps we should clear the timeout indicators
here? Or maybe using CHECK_FOR_INTERRUPTS here is too cute and
we should just clear QueryCancelPending directly (and clear the
indicators too, likely).

Also, I could buy making the proposed change in
disable_statement_timeout without the one in enable_statement_timeout.
I think that would have the desired effect that if ProcessInterrupts
runs shortly after disable_statement_timeout it would not lie about
the cause of the cancel, while not sacrificing the principle that
a statement timeout can't kill the next statement.

As a more thorough redesign, maybe we should get rid of the
idea that QueryCancelPending represents all three of these
conditions. That would mean a bit more duplication of logic
in ProcessInterrupts, but I don't think that code is especially
performance-critical.

regards, tom lane

Jelte Fennema-Nio

postgres@jeltef.nl

2 days ago

In reply to: Tom Lane (#4)

1 attachment(s)

Re: Always show correct error message for statement timeouts, fixes random buildfarm failures

On Thu, 8 Jan 2026 at 18:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I took another look and I still don't like it. The intent of the code
is that once we're out of a query, a query cancel that arrived
meanwhile should be a no-op. I completely disagree with the idea that
a late-arriving cancel should cancel the next query. (From the user's
standpoint: if you're a bit slow to hit control-C and so your query
finishes anyway, and then you go for coffee, would you be surprised
to come back awhile later and enter a new query and have it fail?)
And I think the same should apply to statement timeouts: if one
happens slightly too late to cancel the current query, that does not
mean it should cancel the next one.

The fact that you're thinking that my intent of the patch was this, that
means I did a terrible job at explaining the bug that this patch is
fixing. So I'll try doing a better job now.

The bug this is fixing has nothing to do with multiple queries. It
occurs within a single query being processed by exec_simple_query:
1. At line 1051 start_xact_command is called and then calls
enable_statement_timeout to start the timeout for the query.
2. The code continues to line 1148 where start_xact_command is called AGAIN.
3. Now the timeout occurs (could also happen a bit earlier, as long as
no CHECK_FOR_INTERRUPTS is called). This sets the statement timeout
indicator and calls remove_timeout_index to mark the timeout as
inactive.
4. Now the second start_xact_command calls enable_statement_timeout AGAIN.
5. Because get_timeout_active returns false now due to the timeout,
enable_timeout_after is called again which resets the indicator flag
6. New CHECK_FOR_INTERRUPTS is called wherever in the query and
QueryCancelPending is still set (correctly), but the matching
indicator flag was lost.

The problem is that while the comment above 4 shows that the intent of
calling enable_statement_timeout again without an interceding
finish_xact_command is a no-op:

/*
* Start statement timeout if necessary. Note that this'll intentionally
* not reset the clock on an already started timeout, to avoid the timing
* overhead when start_xact_command() is invoked repeatedly, without an
* interceding finish_xact_command() (e.g. parse/bind/execute). If that's
* not desired, the timeout has to be disabled explicitly.
*/

But that's not actually the case if you run into this race where the
timeout has already fired, because then It resets the just-before-set
indicator flag. Note that the comment also explains that this is not the
only place where this can happen.

/*
* (5) disable async signal conditions again.
*
* Query cancel is supposed to be a no-op when there is no query in
* progress, so if a query cancel arrived while we were idle, just
* reset QueryCancelPending. ProcessInterrupts() has that effect when
* it's called when DoingCommandRead is set, so check for interrupts
* before resetting DoingCommandRead.
*/
CHECK_FOR_INTERRUPTS();
DoingCommandRead = false;

We are taking care to discard a stale QueryCancelPending flag, but
this code is ignorant of the fact that timeouts now also feed into
QueryCancelPending. Perhaps we should clear the timeout indicators
here? Or maybe using CHECK_FOR_INTERRUPTS here is too cute and
we should just clear QueryCancelPending directly (and clear the
indicators too, likely).

Thank you for pointing this out. I think (but I'm not exactly sure what
I was thinking over a year ago), that I had trusted the comment there,
and I had missed that my changes had inadvertently changed the behaviour
here because now the indicators could still be set at this
CHECK_FOR_INTERRUPTS call.

Notably CHECK_FOR_INTERRUPTS is already clearing the indicators (your
2nd proposed option), because it's calling get_timeout_indicator with
the true flag. Though that's not enough, because it still throws an
error if they are set even if DoingCommandRead=true. So the newly
attached patch changes that behaviour instead by checking if
DoingCommandRead is false before throwing any error. Does that resolve
your concerns about the possibility of cancels leaking to the next
query? (which to be extremely clear, was unintended)

Also, I could buy making the proposed change in
disable_statement_timeout without the one in enable_statement_timeout.
I think that would have the desired effect that if ProcessInterrupts
runs shortly after disable_statement_timeout it would not lie about
the cause of the cancel, while not sacrificing the principle that
a statement timeout can't kill the next statement.

As explained above the one in enable_statement_timeout is exactly the
one that is required to fix the bug in question.

As a more thorough redesign, maybe we should get rid of the
idea that QueryCancelPending represents all three of these
conditions. That would mean a bit more duplication of logic
in ProcessInterrupts, but I don't think that code is especially
performance-critical.

I would like that very much, as I think it would make the code much
easier to follow (I think it's pretty hard to understand code
currently). The problem I see with actually doing that though, is that
StatementTimeoutHandler is sending SIGINT to the whole process group. So
unless we want to stop doing that, I don't see a way of separating the
indicators from QueryCancelPending.

Attachments:

v6-0001-Do-not-reset-statement_timeout-indicator-outside-.patchtext/x-patch; charset=utf-8; name=v6-0001-Do-not-reset-statement_timeout-indicator-outside-.patchDownload

From 1020cd6daf6e513420984d172026218502f99ec2 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
Date: Mon, 24 Jun 2024 00:29:39 +0200
Subject: [PATCH v6] Do not reset statement_timeout indicator outside of
 ProcessInterrupts

The only way that ProcessInterrupts can know why QueryCancelPending is
set is by looking at the indicator bits of the various timeouts. The
STATEMENT_TIMEOUT indicator was being reset in two places, causing
ProcessInterrupts to report "user request" instead of "statement
timeout".

The primary reason for this patch is that this different error message
caused a test to be flaky. The reason this was happening is because
enable_statement_timeout would be called twice for a single query. The
intent of calling enable_statement_timeout twice is that the second call
was a no-op, but that was not the case in practice. If the timer expired
right before the second call, then the timer would restart and unset the
current indicator. The QueryCancelPending flag would still be true, so
on the next CHECK_FOR_INTERRUPTS it would throw the "canceled statement
due to user request" error.

The disable_statement_timeout function has a similar theoretical race
condition, where it would reset the indicator, but leave the cancel flag
set. So that one was changed too, even though none of our tests actually
trigger that specific bug.

The downside of keeping these indicators set, is that they can leak
outside of the query that's being run. To make sure that this causes no
problems in practice, ProcessInterrupts now also checks that
DoingCommandRead is false before canceling because of statement_timeout
or lock_timeout, in addition to resetting of those indicators (which it
already did). Given that CHECK_FOR_INTERRUPTS is called right before
changing the DoingCommandRead back to false, this will reset the
indicators before entering the next query.
---
 src/backend/tcop/postgres.c | 44 ++++++++++++++++++++++++-------------
 1 file changed, 29 insertions(+), 15 deletions(-)

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e54bf1e760f..b3b4afdac24 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3436,34 +3436,37 @@ ProcessInterrupts(void)
 			get_timeout_finish_time(STATEMENT_TIMEOUT) < get_timeout_finish_time(LOCK_TIMEOUT))
 			lock_timeout_occurred = false;	/* report stmt timeout */
 
-		if (lock_timeout_occurred)
+		if (DoingCommandRead)
+		{
+			/*
+			 * If we are reading a command from the client, just ignore the
+			 * cancel request --- sending an extra error message won't
+			 * accomplish anything.  Otherwise, go ahead and throw an error
+			 * with the correct reason.
+			 */
+		}
+		else if (lock_timeout_occurred)
 		{
 			LockErrorCleanup();
 			ereport(ERROR,
 					(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
 					 errmsg("canceling statement due to lock timeout")));
 		}
-		if (stmt_timeout_occurred)
+		else if (stmt_timeout_occurred)
 		{
 			LockErrorCleanup();
 			ereport(ERROR,
 					(errcode(ERRCODE_QUERY_CANCELED),
 					 errmsg("canceling statement due to statement timeout")));
 		}
-		if (AmAutoVacuumWorkerProcess())
+		else if (AmAutoVacuumWorkerProcess())
 		{
 			LockErrorCleanup();
 			ereport(ERROR,
 					(errcode(ERRCODE_QUERY_CANCELED),
 					 errmsg("canceling autovacuum task")));
 		}
-
-		/*
-		 * If we are reading a command from the client, just ignore the cancel
-		 * request --- sending an extra error message won't accomplish
-		 * anything.  Otherwise, go ahead and throw the error.
-		 */
-		if (!DoingCommandRead)
+		else
 		{
 			LockErrorCleanup();
 			ereport(ERROR,
@@ -5218,22 +5221,33 @@ enable_statement_timeout(void)
 	if (StatementTimeout > 0
 		&& (StatementTimeout < TransactionTimeout || TransactionTimeout == 0))
 	{
-		if (!get_timeout_active(STATEMENT_TIMEOUT))
+		/*
+		 * We check both if it's active or if it's already triggered. If it's
+		 * already triggered we don't want to restart it because that clears
+		 * the indicator flag, which in turn would cause the wrong error
+		 * message to be used by ProcessInterrupts() on the next
+		 * CHECK_FOR_INTERRUPTS() call. Restarting the timer in that case
+		 * would be pointless anyway, because the statement timeout error is
+		 * going to trigger on the next CHECK_FOR_INTERRUPTS() call.
+		 */
+		if (!get_timeout_active(STATEMENT_TIMEOUT)
+			&& !get_timeout_indicator(STATEMENT_TIMEOUT, false))
 			enable_timeout_after(STATEMENT_TIMEOUT, StatementTimeout);
 	}
 	else
 	{
-		if (get_timeout_active(STATEMENT_TIMEOUT))
-			disable_timeout(STATEMENT_TIMEOUT, false);
+		disable_statement_timeout();
 	}
 }
 
 /*
- * Disable statement timeout, if active.
+ * Disable statement timeout, if active. We preserve the indicator flag
+ * though, otherwise we'd lose the knowledge in ProcessInterupts that the
+ * SIGINT came from a statement timeout.
  */
 static void
 disable_statement_timeout(void)
 {
 	if (get_timeout_active(STATEMENT_TIMEOUT))
-		disable_timeout(STATEMENT_TIMEOUT, false);
+		disable_timeout(STATEMENT_TIMEOUT, true);
 }

base-commit: e5a5e0a90750d665cab417322b9f85c806430d85
-- 
2.52.0

Japin Li

japinli@hotmail.com

1 day ago

In reply to: Jelte Fennema-Nio (#5)

Re: Always show correct error message for statement timeouts, fixes random buildfarm failures

Hi, Jelte

On Fri, 09 Jan 2026 at 23:08, "Jelte Fennema-Nio" <postgres@jeltef.nl> wrote:

On Thu, 8 Jan 2026 at 18:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I took another look and I still don't like it. The intent of the code
is that once we're out of a query, a query cancel that arrived
meanwhile should be a no-op. I completely disagree with the idea that
a late-arriving cancel should cancel the next query. (From the user's
standpoint: if you're a bit slow to hit control-C and so your query
finishes anyway, and then you go for coffee, would you be surprised
to come back awhile later and enter a new query and have it fail?)
And I think the same should apply to statement timeouts: if one
happens slightly too late to cancel the current query, that does not
mean it should cancel the next one.

The fact that you're thinking that my intent of the patch was this, that
means I did a terrible job at explaining the bug that this patch is
fixing. So I'll try doing a better job now.

The bug this is fixing has nothing to do with multiple queries. It
occurs within a single query being processed by exec_simple_query:
1. At line 1051 start_xact_command is called and then calls
enable_statement_timeout to start the timeout for the query.
2. The code continues to line 1148 where start_xact_command is called AGAIN.
3. Now the timeout occurs (could also happen a bit earlier, as long as
no CHECK_FOR_INTERRUPTS is called). This sets the statement timeout
indicator and calls remove_timeout_index to mark the timeout as
inactive.
4. Now the second start_xact_command calls enable_statement_timeout AGAIN.
5. Because get_timeout_active returns false now due to the timeout,
enable_timeout_after is called again which resets the indicator flag
6. New CHECK_FOR_INTERRUPTS is called wherever in the query and
QueryCancelPending is still set (correctly), but the matching
indicator flag was lost.

The problem is that while the comment above 4 shows that the intent of
calling enable_statement_timeout again without an interceding
finish_xact_command is a no-op:

/*
* Start statement timeout if necessary. Note that this'll intentionally
* not reset the clock on an already started timeout, to avoid the timing
* overhead when start_xact_command() is invoked repeatedly, without an
* interceding finish_xact_command() (e.g. parse/bind/execute). If that's
* not desired, the timeout has to be disabled explicitly.
*/

But that's not actually the case if you run into this race where the
timeout has already fired, because then It resets the just-before-set
indicator flag. Note that the comment also explains that this is not the
only place where this can happen.

/*
* (5) disable async signal conditions again.
*
* Query cancel is supposed to be a no-op when there is no query in
* progress, so if a query cancel arrived while we were idle, just
* reset QueryCancelPending. ProcessInterrupts() has that effect when
* it's called when DoingCommandRead is set, so check for interrupts
* before resetting DoingCommandRead.
*/
CHECK_FOR_INTERRUPTS();
DoingCommandRead = false;

We are taking care to discard a stale QueryCancelPending flag, but
this code is ignorant of the fact that timeouts now also feed into
QueryCancelPending. Perhaps we should clear the timeout indicators
here? Or maybe using CHECK_FOR_INTERRUPTS here is too cute and
we should just clear QueryCancelPending directly (and clear the
indicators too, likely).

Thank you for pointing this out. I think (but I'm not exactly sure what
I was thinking over a year ago), that I had trusted the comment there,
and I had missed that my changes had inadvertently changed the behaviour
here because now the indicators could still be set at this
CHECK_FOR_INTERRUPTS call.

Notably CHECK_FOR_INTERRUPTS is already clearing the indicators (your
2nd proposed option), because it's calling get_timeout_indicator with
the true flag. Though that's not enough, because it still throws an
error if they are set even if DoingCommandRead=true. So the newly
attached patch changes that behaviour instead by checking if
DoingCommandRead is false before throwing any error. Does that resolve
your concerns about the possibility of cancels leaking to the next
query? (which to be extremely clear, was unintended)

Also, I could buy making the proposed change in
disable_statement_timeout without the one in enable_statement_timeout.
I think that would have the desired effect that if ProcessInterrupts
runs shortly after disable_statement_timeout it would not lie about
the cause of the cancel, while not sacrificing the principle that
a statement timeout can't kill the next statement.

As explained above the one in enable_statement_timeout is exactly the
one that is required to fix the bug in question.

As a more thorough redesign, maybe we should get rid of the
idea that QueryCancelPending represents all three of these
conditions. That would mean a bit more duplication of logic
in ProcessInterrupts, but I don't think that code is especially
performance-critical.

I would like that very much, as I think it would make the code much
easier to follow (I think it's pretty hard to understand code
currently). The problem I see with actually doing that though, is that
StatementTimeoutHandler is sending SIGINT to the whole process group. So
unless we want to stop doing that, I don't see a way of separating the
indicators from QueryCancelPending.

I recently encountered this error [0]/messages/by-id/MEAPR01MB30313EF4A7C22EA3495A861BB682A@MEAPR01MB3031.ausprd01.prod.outlook.com. I'm wondering if we could add a test
case to cover this situation.

[0]: /messages/by-id/MEAPR01MB30313EF4A7C22EA3495A861BB682A@MEAPR01MB3031.ausprd01.prod.outlook.com

--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.