Transaction timeout
Hello,
We have statement_timeout, idle_in_transaction_timeout,
idle_session_timeout and many more! But we have no
transaction_timeout. I've skimmed thread [0,1] about existing timeouts
and found no contraindications to implement transaction_timeout.
Nikolay asked me if I can prototype the feature for testing by him,
and it seems straightforward. Please find attached. If it's not known
to be a bad idea - we'll work on it.
Thanks!
Best regards, Andrey Borodin.
[0]: /messages/by-id/763A0689-F189-459E-946F-F0EC4458980B@hotmail.com
Attachments:
v1-0001-Intorduce-transaction_timeout.patchapplication/octet-stream; name=v1-0001-Intorduce-transaction_timeout.patchDownload+52-1
On Fri, Dec 2, 2022 at 9:18 PM Andrey Borodin <amborodin86@gmail.com> wrote:
Hello,
We have statement_timeout, idle_in_transaction_timeout,
idle_session_timeout and many more! But we have no
transaction_timeout. I've skimmed thread [0,1] about existing timeouts
and found no contraindications to implement transaction_timeout.Nikolay asked me if I can prototype the feature for testing by him,
and it seems straightforward. Please find attached. If it's not known
to be a bad idea - we'll work on it.
Thanks!! It was a super quick reaction to my proposal Honestly, I was
thinking about it for several years, wondering why it's still not
implemented.
The reasons to have it should be straightforward – here are a couple of
them I can see.
First one. In the OLTP context, we usually have:
- a hard timeout set in application server
- a hard timeout set in HTTP server
- users not willing to wait more than several seconds – and almost never
being ready to wait for more than 30-60s.
If Postgres allows longer transactions (it does since we cannot reliably
limit their duration now, it's always virtually unlimited), it might be
doing the work that nobody is waiting / is needing anymore, speeding
resources, affecting health, etc.
Why we cannot limit transaction duration reliably? The existing timeouts
(namely, statement_timeout + idle_session_timeout) don't protect from
having transactions consisting of a series of small statements and short
pauses between them. If such behavior happens (e.g., a long series of fast
UPDATEs in a loop). It can be dangerous, affecting general DB health (bloat
issues). This is reason number two – DBAs might want to decide to minimize
the cases of long transactions, setting transaction limits globally (and
allowing to set it locally for particular sessions or for some users in
rare cases).
Speaking of the patch – I just tested it (gitpod:
https://gitpod.io/#https://gitlab.com/NikolayS/postgres/tree/transaction_timeout);
it applies, works as expected for single-statement transactions:
postgres=# set transaction_timeout to '2s';
SET
postgres=# select pg_sleep(3);
ERROR: canceling transaction due to transaction timeout
But it fails in the "worst" case I've described above – a series of small
statements:
postgres=# set transaction_timeout to '2s';
SET
postgres=# begin; select pg_sleep(1); select pg_sleep(1); select
pg_sleep(1); select pg_sleep(1); select pg_sleep(1); commit;
BEGIN
pg_sleep
----------
(1 row)
pg_sleep
----------
(1 row)
pg_sleep
----------
(1 row)
pg_sleep
----------
(1 row)
pg_sleep
----------
(1 row)
COMMIT
postgres=#
On Fri, Dec 2, 2022 at 10:59 PM Nikolay Samokhvalov
<samokhvalov@gmail.com> wrote:
But it fails in the "worst" case I've described above – a series of small statements:
Fixed. Added test for this.
Open questions:
1. Docs
2. Order of reporting if happened lock_timeout, statement_timeout, and
transaction_timeout simultaneously. Currently there's a lot of code
around this...
Thanks!
Best regards, Andrey Borodin.
Attachments:
v2-0001-Intorduce-transaction_timeout.patchapplication/octet-stream; name=v2-0001-Intorduce-transaction_timeout.patchDownload+74-1
On Sat, Dec 3, 2022 at 9:41 AM Andrey Borodin <amborodin86@gmail.com> wrote:
Fixed. Added test for this.
Thanks! Tested (gitpod:
https://gitpod.io/#https://gitlab.com/NikolayS/postgres/tree/transaction_timeout-v2
),
works as expected.
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested
Tested, works as expected;
documentation is not yet added
Hi,
On 2022-12-03 09:41:04 -0800, Andrey Borodin wrote:
@@ -2720,6 +2723,7 @@ finish_xact_command(void)
if (xact_started)
{
+
CommitTransactionCommand();#ifdef MEMORY_CONTEXT_CHECKING
Spurious newline added.
@@ -4460,6 +4473,10 @@ PostgresMain(const char *dbname, const char *username) enable_timeout_after(IDLE_SESSION_TIMEOUT, IdleSessionTimeout); } + + + if (get_timeout_active(TRANSACTION_TIMEOUT)) + disable_timeout(TRANSACTION_TIMEOUT, false); }/* Report any recently-changed GUC options */
Too many newlines added.
I'm a bit worried about adding evermore branches and function calls for
the processing of single statements. We already spend a noticable
percentage of the cycles for a single statement in PostgresMain(), this
adds additional overhead.
I'm somewhat inclined to think that we need some redesign here before we
add more overhead.
@@ -1360,6 +1363,16 @@ IdleInTransactionSessionTimeoutHandler(void)
SetLatch(MyLatch);
}+static void +TransactionTimeoutHandler(void) +{ +#ifdef HAVE_SETSID + /* try to signal whole process group */ + kill(-MyProcPid, SIGINT); +#endif + kill(MyProcPid, SIGINT); +} +
Why does this use signals instead of just setting the latch like
IdleInTransactionSessionTimeoutHandler() etc?
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index 0081873a72..5229fe3555 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -3089,6 +3089,7 @@ _doSetFixedOutputState(ArchiveHandle *AH) ahprintf(AH, "SET statement_timeout = 0;\n"); ahprintf(AH, "SET lock_timeout = 0;\n"); ahprintf(AH, "SET idle_in_transaction_session_timeout = 0;\n"); + ahprintf(AH, "SET transaction_timeout = 0;\n");
Hm - why is that the right thing to do?
diff --git a/src/test/isolation/specs/timeouts.spec b/src/test/isolation/specs/timeouts.spec index c747b4ae28..a7f27811c7 100644 --- a/src/test/isolation/specs/timeouts.spec +++ b/src/test/isolation/specs/timeouts.spec @@ -23,6 +23,9 @@ step sto { SET statement_timeout = '10ms'; } step lto { SET lock_timeout = '10ms'; } step lsto { SET lock_timeout = '10ms'; SET statement_timeout = '10s'; } step slto { SET lock_timeout = '10s'; SET statement_timeout = '10ms'; } +step tto { SET transaction_timeout = '10ms'; } +step sleep0 { SELECT pg_sleep(0.0001) } +step sleep10 { SELECT pg_sleep(0.01) } step locktbl { LOCK TABLE accounts; } step update { DELETE FROM accounts WHERE accountid = 'checking'; } teardown { ABORT; } @@ -47,3 +50,5 @@ permutation wrtbl lto update(*) permutation wrtbl lsto update(*) # statement timeout expires first, row-level lock permutation wrtbl slto update(*) +# transaction timeout +permutation tto sleep0 sleep0 sleep10(*) \ No newline at end of file
I don't think this is quite sufficient. I think the test should verify
that transaction timeout interacts correctly with statement timeout /
idle in tx timeout.
Greetings,
Andres Freund
Thanks for looking into this Andres!
On Mon, Dec 5, 2022 at 3:07 PM Andres Freund <andres@anarazel.de> wrote:
I'm a bit worried about adding evermore branches and function calls for
the processing of single statements. We already spend a noticable
percentage of the cycles for a single statement in PostgresMain(), this
adds additional overhead.I'm somewhat inclined to think that we need some redesign here before we
add more overhead.
We can cap statement_timeout\idle_session_timeout by the budget of
transaction_timeout left.
Either way we can do batch function enable_timeouts() instead
enable_timeout_after().
Does anything of it make sense?
@@ -1360,6 +1363,16 @@ IdleInTransactionSessionTimeoutHandler(void)
SetLatch(MyLatch);
}+static void +TransactionTimeoutHandler(void) +{ +#ifdef HAVE_SETSID + /* try to signal whole process group */ + kill(-MyProcPid, SIGINT); +#endif + kill(MyProcPid, SIGINT); +} +Why does this use signals instead of just setting the latch like
IdleInTransactionSessionTimeoutHandler() etc?
I just copied statement_timeout behaviour. As I understand this
implementation is prefered if the timeout can catch the backend
running at full steam.
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index 0081873a72..5229fe3555 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -3089,6 +3089,7 @@ _doSetFixedOutputState(ArchiveHandle *AH) ahprintf(AH, "SET statement_timeout = 0;\n"); ahprintf(AH, "SET lock_timeout = 0;\n"); ahprintf(AH, "SET idle_in_transaction_session_timeout = 0;\n"); + ahprintf(AH, "SET transaction_timeout = 0;\n");Hm - why is that the right thing to do?
Because transaction_timeout has effects of statement_timeout.
Thank you!
Best regards, Andrey Borodin.
Hi,
On 2022-12-05 15:41:29 -0800, Andrey Borodin wrote:
Thanks for looking into this Andres!
On Mon, Dec 5, 2022 at 3:07 PM Andres Freund <andres@anarazel.de> wrote:
I'm a bit worried about adding evermore branches and function calls for
the processing of single statements. We already spend a noticable
percentage of the cycles for a single statement in PostgresMain(), this
adds additional overhead.I'm somewhat inclined to think that we need some redesign here before we
add more overhead.We can cap statement_timeout\idle_session_timeout by the budget of
transaction_timeout left.
I don't know what you mean by that.
@@ -3277,6 +3282,7 @@ ProcessInterrupts(void) */ lock_timeout_occurred = get_timeout_indicator(LOCK_TIMEOUT, true); stmt_timeout_occurred = get_timeout_indicator(STATEMENT_TIMEOUT, true); + tx_timeout_occurred = get_timeout_indicator(TRANSACTION_TIMEOUT, true);/*
* If both were set, we want to report whichever timeout completed
This doesn't update the preceding comment, btw, which now reads oddly:
/*
* If LOCK_TIMEOUT and STATEMENT_TIMEOUT indicators are both set, we
* need to clear both, so always fetch both.
*/
@@ -1360,6 +1363,16 @@ IdleInTransactionSessionTimeoutHandler(void)
SetLatch(MyLatch);
}+static void +TransactionTimeoutHandler(void) +{ +#ifdef HAVE_SETSID + /* try to signal whole process group */ + kill(-MyProcPid, SIGINT); +#endif + kill(MyProcPid, SIGINT); +} +Why does this use signals instead of just setting the latch like
IdleInTransactionSessionTimeoutHandler() etc?I just copied statement_timeout behaviour. As I understand this
implementation is prefered if the timeout can catch the backend
running at full steam.
Hm. I'm not particularly convinced by that code. Be that as it may, I
don't think it's a good idea to have one more copy of this code. At
least the patch should wrap the signalling code in a helper.
FWIW, the HAVE_SETSID code originates in:
commit 3ad0728c817bf8abd2c76bd11d856967509b307c
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: 2006-11-21 20:59:53 +0000
On systems that have setsid(2) (which should be just about everything except
Windows), arrange for each postmaster child process to be its own process
group leader, and deliver signals SIGINT, SIGTERM, SIGQUIT to the whole
process group not only the direct child process. This provides saner behavior
for archive and recovery scripts; in particular, it's possible to shut down a
warm-standby recovery server using "pg_ctl stop -m immediate", since delivery
of SIGQUIT to the startup subprocess will result in killing the waiting
recovery_command. Also, this makes Query Cancel and statement_timeout apply
to scripts being run from backends via system(). (There is no support in the
core backend for that, but it's widely done using untrusted PLs.) Per gripe
from Stephen Harris and subsequent discussion.
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index 0081873a72..5229fe3555 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -3089,6 +3089,7 @@ _doSetFixedOutputState(ArchiveHandle *AH) ahprintf(AH, "SET statement_timeout = 0;\n"); ahprintf(AH, "SET lock_timeout = 0;\n"); ahprintf(AH, "SET idle_in_transaction_session_timeout = 0;\n"); + ahprintf(AH, "SET transaction_timeout = 0;\n");Hm - why is that the right thing to do?
Because transaction_timeout has effects of statement_timeout.
I guess it's just following precedent - but it seems a bit presumptuous
to just disable safety settings a DBA might have set up. That makes some
sense for e.g. idle_in_transaction_session_timeout, because I think
e.g. parallel backup can lead to a connection being idle for a bit.
A few more review comments:
Either way we can do batch function enable_timeouts() instead
enable_timeout_after().
Does anything of it make sense?
I'm at least as worried about the various calls *after* the execution of
a statement.
+ if (tx_timeout_occurred) + { + LockErrorCleanup(); + ereport(ERROR, + (errcode(ERRCODE_TRANSACTION_TIMEOUT), + errmsg("canceling transaction due to transaction timeout"))); + }
The number of calls to LockErrorCleanup() here feels wrong - there's
already 8 calls in ProcessInterrupts(). Besides the code duplication I
also think it's not a sane idea to rely on having LockErrorCleanup()
before all the relevant ereport(ERROR)s.
Greetings,
Andres Freund
At Mon, 5 Dec 2022 15:07:47 -0800, Andres Freund <andres@anarazel.de> wrote in
I'm a bit worried about adding evermore branches and function calls for
the processing of single statements. We already spend a noticable
percentage of the cycles for a single statement in PostgresMain(), this
adds additional overhead.I'm somewhat inclined to think that we need some redesign here before we
add more overhead.
insert_timeout() and remove_timeout_index() move 40*(# of several
timeouts) bytes at every enabling/disabling a timeout. This is far
frequent than actually any timeout fires. schedule_alarm() is
interested only in the nearest timeout.
So, we can get rid of the insertion sort in
insert_timeout/remove_timeout_index then let them just search for the
nearest one and remember it. Finding the nearest should be faster than
the insertion sort. Instead we need to scan over the all timeouts
instead of the a few first ones, but that's overhead is not a matter
when a timeout fires.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2022-12-06 09:44:01 +0900, Kyotaro Horiguchi wrote:
At Mon, 5 Dec 2022 15:07:47 -0800, Andres Freund <andres@anarazel.de> wrote in
I'm a bit worried about adding evermore branches and function calls for
the processing of single statements. We already spend a noticable
percentage of the cycles for a single statement in PostgresMain(), this
adds additional overhead.I'm somewhat inclined to think that we need some redesign here before we
add more overhead.insert_timeout() and remove_timeout_index() move 40*(# of several
timeouts) bytes at every enabling/disabling a timeout. This is far
frequent than actually any timeout fires. schedule_alarm() is
interested only in the nearest timeout.
So, we can get rid of the insertion sort in
insert_timeout/remove_timeout_index then let them just search for the
nearest one and remember it. Finding the nearest should be faster than
the insertion sort. Instead we need to scan over the all timeouts
instead of the a few first ones, but that's overhead is not a matter
when a timeout fires.
I'm most concerned about the overhead when the timeouts are *not*
enabled. And this adds a branch to start_xact_command() and a function
call for get_timeout_active(TRANSACTION_TIMEOUT) in that case. On its
own, that's not a whole lot, but it does add up. There's 10+ function
calls for timeout and ps_display purposes for every single statement.
But it's definitely also worth optimizing the timeout enabled paths. And
you're right, it looks like there's a fair bit of optimization
potential.
Greetings,
Andres Freund
At Mon, 5 Dec 2022 17:10:50 -0800, Andres Freund <andres@anarazel.de> wrote in
I'm most concerned about the overhead when the timeouts are *not*
enabled. And this adds a branch to start_xact_command() and a function
call for get_timeout_active(TRANSACTION_TIMEOUT) in that case. On its
own, that's not a whole lot, but it does add up. There's 10+ function
calls for timeout and ps_display purposes for every single statement.
That path seems like existing just for robustness. I inserted
"Assert(0)" just before the disable_timeout(), but make check-world
didn't fail [1]# 032_apply_delay.pl fails for me so I don't know any of the later # tests fails.. Couldn't we get rid of that path, adding an assertion
instead? I'm not sure about other timeouts yet, though.
About disabling side, we cannot rely on StatementTimeout.
[1]: # 032_apply_delay.pl fails for me so I don't know any of the later # tests fails.
# 032_apply_delay.pl fails for me so I don't know any of the later
# tests fails.
But it's definitely also worth optimizing the timeout enabled paths. And
you're right, it looks like there's a fair bit of optimization
potential.
Thanks. I'll work on that.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2022-12-03 09:41:04 -0800, Andrey Borodin wrote:
Fixed. Added test for this.
The tests don't pass: https://cirrus-ci.com/build/4811553145356288
[00:54:35.337](1.251s) not ok 1 - no parameters missing from postgresql.conf.sample
[00:54:35.338](0.000s) # Failed test 'no parameters missing from postgresql.conf.sample'
# at /tmp/cirrus-ci-build/src/test/modules/test_misc/t/003_check_guc.pl line 81.
[00:54:35.338](0.000s) # got: '1'
# expected: '0'
I am just looking through a bunch of failing CF entries, so I'm perhaps
over-sensitized right now. But I'm a bit confused why there's so many
occasions of the tests clearly not having been run...
Michael, any reason 003_check_guc doesn't show the missing GUCs? It's not
particularly helpful to see "0 is different from 1". Seems that even just
something like
is_deeply(\@missing_from_file, [], "no parameters missing from postgresql.conf.sample");
would be a decent improvement?
Greetings,
Andres Freund
On Wed, Dec 7, 2022 at 10:23 AM Andres Freund <andres@anarazel.de> wrote:
On 2022-12-03 09:41:04 -0800, Andrey Borodin wrote:
Fixed. Added test for this.
The tests don't pass: https://cirrus-ci.com/build/4811553145356288
oops, sorry. Here's the fix. I hope to address other feedback on the
weekend. Thank you!
Best regards, Andrey Borodin.
Attachments:
v3-0001-Intorduce-transaction_timeout.patchapplication/octet-stream; name=v3-0001-Intorduce-transaction_timeout.patchDownload+75-1
On Wed, Dec 7, 2022 at 1:30 PM Andrey Borodin <amborodin86@gmail.com> wrote:
I hope to address other feedback on the weekend.
Andres, here's my progress on working with your review notes.
@@ -3277,6 +3282,7 @@ ProcessInterrupts(void) */ lock_timeout_occurred = get_timeout_indicator(LOCK_TIMEOUT, true); stmt_timeout_occurred = get_timeout_indicator(STATEMENT_TIMEOUT, true); + tx_timeout_occurred = get_timeout_indicator(TRANSACTION_TIMEOUT, true);/*
* If both were set, we want to report whichever timeout completedThis doesn't update the preceding comment, btw, which now reads oddly:
I've rewritten this part to correctly report all timeouts that did
happen. However there's now a tricky comma-formatting code which was
tested only manually.
@@ -1360,6 +1363,16 @@ IdleInTransactionSessionTimeoutHandler(void)
SetLatch(MyLatch);
}+static void +TransactionTimeoutHandler(void) +{ +#ifdef HAVE_SETSID + /* try to signal whole process group */ + kill(-MyProcPid, SIGINT); +#endif + kill(MyProcPid, SIGINT); +} +Why does this use signals instead of just setting the latch like
IdleInTransactionSessionTimeoutHandler() etc?I just copied statement_timeout behaviour. As I understand this
implementation is prefered if the timeout can catch the backend
running at full steam.Hm. I'm not particularly convinced by that code. Be that as it may, I
don't think it's a good idea to have one more copy of this code. At
least the patch should wrap the signalling code in a helper.
Done, now there is a single CancelOnTimeoutHandler() handler.
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index 0081873a72..5229fe3555 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -3089,6 +3089,7 @@ _doSetFixedOutputState(ArchiveHandle *AH) ahprintf(AH, "SET statement_timeout = 0;\n"); ahprintf(AH, "SET lock_timeout = 0;\n"); ahprintf(AH, "SET idle_in_transaction_session_timeout = 0;\n"); + ahprintf(AH, "SET transaction_timeout = 0;\n");Hm - why is that the right thing to do?
Because transaction_timeout has effects of statement_timeout.
I guess it's just following precedent - but it seems a bit presumptuous
to just disable safety settings a DBA might have set up. That makes some
sense for e.g. idle_in_transaction_session_timeout, because I think
e.g. parallel backup can lead to a connection being idle for a bit.
I do not know. My reasoning - everywhere we turn off
statement_timeout, we should turn off transaction_timeout too.
But I have no strong opinion here. I left this code as is in the patch
so far. For the same reason I did not change anything in
pg_backup_archiver.c.
Either way we can do batch function enable_timeouts() instead
enable_timeout_after().Does anything of it make sense?
I'm at least as worried about the various calls *after* the execution of
a statement.
I think this code is just a one bit check
if (get_timeout_active(TRANSACTION_TIMEOUT))
inside of get_timeout_active(). With all 14 timeouts we have, I don't
see a good way to optimize stuff so far.
+ if (tx_timeout_occurred) + { + LockErrorCleanup(); + ereport(ERROR, + (errcode(ERRCODE_TRANSACTION_TIMEOUT), + errmsg("canceling transaction due to transaction timeout"))); + }The number of calls to LockErrorCleanup() here feels wrong - there's
already 8 calls in ProcessInterrupts(). Besides the code duplication I
also think it's not a sane idea to rely on having LockErrorCleanup()
before all the relevant ereport(ERROR)s.
I've refactored that code down to 7 calls of LockErrorCleanup() :)
Logic behind various branches is not clear for me, e.g. why we do not
LockErrorCleanup() when reading commands from a client?
So I did not risk refactoring further.
I think the test should verify
that transaction timeout interacts correctly with statement timeout /
idle in tx timeout.
I've added tests that check statement_timeout vs transaction_timeout.
However I could not produce stable tests with
idle_in_transaction_timeout vs transaction_timeout so far. But I'll
look into this more.
Actually, stabilizing statement_timeout vs transaction_timeout was
tricky on Windows too. I had to remove the second call to
pg_sleep(0.0001) because it was triggering 10ьs timeout from time to
time. Also, test timeout was increased to 30ms, because unlike others
in spec it's not supposed to happen at the very first SQL statement.
Thank you!
Best regards, Andrey Borodin.
Attachments:
v4-0001-Intorduce-transaction_timeout.patchapplication/octet-stream; name=v4-0001-Intorduce-transaction_timeout.patchDownload+110-29
On Sun, Dec 18, 2022 at 12:53:31PM -0800, Andrey Borodin wrote:
I've rewritten this part to correctly report all timeouts that did
happen. However there's now a tricky comma-formatting code which was
tested only manually.
I suspect this will make translation difficult.
+ ahprintf(AH, "SET transaction_timeout = 0;\n");
Hm - why is that the right thing to do?
Because transaction_timeout has effects of statement_timeout.
I guess it's just following precedent - but it seems a bit presumptuous
to just disable safety settings a DBA might have set up. That makes some
sense for e.g. idle_in_transaction_session_timeout, because I think
e.g. parallel backup can lead to a connection being idle for a bit.I do not know. My reasoning - everywhere we turn off
statement_timeout, we should turn off transaction_timeout too.
But I have no strong opinion here. I left this code as is in the patch
so far. For the same reason I did not change anything in
pg_backup_archiver.c.
From 8383486's commit message:
We disable statement_timeout and lock_timeout during dump and restore,
to prevent any global settings that might exist from breaking routine
backups.
I imagine changing this could disrupt existing servers that depend on these
overrides during backups, although I think Andres has a good point about
disabling safety settings. This might be a good topic for another thread.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 12, 2023 at 11:24 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
On Sun, Dec 18, 2022 at 12:53:31PM -0800, Andrey Borodin wrote:
I've rewritten this part to correctly report all timeouts that did
happen. However there's now a tricky comma-formatting code which was
tested only manually.I suspect this will make translation difficult.
I use special functions for this like _()
char* lock_reason = lock_timeout_occurred ? _("lock timeout") : "";
and then
ereport(ERROR, (errcode(err_code),
errmsg("canceling statement due to %s%s%s%s%s", lock_reason, comma1,
stmt_reason, comma2, tx_reason)));
I hope it will be translatable...
+ ahprintf(AH, "SET transaction_timeout = 0;\n");
Hm - why is that the right thing to do?
Because transaction_timeout has effects of statement_timeout.
I guess it's just following precedent - but it seems a bit presumptuous
to just disable safety settings a DBA might have set up. That makes some
sense for e.g. idle_in_transaction_session_timeout, because I think
e.g. parallel backup can lead to a connection being idle for a bit.I do not know. My reasoning - everywhere we turn off
statement_timeout, we should turn off transaction_timeout too.
But I have no strong opinion here. I left this code as is in the patch
so far. For the same reason I did not change anything in
pg_backup_archiver.c.From 8383486's commit message:
We disable statement_timeout and lock_timeout during dump and restore,
to prevent any global settings that might exist from breaking routine
backups.I imagine changing this could disrupt existing servers that depend on these
overrides during backups, although I think Andres has a good point about
disabling safety settings. This might be a good topic for another thread.
+1.
Thanks for the review!
Best regards, Andrey Borodin.
On Thu, Jan 12, 2023 at 11:47 AM Andrey Borodin <amborodin86@gmail.com>
wrote:
On Thu, Jan 12, 2023 at 11:24 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:On Sun, Dec 18, 2022 at 12:53:31PM -0800, Andrey Borodin wrote:
I've rewritten this part to correctly report all timeouts that did
happen. However there's now a tricky comma-formatting code which was
tested only manually.
Testing it again, a couple of questions
1) The current test set has only 2 simple cases – I'd suggest adding one
more (that one that didn't work in v1):
gitpod=# set transaction_timeout to '20ms';
SET
gitpod=# begin; select pg_sleep(.01); select pg_sleep(.01); select
pg_sleep(.01); commit;
BEGIN
pg_sleep
----------
(1 row)
ERROR: canceling statement due to transaction timeout
gitpod=# set statement_timeout to '20ms'; set transaction_timeout to 0; --
to test value for statement_timeout and see that it doesn't fail
SET
SET
gitpod=# begin; select pg_sleep(.01); select pg_sleep(.01); select
pg_sleep(.01); commit;
BEGIN
pg_sleep
----------
(1 row)
pg_sleep
----------
(1 row)
pg_sleep
----------
(1 row)
COMMIT
2) Testing for a longer transaction (2 min), in a gitpod VM (everything is
local, no network involved)
// not sure what's happening here, maybe some overheads that are not
related to the implementation,
// but the goal was to see how precise the limiting is for longer
transactions
gitpod=# set transaction_timeout to '2min';
SET
gitpod=# begin;
BEGIN
gitpod=*# select now(), clock_timestamp(), pg_sleep(3) \watch 1
Fri 13 Jan 2023 03:49:24 PM UTC (every 1s)
now | clock_timestamp | pg_sleep
-------------------------------+-------------------------------+----------
2023-01-13 15:49:22.906924+00 | 2023-01-13 15:49:24.088728+00 |
(1 row)
[...]
Fri 13 Jan 2023 03:51:18 PM UTC (every 1s)
now | clock_timestamp | pg_sleep
-------------------------------+-------------------------------+----------
2023-01-13 15:49:22.906924+00 | 2023-01-13 15:51:18.179579+00 |
(1 row)
ERROR: canceling statement due to transaction timeout
gitpod=!#
gitpod=!# rollback;
ROLLBACK
gitpod=# select timestamptz '2023-01-13 15:51:18.179579+00' - '2023-01-13
15:49:22.906924+00';
?column?
-----------------
00:01:55.272655
(1 row)
gitpod=# select interval '2min' - '00:01:55.272655';
?column?
-----------------
00:00:04.727345
(1 row)
gitpod=# select interval '2min' - '00:01:55.272655' - '4s';
?column?
-----------------
00:00:00.727345
(1 row)
– it seems we could (should) have one more successful "1s wait, 3s sleep"
iteration here, ~727ms somehow wasted in a loop, quite a lot.
Thanks for the review Nikolay!
On Fri, Jan 13, 2023 at 8:03 AM Nikolay Samokhvalov
<samokhvalov@gmail.com> wrote:
1) The current test set has only 2 simple cases – I'd suggest adding one more (that one that didn't work in v1):
gitpod=# set transaction_timeout to '20ms';
SET
gitpod=# begin; select pg_sleep(.01); select pg_sleep(.01); select pg_sleep(.01); commit;
I tried exactly these tests - tests were unstable on Windows. Maybe
that OS has a more coarse-grained timer resolution.
It's a tradeoff between time spent on tests, strength of the test and
probability of false failure. I chose small time without false alarms.
– it seems we could (should) have one more successful "1s wait, 3s sleep" iteration here, ~727ms somehow wasted in a loop, quite a lot.
I think big chunk from these 727ms were spent between "BEGIN" and
"select now(), clock_timestamp(), pg_sleep(3) \watch 1". I doubt patch
really contains arithmetic errors.
Many thanks for looking into this!
Best regards, Andrey Borodin.
On Fri, Jan 13, 2023 at 10:16 AM Andrey Borodin <amborodin86@gmail.com>
wrote:
– it seems we could (should) have one more successful "1s wait, 3s
sleep" iteration here, ~727ms somehow wasted in a loop, quite a lot.
I think big chunk from these 727ms were spent between "BEGIN" and
"select now(), clock_timestamp(), pg_sleep(3) \watch 1".
Not really – there was indeed ~2s delay between BEGIN and the first
pg_sleep query, but those ~727ms is something else.
here we measure the remainder between the beginning of the transaction
measured by "now()' and the the beginning of the last successful pg_sleep()
query:
gitpod=# select timestamptz '2023-01-13 15:51:18.179579+00' - '2023-01-13
15:49:22.906924+00';
?column?
-----------------
00:01:55.272655
(1 row)
It already includes all delays that we had from the beginning of our
transaction.
The problem with my question was that I didn't take into attention that
'2023-01-13 15:51:18.179579+00' is when the last successful query
*started*. So the remainder of our 2-min quota – 00:00:04.727345 – includes
the last successful loop (3s of successful query + 1s of waiting), and then
we have failed after ~700ms.
In other words, there are no issues here, all good.
Many thanks for looking into this!
many thanks for implementing it
On 12.01.23 20:46, Andrey Borodin wrote:
On Sun, Dec 18, 2022 at 12:53:31PM -0800, Andrey Borodin wrote:
I've rewritten this part to correctly report all timeouts that did
happen. However there's now a tricky comma-formatting code which was
tested only manually.I suspect this will make translation difficult.
I use special functions for this like _()
char* lock_reason = lock_timeout_occurred ? _("lock timeout") : "";
and then
ereport(ERROR, (errcode(err_code),
errmsg("canceling statement due to %s%s%s%s%s", lock_reason, comma1,
stmt_reason, comma2, tx_reason)));I hope it will be translatable...
No, you can't do that. You have to write out all the strings separately.