In case of network issues, how long before archive_command retries?
I've got a setup where archive_command will gzip the wal archive to a
directory that is itself an NFS mount.
When connection is gone or blocked, archive_command fails after the timeout
specified by the NFS mount, as expected. (for a soft mount. hard mount
hangs, as expected)
However, on restoring connection, it's not clear to me how long it takes
before the command is retried.
Experience says "a few minutes", but I can't find documentation on an exact
algorithm.
To be clear, the question is: if archive_command fails, what are the
specifics of retrying? Is there a timeout? How is that timeout defined?
Is this detailed somewhere? Perhaps in the source code? I couldn't find it
in the documentation.
For detail, I'm using postgres 11, running on Ubuntu 20.
Regards,
Koen
On Wed, 2022-05-18 at 22:51 +0200, Koen De Groote wrote:
I've got a setup where archive_command will gzip the wal archive to a directory that is itself an NFS mount.
When connection is gone or blocked, archive_command fails after the timeout specified by the NFS mount, as expected. (for a soft mount. hard mount hangs, as expected)
However, on restoring connection, it's not clear to me how long it takes before the command is retried.
Experience says "a few minutes", but I can't find documentation on an exact algorithm.
To be clear, the question is: if archive_command fails, what are the specifics of retrying? Is there a timeout? How is that timeout defined?
Is this detailed somewhere? Perhaps in the source code? I couldn't find it in the documentation.
For detail, I'm using postgres 11, running on Ubuntu 20.
You can find the details in "src/backend/postmaster/pgarch.c".
The archiver will try to archive three times (NUM_ARCHIVE_RETRIES) at intervals
of one second, then back off until it receives a signal, PostgreSQL shuts down,
or a minute has passed.
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
Hello Laurenz,
Thanks for the reply. That would mean the source code is here:
https://github.com/postgres/postgres/blob/REL_11_0/src/backend/postmaster/pgarch.c
Just to be sure, the "signal" you speak of, this is the result of the
command executed by archive_command?
If my understanding of the code is right, if no SIGTERM or other signal
arrives, it won't ever happen that a WAL segment is skipped if the
archive_command fails too many times or takes too long? It will simply
check again every 60 seconds (PGARCH_AUTOWAKE_INTERVAL)? Or is the 60
seconds the point where it stops trying, waiting for the next time
archive_command is invoked?
I'm assuming that as long as the file is still in the pg_wal directory and
as long as there is no ".done" file for that WAL segment under
pg_wal/archive_status, it will keep trying forever (or until someone
forcefully switches the timeline with, for instance, a base backup)?
Apologies, I already sent this message once, but only to Laurenz. Sending
again to have it in the archives.
Regards,
Koen
On Thu, 2022-05-19 at 15:43 +0200, Koen De Groote wrote:
On Thu, May 19, 2022 at 9:10 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Wed, 2022-05-18 at 22:51 +0200, Koen De Groote wrote:
When connection is gone or blocked, archive_command fails after the timeout specified
by the NFS mount, as expected. (for a soft mount. hard mount hangs, as expected)
However, on restoring connection, it's not clear to me how long it takes before the command is retried.
Experience says "a few minutes", but I can't find documentation on an exact algorithm.
To be clear, the question is: if archive_command fails, what are the specifics of retrying?
Is there a timeout? How is that timeout defined?
Is this detailed somewhere? Perhaps in the source code? I couldn't find it in the documentation.
For detail, I'm using postgres 11, running on Ubuntu 20.
You can find the details in "src/backend/postmaster/pgarch.c".
The archiver will try to archive three times (NUM_ARCHIVE_RETRIES) at intervals
of one second, then back off until it receives a signal, PostgreSQL shuts down,
or a minute has passed.
Thanks for the reply. That would mean the source code is here:
https://github.com/postgres/postgres/blob/REL_11_0/src/backend/postmaster/pgarch.c
For release 11.0, yes.
Just to be sure, the "signal" you speak of, this is the result of the command executed by archive_command?
No, that is an operating system signal.
PostgreSQL processes communicate by sending signals to each other, and if anybody
wakes up the archiver, it will try again.
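As a rough analogy only (this is not how PostgreSQL implements it), "sleep until woken or until a timeout expires" can be sketched in Python with threading.Event. The names wait_for_wakeup and latch are hypothetical and chosen just for this illustration:

```python
import threading


def wait_for_wakeup(latch, timeout_seconds):
    """Block until the latch is set or the timeout expires; report which happened."""
    woken = latch.wait(timeout_seconds)  # True if set in time, False on timeout
    latch.clear()                        # reset so the next wait blocks again
    return "woken" if woken else "timeout"


if __name__ == "__main__":
    latch = threading.Event()
    # Another thread plays the role of the process that wakes the sleeper.
    threading.Timer(0.05, latch.set).start()
    print(wait_for_wakeup(latch, 2.0))  # returns early because the latch was set
    print(wait_for_wakeup(latch, 0.1))  # nobody sets the latch, so this times out
```

Either way the sleeper goes back to its work afterwards; the wake-up only shortens the wait.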
If my understanding of the code is right, if no SIGTERM or other signal arrives, it won't ever happen
that a walarchive is skipped if the archive_command fails too many times or takes too long? It
will simply check again every 60 seconds (PGARCH_AUTOWAKE_INTERVAL)? Or is the 60 seconds the point
where it stops trying, waiting for the next time archive_command is invoked?
Even if a signal arrives, PostgreSQL will keep trying to archive that same WAL segment
that failed until it is done.
This is a potential sequence of events:
try to archive -> fail
sleep 1 second
try to archive -> fail
sleep 1 second
try to archive -> fail
sleep 60 seconds
try to archive -> fail
sleep 1 second
try to archive -> fail
sleep 1 second
try to archive -> fail
sleep 60 seconds -> get woken up by a signal after 30 seconds
try to archive -> fail
sleep 1 second
try to archive -> fail
get shutdown request -> exit
When PostgreSQL restarts, it will continue trying to archive the same segment.
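The sequence above (ignoring signals and shutdown) can be sketched as a small simulation on a simulated clock. This is a model of the behaviour described in this thread, not the code in pgarch.c. NUM_ARCHIVE_RETRIES and PGARCH_AUTOWAKE_INTERVAL are the names mentioned above; RETRY_SLEEP_SECONDS, simulate_archiver, recovers_at and command_seconds are hypothetical names for this illustration:

```python
NUM_ARCHIVE_RETRIES = 3        # attempts per cycle, as described above
RETRY_SLEEP_SECONDS = 1        # pause between attempts within a cycle
PGARCH_AUTOWAKE_INTERVAL = 60  # back-off before the next cycle starts


def simulate_archiver(recovers_at, command_seconds=0, max_cycles=100):
    """Simulate retries for one WAL segment on a simulated clock.

    recovers_at: simulated time (seconds) from which archive_command succeeds.
    command_seconds: assumed duration of one failing attempt (e.g. an NFS timeout).
    Returns (events, success_time); success_time is None if max_cycles ran out.
    """
    clock = 0
    events = []
    for _ in range(max_cycles):
        for attempt in range(1, NUM_ARCHIVE_RETRIES + 1):
            if clock >= recovers_at:
                events.append((clock, "archive ok"))
                return events, clock
            events.append((clock, "archive failed"))
            clock += command_seconds
            if attempt < NUM_ARCHIVE_RETRIES:
                clock += RETRY_SLEEP_SECONDS
        # Three failures in a row: back off before trying the same segment again.
        clock += PGARCH_AUTOWAKE_INTERVAL
    return events, None


if __name__ == "__main__":
    events, success_time = simulate_archiver(recovers_at=10)
    for when, what in events:
        print(f"t={when:>3}s  {what}")
    print("archived at", success_time)
```

In this model, a connection that comes back during the back-off is only noticed at the next wake-up, so the delay after recovery is at most about a minute unless a signal wakes the archiver earlier.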
I'm assuming that as long as the file is still in the pg_wal directory and as long as there is no
".done" file for that walarchive under pg_wal/archive_status, it will keep trying forever (or until
someone forcefully switches the timeline with for instance a basebackup)?
Yes, it will keep trying, and a timeline switch won't change that.
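To see which segments are still waiting to be archived, one option is to look for ".ready" marker files in pg_wal/archive_status (they are renamed to ".done" once archiving succeeds). A minimal sketch, assuming the script can read the data directory; pending_segments and the example path are hypothetical:

```python
from pathlib import Path


def pending_segments(data_dir):
    """Return the WAL file names that have a .ready marker, sorted by name."""
    status_dir = Path(data_dir) / "pg_wal" / "archive_status"
    # "<name>.ready" means "<name>" has not been archived successfully yet.
    return sorted(p.stem for p in status_dir.glob("*.ready"))


if __name__ == "__main__":
    # Assumed data directory location; adjust to your installation.
    for name in pending_segments("/var/lib/postgresql/11/main"):
        print(name)
```

If the same name stays at the top of this list across several runs, the archiver is still retrying that segment.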
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
Thank you for your thorough explanation.