In case of network issues, how long before archive_command does retries

Started by Koen De Groote, almost 4 years ago, 5 messages, general
#1Koen De Groote
kdg.dev@gmail.com

I've got a setup where archive_command will gzip the wal archive to a
directory that is itself an NFS mount.

When the connection is gone or blocked, archive_command fails after the timeout
specified by the NFS mount, as expected (for a soft mount; a hard mount
hangs, as expected).

However, once the connection is restored, it's not clear to me how long it
takes before the command is retried.

Experience says "a few minutes", but I can't find documentation on an exact
algorithm.

To be clear, the question is: if archive_command fails, what are the
specifics of retrying? Is there a timeout? How is that timeout defined?

Is this detailed somewhere? Perhaps in the source code? I couldn't find it
in the documentation.

For detail, I'm using PostgreSQL 11, running on Ubuntu 20.

Regards,
Koen
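
For readers who want something concrete, here is a minimal sketch of the kind of setup described above: a script that gzips one WAL file into a target directory and reports success or failure through its return code. It is not the poster's actual command. The function names `archive_wal` and `main` and the argument order are assumptions made for illustration. It writes to a temporary name first and renames afterwards, so a write that dies halfway on a flaky mount does not leave a truncated file under the final name.

```python
import gzip
import os
import shutil


def archive_wal(src_path: str, dest_dir: str, file_name: str) -> int:
    """Gzip src_path to dest_dir/file_name.gz. Return 0 on success, 1 on failure."""
    final_path = os.path.join(dest_dir, file_name + ".gz")
    tmp_path = final_path + ".tmp"
    if os.path.exists(final_path):
        # Refuse to overwrite a file that is already in the archive.
        return 1
    try:
        with open(src_path, "rb") as src, open(tmp_path, "wb") as raw:
            with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
                shutil.copyfileobj(src, gz)
            # Push the compressed bytes to storage before renaming.
            raw.flush()
            os.fsync(raw.fileno())
        os.rename(tmp_path, final_path)
        return 0
    except OSError:
        # Any I/O error (including an NFS timeout) counts as a failed attempt.
        try:
            os.remove(tmp_path)
        except OSError:
            pass
        return 1


def main(argv) -> int:
    # Expected arguments: <script> <path to WAL file> <target dir> <WAL file name>
    if len(argv) != 4:
        return 2
    return archive_wal(argv[1], argv[2], argv[3])

# A real script would end with: sys.exit(main(sys.argv))
```

Assuming the script were saved under a hypothetical path such as /usr/local/bin/archive_wal.py, it could be wired up with something like archive_command = 'python3 /usr/local/bin/archive_wal.py %p /mnt/nfs/wal_archive %f', where the mount point is likewise made up. A non-zero return code is what tells the archiver that the attempt failed and must be retried.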

#2Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Koen De Groote (#1)
Re: In case of network issues, how long before archive_command does retries

On Wed, 2022-05-18 at 22:51 +0200, Koen De Groote wrote:

> I've got a setup where archive_command will gzip the wal archive to a directory that is itself an NFS mount.
>
> When the connection is gone or blocked, archive_command fails after the timeout specified by the NFS mount, as expected (for a soft mount; a hard mount hangs, as expected).
>
> However, once the connection is restored, it's not clear to me how long it takes before the command is retried.
>
> Experience says "a few minutes", but I can't find documentation on an exact algorithm.
>
> To be clear, the question is: if archive_command fails, what are the specifics of retrying? Is there a timeout? How is that timeout defined?
>
> Is this detailed somewhere? Perhaps in the source code? I couldn't find it in the documentation.
>
> For detail, I'm using PostgreSQL 11, running on Ubuntu 20.

You can find the details in "src/backend/postmaster/pgarch.c".

The archiver will try to archive three times (NUM_ARCHIVE_RETRIES) at one-second
intervals, then back off until it receives a signal, PostgreSQL shuts down,
or a minute has passed.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

#3Koen De Groote
kdg.dev@gmail.com
In reply to: Laurenz Albe (#2)
Re: In case of network issues, how long before archive_command does retries

Hello Laurenz,

Thanks for the reply. That would mean the source code is here:
https://github.com/postgres/postgres/blob/REL_11_0/src/backend/postmaster/pgarch.c

Just to be sure, the "signal" you speak of, this is the result of the
command executed by archive_command?

If my understanding of the code is right, if no SIGTERM or other signal
arrives, it won't ever happen that a WAL segment is skipped if the
archive_command fails too many times or takes too long? It will simply
check again every 60 seconds (PGARCH_AUTOWAKE_INTERVAL)? Or is the 60
seconds the point where it stops trying, waiting for the next time
archive_command is invoked?

I'm assuming that as long as the file is still in the pg_wal directory and
as long as there is no ".done" file for that WAL segment under
pg_wal/archive_status, it will keep trying forever (or until someone
forcefully switches the timeline with, for instance, a base backup)?

Apologies, I already sent this message once, but only to Laurenz. Sending
again to have it in the archives.

Regards,
Koen

#4Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Koen De Groote (#3)
Re: In case of network issues, how long before archive_command does retries

On Thu, 2022-05-19 at 15:43 +0200, Koen De Groote wrote:

> On Thu, May 19, 2022 at 9:10 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
>> On Wed, 2022-05-18 at 22:51 +0200, Koen De Groote wrote:
>>
>>> When the connection is gone or blocked, archive_command fails after the timeout specified
>>> by the NFS mount, as expected (for a soft mount; a hard mount hangs, as expected).
>>>
>>> However, once the connection is restored, it's not clear to me how long it takes before the command is retried.
>>>
>>> Experience says "a few minutes", but I can't find documentation on an exact algorithm.
>>>
>>> To be clear, the question is: if archive_command fails, what are the specifics of retrying?
>>> Is there a timeout? How is that timeout defined?
>>>
>>> Is this detailed somewhere? Perhaps in the source code? I couldn't find it in the documentation.
>>>
>>> For detail, I'm using PostgreSQL 11, running on Ubuntu 20.
>>
>> You can find the details in "src/backend/postmaster/pgarch.c".
>>
>> The archiver will try to archive three times (NUM_ARCHIVE_RETRIES) at one-second
>> intervals, then back off until it receives a signal, PostgreSQL shuts down,
>> or a minute has passed.
>
> Thanks for the reply. That would mean the source code is here:
> https://github.com/postgres/postgres/blob/REL_11_0/src/backend/postmaster/pgarch.c

For release 11.0, yes.

> Just to be sure, the "signal" you speak of, this is the result of the command executed by archive_command?

No, that is an operating system signal.
PostgreSQL processes communicate by sending signals to each other, and if anybody
wakes up the archiver, it will try again.

> If my understanding of the code is right, if no SIGTERM or other signal arrives, it won't ever happen
> that a WAL segment is skipped if the archive_command fails too many times or takes too long? It
> will simply check again every 60 seconds (PGARCH_AUTOWAKE_INTERVAL)? Or is the 60 seconds the point
> where it stops trying, waiting for the next time archive_command is invoked?

Even if a signal arrives, PostgreSQL will keep trying to archive that same WAL segment
that failed until it is done.

This is a potential sequence of events:

try to archive -> fail
sleep 1 second
try to archive -> fail
sleep 1 second
try to archive -> fail
sleep 60 seconds
try to archive -> fail
sleep 1 second
try to archive -> fail
sleep 1 second
try to archive -> fail
sleep 60 seconds -> get woken up by a signal after 30 seconds
try to archive -> fail
sleep 1 second
try to archive -> fail
get shutdown request -> exit

When PostgreSQL restarts, it will continue trying to archive the same segment.
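
The schedule above can be reproduced with a small simulation. This is only a sketch of the timing as described in this thread, not the code in pgarch.c: it assumes each attempt takes no time at all (so it ignores the NFS soft-mount timeout every failed attempt would really cost), it assumes no signal ever wakes the archiver early, and the names `RETRY_SLEEP` and `simulate_archiver` are made up. `NUM_ARCHIVE_RETRIES` and `PGARCH_AUTOWAKE_INTERVAL` are the constant names mentioned in this thread.

```python
from typing import Callable, List, Tuple

NUM_ARCHIVE_RETRIES = 3        # attempts per round, as named in this thread
PGARCH_AUTOWAKE_INTERVAL = 60  # seconds between rounds, as named in this thread
RETRY_SLEEP = 1                # seconds between attempts in a round (hypothetical name)


def simulate_archiver(
    command_succeeds: Callable[[float], bool], max_time: float
) -> List[Tuple[float, str]]:
    """Return (time, event) pairs for one WAL segment until success or max_time."""
    events: List[Tuple[float, str]] = []
    now = 0.0
    while now <= max_time:
        for attempt in range(1, NUM_ARCHIVE_RETRIES + 1):
            if command_succeeds(now):
                events.append((now, "archived"))
                return events
            events.append((now, "failed"))
            if attempt < NUM_ARCHIVE_RETRIES:
                now += RETRY_SLEEP  # short pause between attempts in one round
        now += PGARCH_AUTOWAKE_INTERVAL  # back off before starting the next round
    return events


# Example: the connection comes back 30 seconds after the first failure.
timeline = simulate_archiver(lambda t: t >= 30, max_time=300)
```

In this model the first round fails at 0, 1 and 2 seconds and the segment is archived at 62 seconds, so a restored connection is picked up within roughly a minute unless a signal wakes the archiver sooner. Attempts that each block on a mount timeout would stretch that further.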

> I'm assuming that as long as the file is still in the pg_wal directory and as long as there is no
> ".done" file for that WAL segment under pg_wal/archive_status, it will keep trying forever (or until
> someone forcefully switches the timeline with, for instance, a base backup)?

Yes, it will keep trying, and a timeline switch won't change that.
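
To see which segments are still waiting during an outage, one can look at the status files. The sketch below lists the entries in pg_wal/archive_status that carry a ".ready" suffix, which is the marker PostgreSQL uses for files not yet archived (it is renamed to ".done" once archiving succeeds). The function name `pending_segments` is an assumption made for this example, and the directory has to be passed in by the caller.

```python
import os
from typing import List


def pending_segments(archive_status_dir: str) -> List[str]:
    """Return the names of files that have a .ready marker, sorted by name."""
    suffix = ".ready"
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(archive_status_dir)
        if name.endswith(suffix)
    )
```

A call such as pending_segments("/var/lib/postgresql/11/main/pg_wal/archive_status") would return the backlog on a host that uses that data directory. The path is only an example and depends on the installation. A list that keeps growing means archive_command is still failing.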

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

#5Koen De Groote
kdg.dev@gmail.com
In reply to: Laurenz Albe (#4)
Re: In case of network issues, how long before archive_command does retries

Thank you for your thorough explanation.
