Basebackup fails without useful error message

Started by Koen De Grooteover 1 year ago11 messagesgeneral

kdg.dev@gmail.com

over 1 year ago

Having run a basebackup, I'm getting this output at the very end:

pg_basebackup: child process exited with error 1
pg_basebackup: removing contents of data directory "/mnt/base_backup/dir/"

Is there a way to get more information as to what exactly happened?

I'd like to look into fixing this or doing whatever is required so that it
doesn't happen again, but this just isn't enough info. Where do I start
looking?

Regards,
Koen De Groote

Adrian Klaver

adrian.klaver@aklaver.com

over 1 year ago

In reply to: Koen De Groote (#1)

Re: Basebackup fails without useful error message

On 9/29/24 07:48, Koen De Groote wrote:

Having run a basebackup, I'm getting this output at the very end:

pg_basebackup: child process exited with error 1
pg_basebackup: removing contents of data directory "/mnt/base_backup/dir/"

What is the complete command you are using?

Is there a way to get more information as to what exactly happened?

Have you looked at the Postgres log?

Is --verbose being used?

I'd like to look into fixing this or doing whatever is required so that
it doesn't happen again, but this just isn't enough info. Where do I
start looking?

Regards,
Koen De Groote

--
Adrian Klaver
adrian.klaver@aklaver.com

Koen De Groote

kdg.dev@gmail.com

over 1 year ago

In reply to: Adrian Klaver (#2)

Re: Basebackup fails without useful error message

What is the complete command you are using?

The full command is:

pg_basebackup -h localhost -p 5432 -U basebackup_user -D
/mnt/base_backup/dir -Ft -z -P

So output Format as tar, gzipped, and with progress being printed.

Have you looked at the Postgres log?

Is --verbose being used?

This is straight from the logs, it's the only output besides the % progress
counter.

Will have a look at --verbose.

Regards,
Koen De Groote

On Sun, Sep 29, 2024 at 5:01 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

Show quoted text

On 9/29/24 07:48, Koen De Groote wrote:

Having run a basebackup, I'm getting this output at the very end:

pg_basebackup: child process exited with error 1
pg_basebackup: removing contents of data directory

"/mnt/base_backup/dir/"

What is the complete command you are using?

Is there a way to get more information as to what exactly happened?

Have you looked at the Postgres log?

Is --verbose being used?

I'd like to look into fixing this or doing whatever is required so that
it doesn't happen again, but this just isn't enough info. Where do I
start looking?

Regards,
Koen De Groote

--
Adrian Klaver
adrian.klaver@aklaver.com

Adrian Klaver

adrian.klaver@aklaver.com

over 1 year ago

In reply to: Koen De Groote (#3)

Re: Basebackup fails without useful error message

On 9/29/24 08:57, Koen De Groote wrote:

What is the complete command you are using?

The full command is:

pg_basebackup -h localhost -p 5432 -U basebackup_user -D
/mnt/base_backup/dir -Ft -z -P

So output Format as tar, gzipped, and with progress being printed.

Have you looked at the Postgres log?

Is --verbose being used?

This is straight from the logs, it's the only output besides the %
progress counter.

Will have a look at --verbose.

When you report on that and if it does not report the error then what is?:

Postgres version.

OS and version.

Anything special about the cluster like tablespaces, extensions,
replication, etc.

Regards,
Koen De Groote

--
Adrian Klaver
adrian.klaver@aklaver.com

Muhammad Usman Khan

usman.k@bitnine.net

over 1 year ago

In reply to: Koen De Groote (#1)

Re: Basebackup fails without useful error message

Hi,
You can enable verbose to get more detail about what is happening during
backup
pg_basebackup -h your_host -U your_user -D /mnt/base_backup/dir/ -v
Also examine recent postgres logs
Verify backup directory permissions

On Sun, 29 Sept 2024 at 19:49, Koen De Groote <kdg.dev@gmail.com> wrote:

Show quoted text

Having run a basebackup, I'm getting this output at the very end:

pg_basebackup: child process exited with error 1
pg_basebackup: removing contents of data directory "/mnt/base_backup/dir/"

Is there a way to get more information as to what exactly happened?

I'd like to look into fixing this or doing whatever is required so that it
doesn't happen again, but this just isn't enough info. Where do I start
looking?

Regards,
Koen De Groote

Koen De Groote

kdg.dev@gmail.com

over 1 year ago

In reply to: Adrian Klaver (#4)

Re: Basebackup fails without useful error message

Hello Adrian, and everyone else.

It has finally happened, the backup ran into an error again, and the
verbose output set me on the right path.

I'm getting this error message:

pg_basebackup: could not receive data from WAL stream: server closed the

connection unexpectedly

This probably means the server terminated abnormally
before or while processing the request.

Combined with the main server logging:

terminating walsender process due to replication timeout

Now, the server is set up with an archive_command which gzips the WAL files
and writes them to a network filesystem.

From looking at machine metrics at the time, my conclusion is the following:

At the time of the error, the remote filesystem experienced a very high
queue size for new writes.

So I'm assuming the process of writing WAL files, if there is an
archive_command set, is only considered to be finished after the archive is
written, not just when the WAL file is written in pg_wal.

I'm also seeing in the documentation that the default WAL method for
pg_basebackup is "stream", which waits for these WAL files as they are
produced.

I suspect that I have 2 possible paths at this point:

1: increase wal_sender_timeout
2: run the basebackup with --wal-method=none since my restore_command is
set up to explicitly go to the very same network storage to get the
archived WAL files.

I'm going to be testing this. If someone could confirm that this is how
writing WAL files works, that being: that it is only considered "done" when
the archive_command is done, that would be great.

Regards,
Koen De Groote

On Sun, Sep 29, 2024 at 6:08 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

Show quoted text

On 9/29/24 08:57, Koen De Groote wrote:

What is the complete command you are using?

The full command is:

pg_basebackup -h localhost -p 5432 -U basebackup_user -D
/mnt/base_backup/dir -Ft -z -P

So output Format as tar, gzipped, and with progress being printed.

Have you looked at the Postgres log?

Is --verbose being used?

This is straight from the logs, it's the only output besides the %
progress counter.

Will have a look at --verbose.

When you report on that and if it does not report the error then what is?:

Postgres version.

OS and version.

Anything special about the cluster like tablespaces, extensions,
replication, etc.

Regards,
Koen De Groote

--
Adrian Klaver
adrian.klaver@aklaver.com

Adrian Klaver

adrian.klaver@aklaver.com

over 1 year ago

In reply to: Koen De Groote (#6)

Re: Basebackup fails without useful error message

On 10/20/24 14:03, Koen De Groote wrote:

So I'm assuming the process of writing WAL files, if there is an
archive_command set, is only considered to be finished after the archive
is written, not just when the WAL file is written in pg_wal.

https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-ARCHIVING-WAL

"It is important that the archive command return zero exit status if and
only if it succeeds. Upon getting a zero result, PostgreSQL will assume
that the file has been successfully archived, and will remove or recycle
it. However, a nonzero status tells PostgreSQL that the file was not
archived; it will try again periodically until it succeeds."

Regards,
Koen De Groote

--
Adrian Klaver
adrian.klaver@aklaver.com

Koen De Groote

kdg.dev@gmail.com

over 1 year ago

In reply to: Adrian Klaver (#7)

Re: Basebackup fails without useful error message

See, I'm reading that, and my conclusion is "Ah great, it will try again,
everything will be fine."

There's a link between the archive_command, pg_basebackup and
wal_sender_timeout, but that link isn't clear from just reading all these
documentation pages when learning about their concepts separately.

It's just too much information to read and then "just know" in your head
that these must be the relations. Especially if you have dozens of other
things to do.

That's my 2 cents.

Regards,
Koen De Groote

On Sun, Oct 20, 2024 at 11:12 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

Show quoted text

On 10/20/24 14:03, Koen De Groote wrote:

So I'm assuming the process of writing WAL files, if there is an
archive_command set, is only considered to be finished after the archive
is written, not just when the WAL file is written in pg_wal.

https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-ARCHIVING-WAL

"It is important that the archive command return zero exit status if and
only if it succeeds. Upon getting a zero result, PostgreSQL will assume
that the file has been successfully archived, and will remove or recycle
it. However, a nonzero status tells PostgreSQL that the file was not
archived; it will try again periodically until it succeeds."

Regards,
Koen De Groote

--
Adrian Klaver
adrian.klaver@aklaver.com

Adrian Klaver

adrian.klaver@aklaver.com

over 1 year ago

In reply to: Koen De Groote (#8)

Re: Basebackup fails without useful error message

On 10/21/24 13:49, Koen De Groote wrote:

See, I'm reading that, and my conclusion is "Ah great, it will try
again, everything will be fine."

Unless it never completes and you hit the timeout.

There's a link between the archive_command, pg_basebackup and
wal_sender_timeout, but that link isn't clear from just reading all
these documentation pages when learning about their concepts separately.

It's just too much information to read and then "just know" in your head
that these must be the relations. Especially if you have dozens of other
things to do.

That's my 2 cents.

Regards,
Koen De Groote

--
Adrian Klaver
adrian.klaver@aklaver.com

#10

David G. Johnston

david.g.johnston@gmail.com

over 1 year ago

In reply to: Koen De Groote (#6)

Re: Basebackup fails without useful error message

On Sunday, October 20, 2024, Koen De Groote <kdg.dev@gmail.com> wrote:

I'm going to be testing this. If someone could confirm that this is how
writing WAL files works, that being: that it is only considered "done" when
the archive_command is done, that would be great.

The archiving of WAL files by the primary does not involve a replication
connection of any sort and thus the “WAL sender” settings are not relevant
to it; or, here, whether or not you are archiving your WAL is immaterial
since you are streaming it as it gets produced.

If you are streaming WAL it seems highly unusual that you’d end up in a
situation where the connection goes idle long enough that it gets killed,
especially if the backup is still happening. I’d probably go with
performing the backup under a disabled (or extremely large?) timeout though
and move on to other things.

That isn’t to say I fully understand what actually is happening here…

David J.

#11

Koen De Groote

kdg.dev@gmail.com

over 1 year ago

In reply to: David G. Johnston (#10)

Re: Basebackup fails without useful error message

Hello David,

I saw the backup fail. The backup logged that it terminated the walsender,
and correlating the moment it failed to the metrics of my storage, shows
the storage at that time was facing a huge IOWAIT. And this was a network
mounted storage.

The backup process continued, but because of a failure to stream WAL
without error(due to a local issue) the entire backup was marked as failed.
At the end, pg_basebackup will delete the backup, in this case. There's no
flag to control this final behavior.

I'll be testing restore soon without streaming WAL, since the actual
restore I perform doesn't use the pg_wal.tar.gz file. It gets the archived
WAL At least I think it doesn't need it, hence the need for testing.

Regards,
Koen De Groote

On Tue, Oct 22, 2024 at 12:34 AM David G. Johnston <
david.g.johnston@gmail.com> wrote:

Show quoted text

On Sunday, October 20, 2024, Koen De Groote <kdg.dev@gmail.com> wrote:

I'm going to be testing this. If someone could confirm that this is how
writing WAL files works, that being: that it is only considered "done" when
the archive_command is done, that would be great.

The archiving of WAL files by the primary does not involve a replication
connection of any sort and thus the “WAL sender” settings are not relevant
to it; or, here, whether or not you are archiving your WAL is immaterial
since you are streaming it as it gets produced.

If you are streaming WAL it seems highly unusual that you’d end up in a
situation where the connection goes idle long enough that it gets killed,
especially if the backup is still happening. I’d probably go with
performing the backup under a disabled (or extremely large?) timeout though
and move on to other things.

That isn’t to say I fully understand what actually is happening here…

David J.