PostgreSQL 10 not archiving some WAL files

Started by Norberto Delleabout 6 years ago4 messagesgeneral

betodelle@gmail.com

about 6 years ago

I have a strange case in some PostgreSQL instances that I monitor. First
the software versions and setup of a specific one:

- Windows Server 2012 Foundation
- PostgreSQL 10.10 x64
- archive_command = 'copy% p "D: \\ backup \\ wal \\% f" / y'

This setup worked very well for a long time, but since we upgraded
PostgreSQL to newer versions (10.x), sometimes it just skips archiving some
wal files. This seems to happen mainly when the server is shut down.
There's no messages about errors in the logs, and also no corresponding
.done file in archive_status, it just goes to the next wal file.

I would like to know if any of you had or has a problem like this, and if
you had, how you solved it.

Michael Paquier

michael@paquier.xyz

about 6 years ago

In reply to: Norberto Delle (#1)

Re: PostgreSQL 10 not archiving some WAL files

On Wed, Mar 18, 2020 at 10:57:22AM -0300, Norberto Dellê wrote:

This setup worked very well for a long time, but since we upgraded
PostgreSQL to newer versions (10.x), sometimes it just skips archiving some
wal files. This seems to happen mainly when the server is shut down.
There's no messages about errors in the logs, and also no corresponding
.done file in archive_status, it just goes to the next wal file.

I would like to know if any of you had or has a problem like this, and if
you had, how you solved it.

Hmm. I don't recall seeing any of that. This would mean that we are
either missing the generation of some .ready file, or that some .done
file gets generated when they should not in archive_status/. What
kind of server shutdown are you doing? Immediate so as recovery
happens at the follow-up startup. Or is that a clean service
shutdown?
--
Michael

Norberto Delle

betodelle@gmail.com

about 6 years ago

In reply to: Michael Paquier (#2)

Re: PostgreSQL 10 not archiving some WAL files

Show quoted text

On Wed, Mar 18, 2020 at 10:57:22AM -0300, Norberto Dellê wrote:

This setup worked very well for a long time, but since we upgraded
PostgreSQL to newer versions (10.x), sometimes it just skips archiving

some

wal files. This seems to happen mainly when the server is shut down.
There's no messages about errors in the logs, and also no corresponding
.done file in archive_status, it just goes to the next wal file.

I would like to know if any of you had or has a problem like this, and

if

you had, how you solved it.

Hmm. I don't recall seeing any of that. This would mean that we are
either missing the generation of some .ready file, or that some .done
file gets generated when they should not in archive_status/. What
kind of server shutdown are you doing? Immediate so as recovery
happens at the follow-up startup. Or is that a clean service
shutdown?
--
Michael
...

Hi Michael

Almost daily the server is shutdown, because my client wants to save
energy.
In the logs, it appears that the service was not shut down properly.
Here's the last lines of one log file:

2020-03-16 18:10:03.130 -04 [1964] LOG: received fast shutdown request
2020-03-16 18:10:03.132 -04 [2948] ERROR: canceling statement due to user
request
2020-03-16 18:10:03.192 -04 [1964] LOG: aborting any active transactions
2020-03-16 18:10:03.194 -04 [1964] LOG: worker process: logical
replication launcher (PID 2948) exited with exit code 1
2020-03-16 18:10:03.210 -04 [2188] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.210 -04 [6688] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.210 -04 [3348] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.210 -04 [6356] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.210 -04 [5736] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.210 -04 [4028] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.211 -04 [2552] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.212 -04 [4280] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.210 -04 [2336] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.227 -04 [6564] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.255 -04 [2884] LOG: shutting down

When the service is started the next day, this is what Postgres logs:

2020-03-17 07:47:26.721 -04 [2272] LOG: database system was interrupted;
last known up at 2020-03-16 17:56:20 -04
2020-03-17 07:47:34.974 -04 [2884] FATAL: the database system is starting
up
2020-03-17 07:47:39.378 -04 [2272] LOG: database system was not properly
shut down; automatic recovery in progress
2020-03-17 07:47:39.631 -04 [2272] LOG: redo starts at 38/F30F5108
2020-03-17 07:47:40.038 -04 [3052] FATAL: the database system is starting
up
2020-03-17 07:47:40.126 -04 [2272] LOG: redo done at 38/F4001738
2020-03-17 07:47:40.207 -04 [2272] LOG: last completed transaction was at
log time 2020-03-16 18:06:53.269798-04
2020-03-17 07:47:42.590 -04 [2080] LOG: database system is ready to
accept connections

In this shutdown/startup process, this PostgreSQL instance skipped
archiving a wal file.
Another thing that I perceived today is that in another instance, days
after (about a week) skipping a wal file, the file was finally archived.
Both of these instances that I talked about also have streaming
replication configured. But I don't think it is related because there other
instances skipping archiving wal files that don't have streaming
replication configured.

Norberto

Norberto Delle

betodelle@gmail.com

about 6 years ago

In reply to: Michael Paquier (#2)

Re: PostgreSQL 10 not archiving some WAL files

On Wed, Mar 18, 2020 at 10:57:22AM -0300, Norberto Dellê wrote:

This setup worked very well for a long time, but since we upgraded
PostgreSQL to newer versions (10.x), sometimes it just skips archiving

some

wal files. This seems to happen mainly when the server is shut down.
There's no messages about errors in the logs, and also no corresponding
.done file in archive_status, it just goes to the next wal file.

I would like to know if any of you had or has a problem like this, and if
you had, how you solved it.

Hmm. I don't recall seeing any of that. This would mean that we are
either missing the generation of some .ready file, or that some .done
file gets generated when they should not in archive_status/. What
kind of server shutdown are you doing? Immediate so as recovery
happens at the follow-up startup. Or is that a clean service
shutdown?
--
Michael
...

Hi Michael

Almost daily the server is shutdown, because my client wants to save energy.
In the logs, it appears that the service was not shut down properly. Here's
the last lines of one log file:

2020-03-16 18:10:03.130 -04 [1964] LOG: received fast shutdown request
2020-03-16 18:10:03.132 -04 [2948] ERROR: canceling statement due to user
request
2020-03-16 18:10:03.192 -04 [1964] LOG: aborting any active transactions
2020-03-16 18:10:03.194 -04 [1964] LOG: worker process: logical
replication launcher (PID 2948) exited with exit code 1
2020-03-16 18:10:03.210 -04 [2188] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.210 -04 [6688] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.210 -04 [3348] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.210 -04 [6356] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.210 -04 [5736] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.210 -04 [4028] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.211 -04 [2552] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.212 -04 [4280] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.210 -04 [2336] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.227 -04 [6564] FATAL: terminating connection due to
administrator command
2020-03-16 18:10:03.255 -04 [2884] LOG: shutting down

When the service is started the next day, this is what Postgres logs:

2020-03-17 07:47:26.721 -04 [2272] LOG: database system was interrupted;
last known up at 2020-03-16 17:56:20 -04
2020-03-17 07:47:34.974 -04 [2884] FATAL: the database system is starting
up
2020-03-17 07:47:39.378 -04 [2272] LOG: database system was not properly
shut down; automatic recovery in progress
2020-03-17 07:47:39.631 -04 [2272] LOG: redo starts at 38/F30F5108
2020-03-17 07:47:40.038 -04 [3052] FATAL: the database system is starting
up
2020-03-17 07:47:40.126 -04 [2272] LOG: redo done at 38/F4001738
2020-03-17 07:47:40.207 -04 [2272] LOG: last completed transaction was at
log time 2020-03-16 18:06:53.269798-04
2020-03-17 07:47:42.590 -04 [2080] LOG: database system is ready to accept
connections

In this shutdown/startup process, this PostgreSQL instance skipped
archiving a wal file.
Another thing that I perceived today is that in another instance, days
after (about a week) skipping a wal file, the file was finally archived.
Both of these instances that I talked about also have streaming replication
configured. But I don't think it is related because there other instances
skipping archiving wal files that don't have streaming replication
configured.

Norberto