12.2: Why do my Redo Logs disappear??
Hi all,
this is a 12.2 Release on FreeBSD 11.3.
I am doing RedoLog Archiving according to Docs Chapter 25.1.
During the last week I have lost 4 distinct Redo Logs; they are
not in the backup.
Losing a RedoLog is very bad, because there is no redundancy:
losing a single one of them makes the timeline disappear, and it
will only reappear after another Base Backup. Very, very bad.
So I analyzed the matter.
There are three ways to restart a Berkeley machine:
1. Cowboy it (aka pull-the-plug). This makes sure that everything is
certainly dead immediately, and, given that all hardware and software is
well designed, nothing bad should happen.
2. Shut it down. This is the official means, and it takes very long,
because each and every application is signalled and given time to
bring up whatever objections it may have.
In this case "pg_ctl stop" will be invoked with whatever options the
sysop has configured, and postgres will copy out a full log into
archive before terminating.
3. Halt/Reboot it, like this:
https://www.freebsd.org/cgi/man.cgi?query=reboot&sektion=8&manpath=FreeBSD+11.3-RELEASE
This is considered safer than pull-the-plug, and is still fast.
Applications are killed without much waiting, but all disk buffers
are flushed to permanent storage and filesystems closed.
In this case, it seems, Postgres will delete the current log
without archiving it. :(
What precisely happens (according to the OS sources) during reboot
is this: processes will be sent SIGTERM, and after some 2-10
seconds followed by SIGKILL.
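For context, the setup from docs section 25.3.1 ("Setting Up WAL
Archiving") amounts to a postgresql.conf fragment of this shape. The
RedoLog.bck arguments match the syslog lines further down in this post,
but the exact configuration on this box is an assumption:

```shell
# postgresql.conf fragment (illustrative; actual settings on this box unknown)
archive_mode = on
# %p = relative path of the finished WAL segment, %f = its bare file name
archive_command = 'RedoLog.bck %p %f'
```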
--------------------------------------
Let's have a closer look:
We did a regular shutdown at 17:09, and then we did a reboot at 19:24.
Here is the content of the staging area (where the logs are
copied to and accumulated until it is worth running a backup job):
# dir arch/
total 240
drwx------ 2 postgres postgres 5 Jun 8 17:09 .
drwxr-xr-x 6 postgres postgres 7 Jun 8 17:09 ..
-rw------- 1 postgres postgres 16777216 Jun 8 09:38 0000000100000017000000FC.ok
-rw------- 1 postgres postgres 16777216 Jun 8 10:48 0000000100000017000000FD.ok
-rw------- 1 postgres postgres 16777216 Jun 8 17:09 0000000100000017000000FE.ok
And here is the pg_wal directory:
# dir data12/pg_wal/
total 89256
drwx------ 3 postgres postgres 10 Jun 8 19:28 .
drwx------ 19 postgres postgres 23 Jun 8 19:28 ..
-rw------- 1 postgres postgres 335 Jun 7 07:36 0000000100000017000000EF.00000060.backup
-rw------- 1 postgres postgres 16777216 Jun 8 19:38 000000010000001800000000
-rw------- 1 postgres postgres 16777216 Jun 7 07:17 000000010000001800000001
-rw------- 1 postgres postgres 16777216 Jun 7 07:17 000000010000001800000002
-rw------- 1 postgres postgres 16777216 Jun 7 07:17 000000010000001800000003
-rw------- 1 postgres postgres 16777216 Jun 7 07:17 000000010000001800000004
-rw------- 1 postgres postgres 16777216 Jun 7 07:36 000000010000001800000005
drwx------ 2 postgres postgres 3 Jun 8 17:09 archive_status
# dir data12/pg_wal/archive_status
total 23
drwx------ 2 postgres postgres 3 Jun 8 17:09 .
drwx------ 3 postgres postgres 10 Jun 8 19:28 ..
-rw------- 1 postgres postgres 0 Jun 7 07:36 0000000100000017000000EF.00000060.backup.done
Now where the hell is my "FF" log ???
Let's check syslog - this was the full shutdown at 17:09:
Jun 8 17:09:38 <local0.info> admn pg-bck[73534]: [10-1] :[] LOG: 00000: received fast shutdown request
Jun 8 17:09:38 <local0.info> admn pg-bck[73534]: [10-2] :[] LOCATION: pmdie, postmaster.c:2780
Jun 8 17:09:38 <local0.info> admn pg-bck[73534]: [11-1] :[] LOG: 00000: aborting any active transactions
Jun 8 17:09:38 <local0.info> admn pg-bck[73534]: [11-2] :[] LOCATION: pmdie, postmaster.c:2813
Jun 8 17:09:38 <local0.debug> admn pg-bck[73549]: [10-1] :[] DEBUG: 00000: logical replication launcher shutting down
Jun 8 17:09:38 <local0.debug> admn pg-bck[73549]: [10-2] :[] LOCATION: ProcessInterrupts, postgres.c:2981
Jun 8 17:09:38 <local0.info> admn pg-bck[73534]: [12-1] :[] LOG: 00000: background worker "logical replication launcher" (PID 73549) exited with exit code 1
Jun 8 17:09:38 <local0.info> admn pg-bck[73534]: [12-2] :[] LOCATION: LogChildExit, postmaster.c:3657
Jun 8 17:09:38 <local0.info> admn pg-bck[73544]: [13-1] :[] LOG: 00000: shutting down
Jun 8 17:09:38 <local0.info> admn pg-bck[73544]: [13-2] :[] LOCATION: ShutdownXLOG, xlog.c:8321
Jun 8 17:09:45 <local0.info> admn pg-bck[82223]: RedoLog.bck invoked pg_wal/0000000100000017000000FE 0000000100000017000000FE
Jun 8 17:09:45 <local0.info> admn pg-bck[82223]: RedoLog.bck pg_wal/0000000100000017000000FE 0000000100000017000000FE returns 0
Jun 8 17:09:45 <local0.debug> admn pg-bck[73547]: [8-1] :[] DEBUG: 00000: archived write-ahead log file "0000000100000017000000FE"
Jun 8 17:09:45 <local0.debug> admn pg-bck[73547]: [8-2] :[] LOCATION: pgarch_archiveXlog, pgarch.c:675
Jun 8 17:09:55 <local0.info> admn pg-bck[73544]: [14-1] :[] LOG: 00000: checkpoint starting: shutdown immediate
Jun 8 17:09:55 <local0.info> admn pg-bck[73544]: [14-2] :[] LOCATION: LogCheckpointStart, xlog.c:8362
Jun 8 17:09:55 <local0.debug> admn pg-bck[73544]: [15-1] :[] DEBUG: 00000: performing replication slot checkpoint
Jun 8 17:09:55 <local0.debug> admn pg-bck[73544]: [15-2] :[] LOCATION: CheckPointReplicationSlots, slot.c:1078
Jun 8 17:09:55 <local0.info> admn pg-bck[73544]: [16-1] :[] LOG: 00000: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=10.131 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16383 kB, estimate=16383 kB
Jun 8 17:09:55 <local0.info> admn pg-bck[73544]: [16-2] :[] LOCATION: LogCheckpointEnd, xlog.c:8435
Jun 8 17:09:55 <local0.info> admn pg-bck[73534]: [13-1] :[] LOG: 00000: database system is shut down
Jun 8 17:09:55 <local0.info> admn pg-bck[73534]: [13-2] :[] LOCATION: UnlinkLockFiles, miscinit.c:859
So, "FE" was copied out okay, and we have it in the staging area.
And here is the relevant stuff from the following startup - here we
consequently are within the "FF" log:
Jun 8 17:15:38 <local0.info> admn pg-bck[6366]: [8-1] :[] LOG: 00000: database system was shut down at 2020-06-08 17:09:55 CEST
Jun 8 17:15:38 <local0.info> admn pg-bck[6366]: [8-2] :[] LOCATION: StartupXLOG, xlog.c:6242
Jun 8 17:15:38 <local0.debug> admn pg-bck[6366]: [9-1] :[] DEBUG: 00000: checkpoint record is at 17/FF000024
Jun 8 17:15:38 <local0.debug> admn pg-bck[6366]: [9-2] :[] LOCATION: StartupXLOG, xlog.c:6532
Jun 8 17:15:38 <local0.debug> admn pg-bck[6366]: [10-1] :[] DEBUG: 00000: redo record is at 17/FF000024; shutdown true
----------------------------------------
From the fast reboot @ 19:24 nothing at all is logged.
At the following startup, we see that we are still within the "FF"
log:
Jun 8 19:28:24 <local0.debug> admn pg-bck[6465]: [1-1] :[] DEBUG: 00000: registering background worker "logical replication launcher"
Jun 8 19:28:24 <local0.debug> admn pg-bck[6465]: [1-2] :[] LOCATION: RegisterBackgroundWorker, bgworker.c:854
Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [2-1] :[] LOG: 00000: starting PostgreSQL 12.2 on i386-portbld-freebsd11.3, compiled by gcc9 (FreeBSD Ports Collection) 9.3.0, 32-bit
Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [2-2] :[] LOCATION: PostmasterMain, postmaster.c:997
Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [3-1] :[] LOG: 00000: listening on IPv4 address "0.0.0.0", port 5433
Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [3-2] :[] LOCATION: StreamServerPort, pqcomm.c:590
Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [4-1] :[] LOG: 00000: listening on Unix socket "/tmp/.s.PGSQL.5433"
Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [4-2] :[] LOCATION: StreamServerPort, pqcomm.c:585
Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [5-1] :[] LOG: 00000: ending log output to stderr
Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [5-2] :[] HINT: Future log output will go to log destination "syslog".
Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [5-3] :[] LOCATION: PostmasterMain, postmaster.c:1297
Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [6-1] :[] LOG: XX000: could not send test message on socket for statistics collector: Permission denied
Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [6-2] :[] LOCATION: pgstat_init, pgstat.c:486
Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [7-1] :[] LOG: 00000: trying another address for the statistics collector
Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [7-2] :[] LOCATION: pgstat_init, pgstat.c:418
Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [8-1] :[] LOG: 00000: database system was interrupted; last known up at 2020-06-08 17:25:38 CEST
Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [8-2] :[] LOCATION: StartupXLOG, xlog.c:6267
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [9-1] :[] DEBUG: 00000: checkpoint record is at 17/FF01BFE8
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [9-2] :[] LOCATION: StartupXLOG, xlog.c:6532
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [10-1] :[] DEBUG: 00000: redo record is at 17/FF01BFB4; shutdown false
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [10-2] :[] LOCATION: StartupXLOG, xlog.c:6609
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [11-1] :[] DEBUG: 00000: next transaction ID: 18955154; next OID: 145913
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [11-2] :[] LOCATION: StartupXLOG, xlog.c:6613
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [12-1] :[] DEBUG: 00000: next MultiXactId: 1; next MultiXactOffset: 0
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [12-2] :[] LOCATION: StartupXLOG, xlog.c:6617
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [13-1] :[] DEBUG: 00000: oldest unfrozen transaction ID: 479, in database 13777
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [13-2] :[] LOCATION: StartupXLOG, xlog.c:6620
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [14-1] :[] DEBUG: 00000: oldest MultiXactId: 1, in database 1
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [14-2] :[] LOCATION: StartupXLOG, xlog.c:6623
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [15-1] :[] DEBUG: 00000: commit timestamp Xid oldest/newest: 0/0
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [15-2] :[] LOCATION: StartupXLOG, xlog.c:6626
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [16-1] :[] DEBUG: 00000: transaction ID wrap limit is 2147484126, limited by database with OID 13777
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [16-2] :[] LOCATION: SetTransactionIdLimit, varsup.c:408
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [17-1] :[] DEBUG: 00000: MultiXactId wrap limit is 2147483648, limited by database with OID 1
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [17-2] :[] LOCATION: SetMultiXactIdLimit, multixact.c:2267
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [18-1] :[] DEBUG: 00000: starting up replication slots
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [18-2] :[] LOCATION: StartupReplicationSlots, slot.c:1114
Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [19-1] :[] LOG: 00000: database system was not properly shut down; automatic recovery in progress
Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [19-2] :[] LOCATION: StartupXLOG, xlog.c:6764
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [20-1] :[] DEBUG: 00000: resetting unlogged relations: cleanup 1 init 0
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [20-2] :[] LOCATION: ResetUnloggedRelations, reinit.c:55
Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [21-1] :[] LOG: 00000: redo starts at 17/FF01BFB4
Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [21-2] :[] LOCATION: StartupXLOG, xlog.c:7035
Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [22-1] :[] LOG: 00000: redo done at 17/FF01C098
Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [22-2] :[] LOCATION: StartupXLOG, xlog.c:7297
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [23-1] :[] DEBUG: 00000: resetting unlogged relations: cleanup 0 init 1
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [23-2] :[] LOCATION: ResetUnloggedRelations, reinit.c:55
Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [24-1] :[] LOG: 00000: checkpoint starting: end-of-recovery immediate
Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [24-2] :[] LOCATION: LogCheckpointStart, xlog.c:8362
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [25-1] :[] DEBUG: 00000: performing replication slot checkpoint
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [25-2] :[] LOCATION: CheckPointReplicationSlots, slot.c:1078
Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [26-1] :[] LOG: 00000: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.007 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16272 kB, estimate=16272 kB
Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [26-2] :[] LOCATION: LogCheckpointEnd, xlog.c:8435
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [27-1] :[] DEBUG: 00000: MultiXactId wrap limit is 2147483648, limited by database with OID 1
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [27-2] :[] LOCATION: SetMultiXactIdLimit, multixact.c:2267
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [28-1] :[] DEBUG: 00000: MultiXact member stop limit is now 4294914944 based on MultiXact 1
Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [28-2] :[] LOCATION: SetOffsetVacuumLimit, multixact.c:2630
There is no further information about anything concerning the logs,
but at this point in time "FF" has disappeared. It was NOT copied out,
we can see the timestamp on the pg_wal/archive_status still being at
17:09.
Nothing except postgres is supposed to write/delete anything within
the data tree, and the RedoLog.bck script doesn't delete anything at all.
Another cute question would be: the "FE" log was successfully copied
out at 17:09:45. The checkpoint at 17:09:55 then says "1 removed" -
this should be the "FE", because the one before was written seven
hours earlier (see the "ls" above) and should be long gone.
But then the checkpoint at 19:28:24 again says "1 removed". What was
removed there? It is unlikely to be the "FE", and the "FF" would be
currently in use - and the archive_status directory was not written
since 17:09. But the "FF" has disappeared. So what is going on here?
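Given the listings above, a gap like the missing "FF" can also be
detected mechanically by mapping segment names to a linear number
(log id * 256 + segment id, for the default 16 MB segment size). This
is a sketch, not part of the original setup:

```shell
#!/bin/sh
# seg_no: map a 24-hex-digit WAL segment name to a linear number
# (log id * 256 + segment id; valid for the default 16 MB segment size).
seg_no() {
  log=$(printf '%d' "0x$(echo "$1" | cut -c9-16)")
  seg=$(printf '%d' "0x$(echo "$1" | cut -c17-24)")
  echo $((log * 256 + seg))
}

# check_gaps: read sorted segment names on stdin, report any holes.
check_gaps() {
  prev=""
  while read -r f; do
    if [ -n "$prev" ]; then
      if [ $(( $(seg_no "$f") - $(seg_no "$prev") )) -gt 1 ]; then
        echo "gap after $prev (next is $f)"
      fi
    fi
    prev="$f"
  done
}

# The two segments around the hole in this thread: FE was archived,
# FF vanished, and 18/00 is the next one present in pg_wal.
printf '%s\n' 0000000100000017000000FE 000000010000001800000000 | check_gaps
# prints: gap after 0000000100000017000000FE (next is 000000010000001800000000)
```

Running something of this shape over the staging area after each backup
cycle would have flagged the loss immediately.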
cheerio,
PMc
On 6/8/20 5:02 PM, Peter wrote:
> I am doing RedoLog Archiving according to Docs Chapter 25.1.
There is no ReDo logging, there is WAL logging.
Which docs? Section 25.1 in the Postgres docs is:
https://www.postgresql.org/docs/12/backup-dump.html
25.1. SQL Dump
> What precisely happens (according to the OS sources) during reboot
> is this: processes will be sent SIGTERM, and after some 2-10
> seconds followed by SIGKILL.
https://www.postgresql.org/docs/12/server-shutdown.html
"
Important
It is best not to use SIGKILL to shut down the server. Doing so will
prevent the server from releasing shared memory and semaphores.
Furthermore, SIGKILL kills the postgres process without letting it relay
the signal to its subprocesses, so it might be necessary to kill the
individual subprocesses by hand as well.
To terminate an individual session while allowing other sessions to
continue, use pg_terminate_backend() (see Table 9.83) or send a SIGTERM
signal to the child process associated with the session."
What is RedoLog.bck?
--
Adrian Klaver
adrian.klaver@aklaver.com
On Mon, Jun 8, 2020 at 5:17 PM Peter <pmc@citylink.dinoex.sub.org> wrote:
> Loosing a RedoLog is very bad, because there is no redundancy,
> loosing a single one of them makes the timeline disappear and it
> will only reappear after another Base Backup. Very very bad.
> In this case, it seems, Postgres will delete the current log
> without archiving it. :(
I strongly suspect that you were hit by the bug fixed in commit
4e87c483. You should upgrade to Postgres 12.3 ASAP, to get that fix:
"Avoid premature recycling of WAL segments during crash recovery
(Jehan-Guillaume de Rorthais)
WAL segments that become ready to be archived during crash recovery
were potentially recycled without being archived."
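In case it helps anyone auditing their own installations: the presence
of the fix can be judged from the version string alone, since it shipped
with 12.3. A small sketch, assuming a plain 12.x version string:

```shell
#!/bin/sh
# has_recycle_fix: decide from a 12.x version string whether the server
# carries the fix from commit 4e87c483 (released with 12.3).
# Pure string check; no server connection needed.
has_recycle_fix() {
  case "$1" in
    12.*) minor=${1#12.}
          [ "${minor%%[!0-9]*}" -ge 3 ] ;;
    *)    return 1 ;;  # other major versions are out of scope here
  esac
}

# Against a live server this could be fed (port 5433 as in the logs):
#   has_recycle_fix "$(psql -p 5433 -At -c 'show server_version')"
has_recycle_fix "12.2" && echo "12.2: fixed" || echo "12.2: affected"
has_recycle_fix "12.3" && echo "12.3: fixed" || echo "12.3: affected"
```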
Sorry that you were affected by this bug -- it really sucks.
--
Peter Geoghegan
On Mon, Jun 08, 2020 at 05:40:20PM -0700, Peter Geoghegan wrote:
!
! I strongly suspect that you were hit by the bug fixed in commit
! 4e87c483. You should upgrade to Postgres 12.3 ASAP, to get that fix:
!
! "Avoid premature recycling of WAL segments during crash recovery
! (Jehan-Guillaume de Rorthais)
!
! WAL segments that become ready to be archived during crash recovery
! were potentially recycled without being archived."
Ahh, thank You so much; this is good news: if it is an already known
bug, I can close all efforts, remove the debug-switches again, stuff
the sources back into their box and relax. ;)
I did check whether I would find something about lost archive logs, but
didn't - and then it is always possible that it's just a mistake of
mine, as I know I do make mistakes.
And now for the nitpicking part :)
On Mon, Jun 08, 2020 at 05:35:40PM -0700, Adrian Klaver wrote:
! > I am doing RedoLog Archiving according to Docs Chapter 25.1.
!
! There is no ReDo logging, there is WAL logging.
Yes I know - and I don't care. Technically they're no longer WAL when
they're going to be archived. Their purpose then becomes to redo the
transactions, and even the messages say so:
! > LOG: 00000: redo done at 17/FF01C098
! What docs, because section 25.1 in the Postgres docs is :
!
! https://www.postgresql.org/docs/12/backup-dump.html
!
! 25.1. SQL Dump
Oops, mistake of mine. It should be 25.3.1.
! https://www.postgresql.org/docs/12/server-shutdown.html
! "
! Important
!
! It is best not to use SIGKILL to shut down the server. Doing so will prevent
! the server from releasing shared memory and semaphores. Furthermore, SIGKILL
! kills the postgres process without letting it relay the signal to its
! subprocesses, so it might be necessary to kill the individual subprocesses
! by hand as well.
And which of these would be of any concern if the machine is rebooted
anyway?
I had to install new hardware, and currently I'm trying to identify
a memory exhaustion issue. This makes it necessary to reboot the full
system quite often, and I neither want to wait for orderly termination
of dozens of subsystems, nor do I want to need fsck at restart. This
would make SIGKILL/reboot the method of choice.
! What is RedoLog.bck?
The script which one has to write according to the docs' section which
would be correctly numbered 25.3.1.
cheerio,
PMc
On 6/8/20 6:38 PM, Peter wrote:
> On Mon, Jun 08, 2020 at 05:40:20PM -0700, Peter Geoghegan wrote:
> And now for the nitpicking part :)
> On Mon, Jun 08, 2020 at 05:35:40PM -0700, Adrian Klaver wrote:
> ! > I am doing RedoLog Archiving according to Docs Chapter 25.1.
> !
> ! There is no ReDo logging, there is WAL logging.
> Yes I know - and I don't care. Technically they're no longer WAL when
> they're going to be archived. Their purpose then becomes to redo the
Not according to the section you are referring to:
25.3.1. Setting Up WAL Archiving
"Redoing is the process of replaying the WAL logs."
> Oops, my mistake. Should be 25.3.1.
> And which of these would be of any concern if the machine is rebooted
> anyway?
> I had to install new hardware, and currently I'm trying to identify
> a memory exhaustion issue. This makes it necessary to reboot the full
> system quite often, and I neither want to wait for orderly termination
> of dozens of subsystems, nor do I want to need fsck at restart. This
> would make SIGKILL/reboot the method of choice.
That is your prerogative, just not sure it is conducive to the health of
your system.
> ! What is RedoLog.bck?
> The script which one has to write according to the docs' section which
> would be correctly numbered 25.3.1.
> cheerio,
> PMc
--
Adrian Klaver
adrian.klaver@aklaver.com
Actually, the affair had some good side: as usual I was checking
my own designs first and looking for flaws, and indeed I found one:
If you do copy out the archive logs not directly to tape, but to
some disk area for further processing, then there is an issue with
possible loss. If you do it like the docs say, with a command like
this:
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'  # Unix
That "cp" is usually not synchronous. So there is the possibility
that this command terminates successfully, and reports exitcode zero
back to the Postgres, and then the Postgres will consider that log
being safely away.
But the target of the copy may not yet have been written to disk. If
at that point a power loss happens, the log may become missing/damaged/
incomplete, while the database may or may not consider it done
when restarting.
Therefore, mounting such a target filesystem in all-synchronous mode
might be a good idea. (UFS: "-o sync", ZFS: "set sync=always")
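To make the idea concrete, here is a minimal sketch of my own (not from the docs, names illustrative) of a copy step that does not report success until the data has been pushed out; the sync(8) call is a blunt instrument that flushes all dirty buffers, but it is what plain shell offers portably:

```shell
# archive_wal SRC DSTDIR FNAME -- sketch of a crash-safer archive step:
# refuse to overwrite, copy to a temporary name, flush, then rename
# into place so a half-written file never carries the final name.
archive_wal() {
    src="$1"; dst="$2/$3"
    test ! -f "$dst" || return 1        # never overwrite an archived segment
    cp "$src" "$dst.tmp" || return 1
    sync                                # flush buffers before claiming success
    mv "$dst.tmp" "$dst" || return 1
    sync                                # flush the directory entry as well
}
```

A wrapper script around this could then be named from postgresql.conf, e.g. archive_command = '/usr/local/bin/archive-wal.sh %p %f' (script name hypothetical).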
cheerio,
PMc
On 6/8/20 7:33 PM, Peter wrote:
> Actually, the affair had some good side: as usual I was checking
> my own designs first and looking for flaws, and indeed I found one:
> If you do copy out the archive logs not directly to tape, but to
> some disk area for further processing, then there is an issue with
> possible loss. If you do it like the docs say, with a command like
> this:
> archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'  # Unix
> That "cp" is usually not synchronous. So there is the possibility
> that this command terminates successfully, and reports exitcode zero
> back to the Postgres, and then the Postgres will consider that log
> being safely away.
Which is why just following the above command in the docs is:
"(This is an example, not a recommendation, and might not work on all
platforms.) "
Generally for peace of mind folks use third party tools like:
pg_backrest(https://pgbackrest.org/),
pg_probackup(https://postgrespro.com/products/extensions/pg_probackup)
or Barman(https://www.pgbarman.org/).
as they offer safety checks for your backups.
I use pg_backrest, but it does not look promising for running on BSD:
https://fluca1978.github.io/2019/03/04/pgbackrest_FreeBSD.html
Not sure about pg_probackup.
Barman is a Python package:
http://docs.pgbarman.org/release/2.10/#installation-from-sources
> But the target of the copy may not yet have been written to disk. If
> at that point a power loss happens, the log may become missing/damaged/
> incomplete, while the database may or may not consider it done
> when restarting.
> Therefore, mounting such a target filesystem in all-synchronous mode
> might be a good idea. (UFS: "-o sync", ZFS: "set sync=always")
> cheerio,
> PMc
--
Adrian Klaver
adrian.klaver@aklaver.com
Adrian Klaver <adrian.klaver@aklaver.com> writes:
> On 6/8/20 7:33 PM, Peter wrote:
>> That "cp" is usually not synchronous. So there is the possibility
>> that this command terminates successfully, and reports exitcode zero
>> back to the Postgres, and then the Postgres will consider that log
>> being safely away.
> Which is why just following the above command in the docs is:
> "(This is an example, not a recommendation, and might not work on all
> platforms.) "
Yeah. There have been discussions about changing that disclaimer to be
more forceful, because in point of fact a plain "cp" is generally not safe
enough. You need to fsync the written file, and on many filesystems you
also have to fsync the directory it's in.
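One way the file-level fsync could be expressed in a single archive_command, sketched here as an illustration only (the archivedir path is from the docs' example; dd's conv=fsync is supported by both FreeBSD and GNU dd):

```shell
# postgresql.conf fragment - a sketch, not a recommendation.
# conv=fsync makes dd fsync the written file before exiting, so a zero
# exit status means the segment itself reached stable storage; the
# directory entry is still not fsynced, which a plain one-liner
# cannot express, so this only narrows the window Tom describes.
archive_command = 'test ! -f /mnt/server/archivedir/%f && dd if=%p of=/mnt/server/archivedir/%f bs=1m conv=fsync'
```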
Generally for peace of mind folks use third party tools like:
+1. Rolling your own archive script is seldom advisable.
regards, tom lane
Greetings,
* Adrian Klaver (adrian.klaver@aklaver.com) wrote:
I use pg_backrest, but it does not look promising for running on BSD:
https://fluca1978.github.io/2019/03/04/pgbackrest_FreeBSD.html
That's an unfortunately ancient post, really, considering that
pgbackrest has now been fully rewritten into C, and Luca as recently as
September 2019 was saying he has it working on FreeBSD.
If folks do run into issues with pgbackrest on FreeBSD, please let us
know.
Thanks,
Stephen
On 6/9/20 4:15 AM, Stephen Frost wrote:
> Greetings,
> * Adrian Klaver (adrian.klaver@aklaver.com) wrote:
> > I use pg_backrest, but it does not look promising for running on BSD:
> > https://fluca1978.github.io/2019/03/04/pgbackrest_FreeBSD.html
> That's an unfortunately ancient post, really, considering that
> pgbackrest has now been fully rewritten into C, and Luca as recently as
> September 2019 was saying he has it working on FreeBSD.
Yeah, but this:
https://github.com/pgbackrest/pgbackrest/issues/686
is not clear on whether a user can do that w/o a certain amount of hand
holding.
> If folks do run into issues with pgbackrest on FreeBSD, please let us
> know.
> Thanks,
> Stephen
--
Adrian Klaver
adrian.klaver@aklaver.com
Greetings,
* Adrian Klaver (adrian.klaver@aklaver.com) wrote:
> On 6/9/20 4:15 AM, Stephen Frost wrote:
> > * Adrian Klaver (adrian.klaver@aklaver.com) wrote:
> > > I use pg_backrest, but it does not look promising for running on BSD:
> > > https://fluca1978.github.io/2019/03/04/pgbackrest_FreeBSD.html
> > That's an unfortunately ancient post, really, considering that
> > pgbackrest has now been fully rewritten into C, and Luca as recently as
> > September 2019 was saying he has it working on FreeBSD.
> Yeah, but this:
> https://github.com/pgbackrest/pgbackrest/issues/686
> is not clear on whether a user can do that w/o a certain amount of hand
> holding.
I've asked Luca to update his blog post and/or re-test on FreeBSD and
he's said he would. We've moved to using autoconf and friends, and it's
all in C now, so it really shouldn't be as much of an issue these days.
I recall someone else building on FreeBSD not long ago, but not finding
a reference to it offhand.
> If folks do run into issues with pgbackrest on FreeBSD, please let us
> know.
... again, this.
Thanks,
Stephen
On Mon, Jun 08, 2020 at 09:21:47PM -0700, Adrian Klaver wrote:
!
! On 6/8/20 7:33 PM, Peter wrote:
! >
! > Actually, the affair had some good side: as usual I was checking
! > my own designs first and looking for flaws, and indeed I found one:
! > If you do copy out the archive logs not directly to tape, but to
! > some disk area for further processing, then there is an issue with
! > possible loss. If you do it like the docs say, with a command like
! > this:
! > archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p
! > +/mnt/server/archivedir/%f' # Unix
! > That "cp" is usually not synchronous. So there is the possibility
! > that this command terminates successfully, and reports exitcode zero
! > back to the Postgres, and then the Postgres will consider that log
! > being safely away.
!
! Which is why just following the above command in the docs is:
!
! "(This is an example, not a recommendation, and might not work on all
! platforms.) "
So, what You are basically saying is: my worries are justified and
correctly founded, and this is indeed a matter that needs to be taken
care of.
Thank You.
! Generally for peace of mind folks use third party tools like:
!
! pg_backrest(https://pgbackrest.org/),
! pg_probackup(https://postgrespro.com/products/extensions/pg_probackup) or
! Barman(https://www.pgbarman.org/).
Hmja. We may on occasion have a look into these...
! I use pg_backrest, but it does not look promising for running on BSD:
! https://fluca1978.github.io/2019/03/04/pgbackrest_FreeBSD.html
That looks mostly like the usual things which can be fixed.
Now, for the facts: I am already using a professional backup
solution. (It is actually a "dual-use/commercial" solution, of the
kind which you can either fetch from github and use without support,
or buy with a contract or whatever and get support.)
With this professional backup solution I have already identified 28
(spell: twenty-eight) bugs, and fixed/worked around these, until I got it
properly working.
This professional backup solution also offers support for postgres.
Sadly, it only covers postgres up to Rel.9, and that piece of software
wasn't touched in the last 6 or 7 years.
But the bigger issue is that this backup solution needs its
own postgres database as its backend - and it cannot back up the
database it is using. Looks quite pointless to me, then.
So I just did it all with shell (and it wasn't many lines).
So now, as I've been thru identifying and handling all the issues in
that one backup solution, and since it is supposed to handle *all*
backup demands (and not only postgres), I will certainly not start
and go thru the same process again with one of these supposed
solutions, where 90% of the code tries to solve the same things
redundantly again, but then only for PG.
Actually, I am getting very tired of reading that something which can
easily be done within 20 lines of shell scripting, would need special
"solutions", solutions that need to be compiled, solutions that would
bring along their own fashion of interpreter, solutions that have a
lot of their own dependencies and introduce mainly one thing: new bugs.
Does nobody know anymore how to do proper systems management
scripting? Using just the basic system tools which have proven to
work for more than 50 years now!?
! Not sure about pg_probackup.
Okay, I had a -very short- look into these. Just scanning the
introductory pages.
The only really interesting thing there is the pg_probackup. These
folks seem to have found a way to do row-level incremental backups.
And pgbarman seems to have an impressive understanding of ITIL (in
case anybody bothers about that).
All these tools do only cover PG, but do that in any possible regards.
This is fine as long as you do not run any computers, and the only
application you are using is Postgres.
But, if you have other applications as well, or have computers, then
you will need a different backup solution, something that will cover
your site-wide backup demands, in a consistent fashion (think
something in the style of ADSM, or nowadays called Spectrum Protect).
And then 90% of the things offered here become superfluous, because
they are already handled site-wide. And then you will have to
consider integration of both pieces - and that will most likely be
more work and more error-prone than just writing a few adapters in
shell.
cheerio,
PMc
On Tue, Jun 09, 2020 at 01:27:20AM -0400, Tom Lane wrote:
! Adrian Klaver <adrian.klaver@aklaver.com> writes:
! > On 6/8/20 7:33 PM, Peter wrote:
! >> That "cp" is usually not synchronous. So there is the possibility
! >> that this command terminates successfully, and reports exitcode zero
! >> back to the Postgres, and then the Postgres will consider that log
! >> being safely away.
!
! > Which is why just following the above command in the docs is:
! > "(This is an example, not a recommendation, and might not work on all
! > platforms.) "
!
! Yeah. There have been discussions about changing that disclaimer to be
! more forceful, because in point of fact a plain "cp" is generally not safe
! enough. You need to fsync the written file, and on many filesystems you
! also have to fsync the directory it's in.
It certainly does not need to be "more forceful" - because this is not
about behavioural education, like training dogs, horses, or monkeys,
and neither do we entertain a BDSM studio.
What it needs instead is mention of the magic word "fsync". Because,
we already know that - we just need a reminder at the proper place.
Systems integrators are professional people. They are not in need of
more beating (spell: forceful education), only of substantial
technical hints and information.
! > Generally for peace of mind folks use third party tools like:
!
! +1. Rolling your own archive script is seldom advisable.
Well then, using commercial solutions brings its own problems. E.g.,
the people I happened to work for often had problems with torsion,
which happens when the solution gets longer than, say, twenty meters,
and these are walked at high speeds.
They didn't have a problem with scripting - rather the opposite, they
were happy with it and paid good money for.
cheerio,
PMc
On 6/9/20 10:55 AM, Peter wrote:
> On Mon, Jun 08, 2020 at 09:21:47PM -0700, Adrian Klaver wrote:
> !
> ! On 6/8/20 7:33 PM, Peter wrote:
> ! >
> ! > Actually, the affair had some good side: as usual I was checking
> ! > my own designs first and looking for flaws, and indeed I found one:
> ! > If you do copy out the archive logs not directly to tape, but to
> ! > some disk area for further processing, then there is an issue with
> ! > possible loss. If you do it like the docs say, with a command like
> ! > this:
> ! > archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'  # Unix
> ! > That "cp" is usually not synchronous. So there is the possibility
> ! > that this command terminates successfully, and reports exitcode zero
> ! > back to the Postgres, and then the Postgres will consider that log
> ! > being safely away.
> !
> ! Which is why just following the above command in the docs is:
> !
> ! "(This is an example, not a recommendation, and might not work on all
> ! platforms.) "
> So, what You are basically saying is: my worries are justified and
> correctly founded, and this is indeed a matter that needs to be taken
> care of.
> Thank You.
> ! Generally for peace of mind folks use third party tools like:
> !
> ! pg_backrest(https://pgbackrest.org/),
> ! pg_probackup(https://postgrespro.com/products/extensions/pg_probackup) or
> ! Barman(https://www.pgbarman.org/).
> Hmja. We may on occasion have a look into these...
> ! I use pg_backrest, but it does not look promising for running on BSD:
> ! https://fluca1978.github.io/2019/03/04/pgbackrest_FreeBSD.html
> That looks mostly like the usual things which can be fixed.
> Now, for the facts: I am already using a professional backup
> solution. (It is actually a "dual-use/commercial" solution, of the
> kind which you can either fetch from github and use without support,
> or buy with a contract or whatever and get support.)
> With this professional backup solution I have already identified 28
> (spell: twenty-eight) bugs, and fixed/worked around these, until I got it
> properly working.
> This professional backup solution also offers support for postgres.
> Sadly, it only covers postgres up to Rel.9, and that piece of software
> wasn't touched in the last 6 or 7 years.
> But the bigger issue is that this backup solution needs its
> own postgres database as its backend - and it cannot back up the
> database it is using. Looks quite pointless to me, then.
> So I just did it all with shell (and it wasn't many lines).
The backup solution is?
> So now, as I've been thru identifying and handling all the issues in
> that one backup solution, and since it is supposed to handle *all*
> backup demands (and not only postgres), I will certainly not start
> and go thru the same process again with one of these supposed
> solutions, where 90% of the code tries to solve the same things
> redundantly again, but then only for PG.
They are not supposed. They are in use by many people/organizations
across a wide variety of installations.
> Actually, I am getting very tired of reading that something which can
> easily be done within 20 lines of shell scripting, would need special
> "solutions", solutions that need to be compiled, solutions that would
> bring along their own fashion of interpreter, solutions that have a
> lot of their own dependencies and introduce mainly one thing: new bugs.
They were developed as they could not be done in 20 lines of shell
scripting and work at a reliable level.
Fine rant below. Go forth and work your wonders.
> Does nobody know anymore how to do proper systems management
> scripting? Using just the basic system tools which have proven to
> work for more than 50 years now!?
> ! Not sure about pg_probackup.
> Okay, I had a -very short- look into these. Just scanning the
> introductory pages.
> The only really interesting thing there is the pg_probackup. These
> folks seem to have found a way to do row-level incremental backups.
> And pgbarman seems to have an impressive understanding of ITIL (in
> case anybody bothers about that).
> All these tools do only cover PG, but do that in any possible regards.
> This is fine as long as you do not run any computers, and the only
> application you are using is Postgres.
> But, if you have other applications as well, or have computers, then
> you will need a different backup solution, something that will cover
> your site-wide backup demands, in a consistent fashion (think
> something in the style of ADSM, or nowadays called Spectrum Protect).
> And then 90% of the things offered here become superfluous, because
> they are already handled site-wide. And then you will have to
> consider integration of both pieces - and that will most likely be
> more work and more error-prone than just writing a few adapters in
> shell.
> cheerio,
> PMc
--
Adrian Klaver
adrian.klaver@aklaver.com
Greetings,
* Peter (pmc@citylink.dinoex.sub.org) wrote:
> This professional backup solution also offers support for postgres.
> Sadly, it only covers postgres up to Rel.9, and that piece of software
> wasn't touched in the last 6 or 7 years.
Then it certainly doesn't work with the changes in v12, and probably has
other issues, as you allude to.
> Actually, I am getting very tired of reading that something which can
> easily be done within 20 lines of shell scripting, would need special
This is just simply false- you can't do it properly in 20 lines of shell
scripting. Sure, you can write something that has probably next to no
error checking, uses the deprecated API that'll cause your systems to
fail to start if you ever happen to have a reboot during a backup, and
has no way to provide verification that the backup was at all successful
after the fact, but that's not what I'd consider a proper solution-
instead it's one that'll end up causing you a lot of pain down the road.
Even the shell-script based solution (which I've never used and
personally wouldn't really recommend, but to each their own) called
'pitrery' (available here: https://github.com/dalibo/pitrery) is
thousands of lines of code.
> Does nobody know anymore how to do proper systems management
> scripting? Using just the basic system tools which have proven to
> work for more than 50 years now!?
I've not met anything I'd call 'proper systems management scripting'
that's 20 lines of code, shell script or not.
> ! Not sure about pg_probackup.
> Okay, I had a -very short- look into these. Just scanning the
> introductory pages.
> The only really interesting thing there is the pg_probackup. These
> folks seem to have found a way to do row-level incremental backups.
pg_probackup doesn't do row-level incremental backups, unless I've
missed some pretty serious change in its development, but it does
provide page-level, with, as I recall, an extension that didn't get
good reception when it was posted and discussed on these mailing lists
by other PG hackers. I don't know if those concerns about it have been
addressed or not, you might ask the pg_probackup folks if you're
considering it as a solution.
> This is fine as long as you do not run any computers, and the only
> application you are using is Postgres.
> But, if you have other applications as well, or have computers, then
> you will need a different backup solution, something that will cover
> your site-wide backup demands, in a consistent fashion (think
> something in the style of ADSM, or nowadays called Spectrum Protect).
> And then 90% of the things offered here become superfluous, because
> they are already handled site-wide. And then you will have to
> consider integration of both pieces - and that will most likely be
> more work and more error-prone than just writing a few adapters in
> shell.
pgbackrest's repo can be safely backed up using the simple file-based
backup utilities that you're referring to here. I suspect some of the
other solution's backups also could be, but you'd probably want to make
sure.
PG generally isn't something that can be backed up using the simple file
based backup solutions, as you might appreciate from just considering
the number of tools written to specifically deal with the complexity of
backing up an online PG cluster.
Thanks,
Stephen
On 6/9/20 12:02 PM, Peter wrote:
> On Tue, Jun 09, 2020 at 01:27:20AM -0400, Tom Lane wrote:
> ! Adrian Klaver <adrian.klaver@aklaver.com> writes:
> ! > On 6/8/20 7:33 PM, Peter wrote:
> ! >> That "cp" is usually not synchronous. So there is the possibility
> ! >> that this command terminates successfully, and reports exitcode zero
> ! >> back to the Postgres, and then the Postgres will consider that log
> ! >> being safely away.
> !
> ! > Which is why just following the above command in the docs is:
> ! > "(This is an example, not a recommendation, and might not work on all
> ! > platforms.) "
> !
> ! Yeah. There have been discussions about changing that disclaimer to be
> ! more forceful, because in point of fact a plain "cp" is generally not safe
> ! enough. You need to fsync the written file, and on many filesystems you
> ! also have to fsync the directory it's in.
> It certainly does not need to be "more forceful" - because this is not
> about behavioural education, like training dogs, horses, or monkeys,
> and neither do we entertain a BDSM studio.
> What it needs instead is mention of the magic word "fsync". Because,
> we already know that - we just need a reminder at the proper place.
> Systems integrators are professional people. They are not in need of
> more beating (spell: forceful education), only of substantial
> technical hints and information.
> ! > Generally for peace of mind folks use third party tools like:
> !
> ! +1. Rolling your own archive script is seldom advisable.
> Well then, using commercial solutions brings its own problems. E.g.,
FYI, the projects Stephen and I mentioned are Open Source. I'm sure you
can get paid support for them, but you exist on a higher plane than that,
so you can use them for free.
> the people I happened to work for often had problems with torsion,
> which happens when the solution gets longer than, say, twenty meters,
> and these are walked at high speeds.
> They didn't have a problem with scripting - rather the opposite, they
> were happy with it and paid good money for.
> cheerio,
> PMc
--
Adrian Klaver
adrian.klaver@aklaver.com
On Tue, Jun 09, 2020 at 12:34:38PM -0700, Adrian Klaver wrote:
! The backup solution is?
! Fine rant below. Go forth and work your wonders.
I don't need to, anymore. I did that, for about 20 years - people
I used to work for as a consultant (major banks and insurance shops)
would usually run Informix or Oracle. Postgres is just my own private
fancy.
On Tue, Jun 09, 2020 at 03:42:48PM -0400, Stephen Frost wrote:
! * Peter (pmc@citylink.dinoex.sub.org) wrote:
! > This professional backup solution also offers support for postgres.
! > Sadly, it only covers postgres up to Rel.9, and that piece of software
! > wasn't touched in the last 6 or 7 years.
!
! Then it certainly doesn't work with the changes in v12, and probably has
! other issues, as you allude to.
Just having a look at their webpage, something seems to have been updated
recently, they now state that they have a new postgres adapter:
https://www.bareos.com/en/company_news/postgres-plugin-en1.html
Enjoy reading, and tell us what You think.
! > Actually, I am getting very tired of reading that something which can
! > easily be done within 20 lines of shell scripting, would need special
!
! This is just simply false- you can't do it properly in 20 lines of shell
! scripting.
Well, Your own docs show how to do it with a one-liner. So please
don't blame me for improving that to 20 lines.
! Sure, you can write something that has probably next to no
! error checking,
Before judging that, one should first specify what precisely is the
demand.
In a basic approach, the demand may be to get the logs out on tape in
a failsafe automated fashion without any miss, and get the data tree
out periodically, and have guaranteed that these files are untampered
as on disk.
And that can very well be done properly with an incremental filesystem
backup software plus some 20 lines of shellscript.
Now talking about doing an automated restore, or, having some menu-
driven solution, or -the worst of all- having a solution that can be
operated by morons - that's an entirely different matter.
In my understanding, backup is done via pg_dump. The archive logs are
for emergencies (data corruption, disaster), only. And emergencies
would usually be handled by some professional people who know what
they have to do.
You may consider different demands, and that is also fine, but doesn't
need to concern me.
! uses the deprecated API that'll cause your systems to
! fail to start if you ever happen to have a reboot during a backup
It is highly unlikely that this never happened to me during 15
years. So what does that mean? If I throw in a pg_start_backup('bogus'),
and then restart the cluster, it will not work anymore?
Lets see...
Clean stop/start - no issue whatsoever. (LOG: online backup mode
canceled)
kill -9 the whole flock - no issue whatsoever (Log: database system
was interrupted)
I won't pull the plug now, but that has certainly happened lots of
times in the past, and also yielded no issue whatsoever - simply
because there *never* was *any* issue whatsoever with Postgres (until
I got the idea to install the relatively fresh R.12 - but that's
understandable).
So maybe this problem exists only on Windows?
And yes, I read that whole horrible discussion, and I could tear my
hair out, really, concerning the "deprecated API". I suppose You mean
the mentioning in the docs that the "exclusive low-level backup" is
somehow deprecated.
This is a very big bad. Because: normally you can run the base backup
as a strictly ordinary file-level backup in "full" mode, just as any
backup software can do it. You will simply execute the
pg_start_backup() and pg_stop_backup() commands in the before- and
after- hooks - and any backup software will offer these hooks.
But now, with the now recommended "non-exclusive low-level backup",
the task is different: now your before-hook needs to do two things
at the same time:
1. keep a socket open in order to hold the connection to postgres
(because postgres will terminate the backup when the socket is
closed), and
2. invoke exit(0) (because the actual backup will not start until
the before- hook has properly delivered a successful exit code).
And, that is not only difficult, it is impossible.
So, what has to be done instead: you need to write a separate network
daemon, with the only purpose of holding that connection to postgres
open. And that network daemon needs to handle the communication to
the backup software on one side, and to postgres on the other side.
And that network daemon then needs the full-blown feature requirements
as a fail-safe network daemon should have (and that is a LOT!), plus
it needs to handle all kinds of possible failures (like network
interruptions) in that triangle, during the backup, and properly
notify both sides of whatever might be ongoing (and that is NOT
trivial).
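As an aside, the connection-holding requirement can at least be prototyped without a full daemon by driving the file copy from inside the one psql session that holds the backup open. A rough sketch of my own (label, data directory, and tar target are illustrative; error handling deliberately omitted):

```shell
# Build the input for a single psql session that keeps a non-exclusive
# backup open: the \! line runs the file-level copy while this very
# session (and thus the backup) is still alive, and pg_stop_backup(false)
# then ends it.  The returned labelfile must be saved into the backup.
nonexclusive_backup_script() {
    label="$1"; datadir="$2"; tarfile="$3"
    cat <<EOF
SELECT pg_start_backup('$label', false, false);
\\! tar -cf $tarfile $datadir
SELECT * FROM pg_stop_backup(false);
EOF
}
# Usage sketch:
#   nonexclusive_backup_script base /var/db/postgres/data /backup/base.tar \
#       | psql -d postgres
```

This does not remove the robustness concerns above (a dropped session still aborts the backup), but it shows a single script, not a network daemon, can satisfy the "connection stays open" rule.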
So yes, this is really a LOT of work. But the point is: this all is
not really necessary, because currently the stuff works fine in the
old way.
So, well, do away with the old method - but you cannot do it away
inside of rel.12 - and then I will stay with 12 for as long as
possible (and I don't think I will be the only one).
! has no way to provide verification that the backup was at all successful
It doesn't need to. That's the main point of using file level standard
backup - if that is tested and works, then it works for the data tree
and the logs just the same. And any monitoring is also just the same.
I see no point in creating artificial complications, which then create
a necessity for individual tools to handle them, which then create a
new requirement for testing and validating all these individual tools -
as this is strictly against the original idea as Brian Kernighan
explained it: use simple and versatile tools, and combine these to
achieve the individual task.
! > The only really interesting thing there is the pg_probackup. These
! > folks seem to have found a way to do row-level incremental backups.
!
! pg_probackup doesn't do row-level incremental backups, unless I've
! missed some pretty serious change in its development, but it does
! provide page-level,
Ah, well, anyway that seems to be something significantly smaller
than the usual 1 gig table file at once.
! with, as I recall, an extension that didn't get
! good reception when it was posted and discussed on these mailing lists
! by other PG hackers. I don't know if those concerns about it have been
! addressed or not, you might ask the pg_probackup folks if you're
! considering it as a solution.
Okay, thanks. That's interesting. I was just thinking if one could
cannibalize that respective code and make it into a filter for my own
purposes. And yes, the license would allow that.
And I was thinking that it will be quite an effort to get some kind
of logical verification that this scheme does actually work properly.
I don't consider it as a solution; I consider it as a piece of
functionality that, if working properly, does actually increase the
possibilities.
! PG generally isn't something that can be backed up using the simple file
! based backup solutions, as you might appreciate from just considering
! the number of tools written to specifically deal with the complexity of
! backing up an online PG cluster.
Yes, one could assume that. But then, I would prefer well-founded
technical reasons for what exactly would not work that way, and why it
would not work that way. And there seems to be not much about that.
And in such a case I tend to trust my own understanding, similar to the
full_page_writes matter. (In 2008 I heard about ZFS, and I concluded
that if ZFS is indeed copy-on-write, and if the description of the
full_page_writes option is correct, then one could safely switch it
off and free a lot of backup space - factor 10 at that time, with some
Rel.8. And so I started to use ZFS. Nobody would confirm that at that
time, but nowadays everybody does it.)
This was actually my job as a consultant: to de-mystify technology,
and make it understandable as an arrangement of well explainable
pieces of functionality with well-deducible consequences.
But this is no longer respected today; now people are expected to
*NOT* understand the technology they handle, and instead believe
in marketing and that it all is very complicated and unintelligible.
cheerio,
PMc
Greetings,
* Peter (pmc@citylink.dinoex.sub.org) wrote:
On Tue, Jun 09, 2020 at 03:42:48PM -0400, Stephen Frost wrote:
! * Peter (pmc@citylink.dinoex.sub.org) wrote:
! > This professional backup solution also offers support for postgres.
! > Sadly, it only covers postgres up to Rel.9, and that piece of software
! > wasn't touched in the last 6 or 7 years.
!
! Then it certainly doesn't work with the changes in v12, and probably has
! other issues, as you allude to.

Just having a look at their webpage, something seems to have been updated
recently; they now state that they have a new postgres adapter:

https://www.bareos.com/en/company_news/postgres-plugin-en1.html
Enjoy reading, and tell us what You think.
I'm afraid I'm not particularly interested in performing a pro bono
evaluation of a commercial product, though considering they've put out a
press release with obviously broken links, I already have suspicions of
what I'd find ... (try clicking on their 'experimental/nightly' link).
A quick look at the docs also shows that it's referring to
recovery.conf, which no longer exists since v12 was released back in
September, so, yeah, isn't exactly current.
! > Actually, I am getting very tired of reading that something which can
! > easily be done within 20 lines of shell scripting, would need special
!
! This is just simply false- you can't do it properly in 20 lines of shell
! scripting.

Well, Your own docs show how to do it with a one-liner. So please
don't blame me for improving that to 20 lines.
No, the documentation provides an example for the purpose of
understanding how the replacement in the command is done and explicitly
says that you probably shouldn't use that command.
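[For reference, the documented example is essentially a test-then-copy
one-liner. A minimal sketch of its semantics, run against scratch
directories so it needs no server - the paths and the WAL file name
are stand-ins:]

```shell
#!/bin/sh
# Sketch of the docs' example archive_command
# ('test ! -f dest && cp src dest'), exercised against scratch
# directories; the WAL file name and both paths are stand-ins.
set -e
src=$(mktemp -d)   # stands in for the cluster's pg_wal directory
dst=$(mktemp -d)   # stands in for /mnt/server/archivedir

printf 'fake WAL' > "$src/0000000100000017000000FC"

# %p and %f in the real archive_command become these two arguments:
archive() {
  test ! -f "$dst/$2" && cp "$1" "$dst/$2"
}

archive "$src/0000000100000017000000FC" "0000000100000017000000FC"
echo "archived once"

# A second attempt is refused rather than silently overwriting:
if archive "$src/0000000100000017000000FC" "0000000100000017000000FC"
then echo "overwrote (unexpected)"
else echo "refused to overwrite"
fi
```

[The real setting in postgresql.conf is
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
and the docs indeed present it only to illustrate the %p/%f substitution.]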
! Sure, you can write something that has probably next to no
! error checking,

Before judging that, one should first specify what precisely the
demand is.
I really don't need to in order to be able to judge the notion of a 20
line shell script being able to manage to perform a backup correctly.
In my understanding, backup is done via pg_dump. The archive logs are
for emergencies (data corruption, disaster) only. And emergencies
would usually be handled by some professional people who know what
they have to do.
No, that's not the case. pg_dump isn't at all involved in the backups
that we're talking about here which are physical, file-level, backups.
! uses the deprecated API that'll cause your systems to
! fail to start if you ever happen to have a reboot during a backupIt is highly unlikely that I did never have that happen during 15
years. So what does that mean? If I throw in a pg_start_backup('bogus'),
and then restart the cluster, it will not work anymore?
If you perform a pg_start_backup(), have a checkpoint happen such that
older WAL is removed, and then reboot the box or kill -9 postgres, no,
it's not going to start anymore because there's going to be a
backup_label file that is telling the cluster that it needs to start
replaying WAL from an older point in time than what you've got WAL for.
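[To make that concrete: the exclusive API drops a backup_label file into
the data directory, and on a crash-start the server insists on replaying
from the position recorded there. Its layout is roughly the following;
the values here are invented for illustration:]

```
START WAL LOCATION: 17/FC000028 (file 0000000100000017000000FC)
CHECKPOINT LOCATION: 17/FC000060
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2020-06-08 09:38:00 CEST
LABEL: bogus
```

[If that WAL segment has meanwhile been recycled by a checkpoint, the
recorded start position is unreachable and startup aborts.]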
Let's see...
Clean stop/start - no issue whatsoever. (LOG: online backup mode
canceled)
kill -9 the whole flock - no issue whatsoever (Log: database system
was interrupted)
I won't pull the plug now, but that has certainly happened lots of
times in the past, and also yielded no issue whatsoever - simply
because there *never* was *any* issue whatsoever with Postgres (until
I got the idea to install the relatively fresh R.12 - but that's
understandable).
Being lucky really isn't what you want to bet on.
So maybe this problem exists only on Windows?
No, it's not Windows specific.
And yes, I read that whole horrible discussion, and I could tear my
hair out, really, concerning the "deprecated API". I suppose You mean
the mentioning in the docs that the "exclusive low-level backup" is
somehow deprecated.
Yes, it's deprecated specifically because of the issues outlined above.
They aren't hypothetical, they do happen, and people do get bit by them.
This is a very big bad. Because: normally you can run the base backup
as a strictly ordinary file-level backup in "full" mode, just as any
backup software can do it. You will simply execute the
pg_start_backup() and pg_stop_backup() commands in the before- and
after- hooks - and any backup software will offer these hooks.

But now, with the now recommended "non-exclusive low-level backup",
the task is different: now your before-hook needs to do two things
at the same time:
1. keep a socket open in order to hold the connection to postgres
(because postgres will terminate the backup when the socket is
closed), and
2. invoke exit(0) (because the actual backup will not start until
the before- hook has properly delivered a successful exit code).
And that is not only difficult, it is impossible.
One would imagine that if the commercial vendors wished to actually
support PG properly, they'd manage to figure out a way to do so that
doesn't involve the kind of hook scripts and poor assumptions made about
them that you're discussing here.
Considering that every single backup solution written specifically for
PG, including the shell-based ones, have managed to figure out how to
work with the new API, it hardly seems impossible for them to do so.
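[One way the shell-based tools square the "hold the connection open"
requirement with an ordinary script is to drive the whole backup from
inside a single psql session, using \! for the copy step. A minimal
sketch; the label 'nightly', COPY_CMD and the PSQL override are
placeholders, and PSQL defaults to a dry run that just prints the
commands - set PSQL=psql to try it against a live v12 cluster:]

```shell
#!/bin/sh
# Sketch of the non-exclusive API driven from one script: start, copy
# and stop all run inside a SINGLE psql session, so the connection is
# held open for the whole copy step.
PSQL=${PSQL:-cat}      # dry run by default; set PSQL=psql for real use
COPY_CMD=${COPY_CMD:-"echo file-level copy of \$PGDATA goes here"}

run_backup() {
  $PSQL <<EOF
SELECT pg_start_backup('nightly', false, false);
\! $COPY_CMD
SELECT * FROM pg_stop_backup(false);
EOF
}

run_backup
```

[Against a real cluster, the third argument false selects the
non-exclusive mode, and pg_stop_backup(false) returns the label file
contents, which must then be stored as backup_label inside the backup
instead of in the data directory.]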
So yes, this is really a LOT of work. But the point is: this all is
not really necessary, because currently the stuff works fine in the
old way.
Unfortunately, no, it doesn't work fine in the general case- you might
be lucky enough to get it to work sometimes without failure, but that's
not how one designs systems, to work in the 'lucky' case and fail badly
in other cases.
So, well, do away with the old method - but you cannot take it away
within rel.12 - and then I will stay with 12 for as long as
possible (and I don't think I will be the only one).
You're welcome to stay with it as long as you'd like. I do hope we
finally rip it out, as was discussed before, in v13. Of course, we'll
stop supporting v12 about 5 years after we release it.
I see no point in creating artificial complications, which then create
a necessity for individual tools to handle them, which then create a
new requirement for testing and validating all these individual tools -
as this is strictly against the original idea as Brian Kernighan
explained it: use simple and versatile tools, and combine these to
achieve the individual task.
These aren't artificial complications.
! PG generally isn't something that can be backed up using the simple file
! based backup solutions, as you might appreciate from just considering
! the number of tools written to specifically deal with the complexity of
! backing up an online PG cluster.

Yes, one could assume that. But then, I would prefer well-founded
technical reasons for what exactly would not work that way, and why it
would not work that way. And there seems to be not much about that.
I've explained them above, and they were explained on the thread you
evidently glanced at regarding deprecating the old API.
And in such a case I tend to trust my own understanding, similar to the
full_page_writes matter. (In 2008 I heard about ZFS, and I concluded
that if ZFS is indeed copy-on-write, and if the description of the
full_page_writes option is correct, then one could safely switch it
off and free a lot of backup space - factor 10 at that time, with some
Rel.8. And so I started to use ZFS. Nobody would confirm that at that
time, but nowadays everybody does it.)
I don't agree that 'everybody does it', nor that it's a particularly
good idea to turn off full_page_writes and depend on ZFS to magic it away.
In fact, I'd suggest you go watch this PGCon talk, once it's available
later this month (from a competitor of mine, but a terribly smart
individual, so you don't need to listen to me about it)-
This was actually my job as a consultant: to de-mystify technology,
and make it understandable as an arrangement of well explainable
pieces of functionality with well-deducible consequences.
But this is no longer respected today; now people are expected to
*NOT* understand the technology they handle, and instead believe
in marketing and that it all is very complicated and unintelligible.
Perhaps I'm wrong, but I tend to feel like I've got a pretty decent
handle on both PostgreSQL and on how file-level backups of it work.
Thanks,
Stephen
On 6/9/20 4:35 PM, Peter wrote:
On Tue, Jun 09, 2020 at 12:34:38PM -0700, Adrian Klaver wrote:
And that can very well be done properly with an incremental filesystem
backup software plus some 20 lines of shellscript.
Read the caveats here:
https://www.postgresql.org/docs/12/backup-file.html
Now talking about doing an automated restore, or having some menu-
driven solution, or - the worst of all - having a solution that can be
operated by morons - that's an entirely different matter.

In my understanding, backup is done via pg_dump. The archive logs are
for emergencies (data corruption, disaster) only. And emergencies
would usually be handled by some professional people who know what
they have to do.
Read the entire section below, for why WAL's are for backup also. FYI,
if you don't properly set it up then you may not be protected for data
corruption. See PITR.
https://www.postgresql.org/docs/12/continuous-archiving.html
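[As a side note for v12, where recovery.conf no longer exists: the
recovery parameters moved into postgresql.conf, and a restore is
requested via a signal file. A minimal PITR sketch; the archive path
and target time are placeholders:]

```
# postgresql.conf (v12+): the former recovery.conf settings live here
restore_command = 'cp /mnt/server/archivedir/%f "%p"'
recovery_target_time = '2020-06-08 17:00:00'

# recovery is then triggered by an empty signal file in the data dir:
#   touch $PGDATA/recovery.signal
```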
Postgres is used by a wide gamut of people of differing abilities, many
of whom appreciate the availability of tested solutions to protect their
data as they are not morons and understand there are people who can make
their life easier.
This was actually my job as a consultant: to de-mystify technology,
and make it understandable as an arrangement of well explainable
pieces of functionality with well-deducible consequences.
Not seeing it.
But this is no longer respected today; now people are expected to
*NOT* understand the technology they handle, and instead believe
in marketing and that it all is very complicated and unintelligible.

cheerio,
PMc
--
Adrian Klaver
adrian.klaver@aklaver.com
On Tue, Jun 09, 2020 at 03:42:48PM -0400, Stephen Frost wrote:
! > And then 90% of the things offered here become superfluous, because
! > they are already handled site-wide. And then you will have to
! > consider integration of both pieces - and that will most likely be
! > more work and more error-prone than just writing a few adapters in
! > shell.
!
! pgbackrest's repo can be safely backed up using the simple file-based
! backup utilities that you're referring to here. I suspect some of the
! other solutions' backups also could be, but you'd probably want to make
! sure.
What repo?? I seem to have missed that at first glance.
Are You indeed suggesting that one should have their data within
the database, where it is worked with, and then use Your tool
to copy it to some "repo" disk playground whatever area, and then
use their regular backup system to COPY IT AGAIN into their
backup/archiving system? Are You kiddin'?
Are You indeed suggesting that people should buy the amount
of disks that they use for their database AGAIN in order for Your
software to copy the stuff around?
Is this becoming a madhouse, or are You going to refund them that?
Let me tell You something: the people I used to work for sometimes
had a problem. They had some amount of data that was created during
the day, and they had the night to write that data away to backup.
That would usually mean four or eight of the big tapes, streaming in
parallel, fibers saturated, all night thru. And the problem usually was
that they would need a longer night. At least the math had to be done
properly.
Maybe You never encountered these, but there are surroundings where
there is no spare room for nonsense. Maybe that's why these people
preferred to use Oracle.
cheerio,
PMc