WAL segments removed from primary despite logical replication slot still needing them.
Hi,
We are still in the process of preparing upgrades from Ubuntu Bionic with Pg
12 to Focal with Pg 14.
And today (well, yesterday) we hit something that looks like a bug.
Jumping to the problem - WAL files needed by logical replication were
removed from the primary, despite the slot being there.
Now, a bit more detail.
Our replication is set up using the following method:
1. We create all tables and primary keys on pg14.
2. A replication slot is created, with no tables.
3. We add some tables to the replication set, and wait for all tables to
switch state to "ready".
4. Rinse and repeat until all tables are in the replication set.
5. Once all is ready, we add the rest of the schema (indexes, constraints,
functions, fkeys, and so on).
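For reference, a minimal sketch of the steps above using built-in logical
replication; all object and connection names here are hypothetical:

```sql
-- Step 2: an empty publication on pg12, and a subscription on pg14
-- (CREATE SUBSCRIPTION creates the replication slot on the publisher).
CREATE PUBLICATION focal14;                          -- on pg12
CREATE SUBSCRIPTION focal14
    CONNECTION 'host=bionic12 dbname=dbname'
    PUBLICATION focal14;                             -- on pg14

-- Steps 3-4: add tables in batches, refresh, and wait for state 'r' (ready).
ALTER PUBLICATION focal14 ADD TABLE schema_name.table_name;  -- on pg12
ALTER SUBSCRIPTION focal14 REFRESH PUBLICATION;              -- on pg14
SELECT srrelid::regclass, srsubstate
  FROM pg_subscription_rel
 WHERE srsubstate <> 'r';                            -- repeat until empty
```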
This works. Worked?
Yesterday we saw something that broke the replication.
First data from bionic side:
when | currentLSN | slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
2022-10-04 14:13:49 UTC | CA0E/273F2180 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743431068 | CA09/BE23BC70 | CA09/C82B2A58
2022-10-04 14:14:48 UTC | CA0E/3B5FF5C8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743451605 | CA09/C9DA89B0 | CA09/D3FA9898
2022-10-04 14:15:47 UTC | CA0E/4F7C06F8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743479469 | CA09/DD32CE50 | CA09/E65E7F80
2022-10-04 14:16:46 UTC | CA0E/6CA820A8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743516374 | CA09/EDD61CB8 | CA09/F3A856D8
2022-10-04 14:17:45 UTC | CA0E/8C7F3950 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743635890 | CA0A/2157B8F8 | CA0A/2A4C48F8
2022-10-04 14:18:44 UTC | CA0E/A5D874F8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743686305 | CA0A/38BE5AB0 | CA0A/3D2B0C20
2022-10-04 14:19:43 UTC | CA0E/C3F6C240 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743749741 | CA0A/49ED91C8 | CA0A/4F6D3048
2022-10-04 14:20:42 UTC | CA0E/D7DBA590 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743749741 | CA0A/49ED91C8 | CA0A/4F6D3048
2022-10-04 14:21:41 UTC | CA0E/F65C0F48 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 29207 | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:22:40 UTC | CA0F/103DBE98 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:23:39 UTC | CA0F/22D791A8 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:24:38 UTC | CA0F/32E81380 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:25:38 UTC | CA0F/41E4A380 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:26:37 UTC | CA0F/58FD1B30 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:27:36 UTC | CA0F/6F732698 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:28:35 UTC | CA0F/855E2F50 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:29:34 UTC | CA0F/94B36450 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:30:33 UTC | CA0F/A7553BF8 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:31:33 UTC | CA0F/BB10F020 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:32:32 UTC | CA0F/D1B4DF10 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:33:31 UTC | CA0F/EE332240 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:34:30 UTC | CA10/3B78E10 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
As you can see, around 2:20pm replication stopped fetching data.
At this time, the pg14 logs don't show any problems whatsoever.
pg_restore was doing its work with index creation, and some queries
like:
ALTER TABLE ONLY schema_name.table_name REPLICA IDENTITY USING INDEX index_table_name_replica_identity;
At 2:20:48pm there was:
2022-10-04 14:20:48.579 UTC,,,59744,,633c3fac.e960,62,,2022-10-04 14:14:04 UTC,8/6519966,7015376,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,"","logical replication worker",,0
But these errors had happened earlier too (minutes/hours earlier).
But then, 3 minutes later, I got:
2022-10-04 14:23:10.618 UTC,,,63569,,633c4140.f851,38,,2022-10-04 14:20:48 UTC,8/0,0,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.",,,,,,,,,"","logical replication worker",,0
2022-10-04 14:23:10.623 UTC,,,64701,,633c41ce.fcbd,1,,2022-10-04 14:23:10 UTC,8/6529701,0,LOG,00000,"logical replication apply worker for subscription ""focal14"" has started",,,,,,,,,"","logical replication worker",,0
2022-10-04 14:23:10.624 UTC,,,1801,,633aa5ea.709,41,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 63569) exited with exit code 1",,,,,,,,,"","postmaster",,0
2022-10-04 14:23:10.643 UTC,,,64701,,633c41ce.fcbd,2,,2022-10-04 14:23:10 UTC,8/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010000CA0A00000049 has already been removed",,,,,,,,,"","logical replication worker",,0
2022-10-04 14:23:10.646 UTC,,,1801,,633aa5ea.709,42,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 64701) exited with exit code 1",,,,,,,,,"","postmaster",,0
The WAL file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visible on pg12/bionic.
Unfortunately I don't know the content of pg_wal on bionic at the
moment, but currently I see:
$ ls -l pg_wal/| head
total 297534552
-rw------- 1 postgres postgres 349 Feb 11 2022 000000010000A1410000001B.00C81918.backup
-rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A0000004D
-rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A0000004E
-rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A0000004F
-rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000050
-rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000051
-rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000052
-rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000053
-rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000054
So, 4 files are missing.
These were archived properly, and I tried to restore them from the archive and
put them in pg_wal, but even then pg12 was rejecting attempts to connect to the
focal14 slot with the same message about "requested WAL segment
000000010000CA0A00000049 has already been removed"
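For what it's worth, the segment name in the error follows directly from
restart_lsn; assuming timeline 1 and the default 16MB wal_segment_size, the
mapping can be checked with:

```sql
SELECT pg_walfile_name('CA0A/49ED91C8'::pg_lsn);
-- 000000010000CA0A00000049
```

which matches the segment from the error; with ...4D being the oldest file
still present, ...49 through ...4C are exactly the 4 missing segments.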
I know that at this moment we're hosed. This replication was removed,
everything cleared (after all, we had already accumulated 300G of WAL, and
we can't extend the FS forever, so we needed to clean up the situation).
But, the way I look at it, Pg12 shouldn't have removed these 4 WAL files,
and the fact that it did means that either we did something wrong (what,
though?), or there is some kind of bug in Pg.
Does anyone have any idea how to investigate/fix this?
Best regards,
depesz
On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
> [...]
> As you can see around 2:20pm replication stopped fetching data.
Around that time, it seems the slot is also marked inactive, probably
due to walsender exit.
> [...]
> Wal file has been removed. Please note that the file was, as shown
> earlier, still within "restart_lsn" as visible on pg12/bionic.
This is quite strange and I am not able to see the reason why this can
happen. The only way to debug this problem that comes to mind is to
add more LOGs around the code that removes the WAL files. For example,
we can try to print the value of the minimum slot LSN (keep) and
logSegNo in KeepLogSeg().
> [...]
> So, 4 files are missing.
> These were archived properly, and I tried to restore them from archive, and put
> them in pg_wal, but even then pg12 was rejecting tries to connect to focal14
> slot with the same message about "requested WAL segment
> 000000010000CA0A00000049 has already been removed"
I think you are seeing this behavior because we update the
lastRemovedSegNo before removing files in RemoveOldXlogFiles() and
then we use that to give the error you are seeing.
> [...]
> But, the way I look at it, Pg12 shouldn't be removing these 4 wal files,
> and the fact that it did means that either we did something wrong (what,
> though?), or there is some kind of bug in Pg.
> Anyone has any idea on how to investigate/fix that?
As suggested above, I feel we need to add some LOGs to see what is
going on, unless you have a reproducible test case.
--
With Regards,
Amit Kapila.
At Sun, 16 Oct 2022 10:35:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
>> [...]
>> Wal file has been removed. Please note that the file was, as shown
>> earlier, still within "restart_lsn" as visible on pg12/bionic.
> This is quite strange and I am not able to see the reason why this can
> happen. The only way to debug this problem that comes to mind is to
> add more LOGs around the code that removes the WAL files. For example,
> we can try to print the value of the minimum slot LSN (keep) and
> logSegNo in KeepLogSeg().
What I can see from the information at hand is that the walsender was
killed (not died) twice for some reason, and then some "reserved" WAL
files were lost. I don't see the reason for the second walsender (pid
29207) being killed, but if the "has already been removed" log line
above is the first one, the file removal probably happened at
14:23:10. (A log is needed to know that.)
Anyway, I didn't come up with a possibility other than someone else
having removed those files.. For example, virus scanning software. If
such software kills processes that are holding open files that the
software identified as malicious and is going to remove, could that
happen?
>> So, 4 files are missing.
>> These were archived properly, and I tried to restore them from archive, and put
>> them in pg_wal, but even then pg12 was rejecting tries to connect to focal14
>> slot with the same message about "requested WAL segment
>> 000000010000CA0A00000049 has already been removed"
> I think you are seeing this behavior because we update the
> lastRemovedSegNo before removing files in RemoveOldXlogFiles() and
> then we use that to give the error you are seeing.
lastRemovedSegNo is updated once per removed segment. Four files are
lost in this case.
>> [...]
>> But, the way I look at it, Pg12 shouldn't be removing these 4 wal files,
>> and the fact that it did means that either we did something wrong (what,
>> though?), or there is some kind of bug in Pg.
>> Anyone has any idea on how to investigate/fix that?
> As suggested above, I feel we need to add some LOGs to see what is
> going on unless you have a reproducible testcase.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
>> Wal file has been removed. Please note that the file was, as shown
>> earlier, still within "restart_lsn" as visible on pg12/bionic.
> This is quite strange and I am not able to see the reason why this can
> happen. The only way to debug this problem that comes to mind is to
> add more LOGs around the code that removes the WAL files. For example,
> we can try to print the value of the minimum slot LSN (keep) and
> logSegNo in KeepLogSeg().
That would require changing the pg sources, I think, recompiling, and
retrying?
And on the pg12 side? I don't think we can do it, but I'll ask the boss.
>> Anyone has any idea on how to investigate/fix that?
> As suggested above, I feel we need to add some LOGs to see what is
> going on unless you have a reproducible testcase.
The problem is that we were trying to use this approach to upgrade the pg12
production db to pg14. Modifying it (pg12) will be tricky, as it would
require a restart. Unless there is a way to add extra logging without
a restart...
depesz
On Mon, Oct 17, 2022 at 03:53:07PM +0900, Kyotaro Horiguchi wrote:
> Anyway, I didn't come up with a possibility other than someone else
> having removed those files.. For example, virus scanning software. If
> such software kills processes that are holding open files that the
> software identified as malicious and is going to remove, could that
> happen?
We don't use virus scanning there. The pg12 server is on Ubuntu 18.04,
used for the normal production workload, and pg14 is on Ubuntu Focal, with
no load aside from physical streaming replication to 2 pg14 replicas.
Best regards,
depesz
On Mon, Oct 17, 2022 at 12:23 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Sun, 16 Oct 2022 10:35:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
>>> So, 4 files are missing.
>>> These were archived properly, and I tried to restore them from archive, and put
>>> them in pg_wal, but even then pg12 was rejecting tries to connect to focal14
>>> slot with the same message about "requested WAL segment
>>> 000000010000CA0A00000049 has already been removed"
>> I think you are seeing this behavior because we update the
>> lastRemovedSegNo before removing files in RemoveOldXlogFiles() and
>> then we use that to give the error you are seeing.
> lastRemovedSegNo is updated once per removed segment. Four files are
> lost in this case.
I didn't understand your response. I was saying that one possible
reason why the error appears even after restoring the files from the
archive is the lastRemovedSegNo-related check in CheckXLogRemoved();
we update its value while removing old xlog files. From this behavior,
it appears that it was indeed the server that removed those files, even
though the reason is not clear yet.
--
With Regards,
Amit Kapila.
On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
>> This is quite strange and I am not able to see the reason why this can
>> happen. The only way to debug this problem that comes to mind is to
>> add more LOGs around the code that removes the WAL files. For example,
>> we can try to print the value of the minimum slot LSN (keep) and
>> logSegNo in KeepLogSeg().
> That would require changing pg sources, I think, recompiling, and
> retrying?
Yes. BTW, are you on the latest release of pg12? If not, you could
check the release notes to see if there is any related bug fix in the
later releases.
Is this problem reproducible? If so, can you find out why there were
connection issues between walsender and walreceiver multiple times?
--
With Regards,
Amit Kapila.
At Tue, 18 Oct 2022 16:51:26 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
On Mon, Oct 17, 2022 at 12:23 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:At Sun, 16 Oct 2022 10:35:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
>> [...]
>> lastRemovedSegNo is updated once per removed segment. Four files are
>> lost in this case.
> I didn't understand your response. I was saying the one possible
> reason why even after restoring files from the archive the error
> appears is because of the lastRemovedSegNo related check in function
> CheckXLogRemoved() and we update its value while removing old xlog
> files. From this behavior, it appears that somehow the server has only
> removed those files even though the reason is not clear yet.
I meant that if PostgreSQL did something wrong there (which I don't
understand at all), the difference between lastRemovedSegNo and
the last actually removed segment won't be more than 1.
CheckXLogRemoved() warns for a segment logically already removed but
physically still able to be opened. On the other hand, WalSndSegmentOpen()
emits the same error earlier than CheckXLogRemoved() warns if the
segment is actually ENOENT, regardless of the value of
lastRemovedSegNo.
One point that bothers me is that it seems the walsender was
killed. If the file is removed while the walsender is working, the logical
replication worker receives the error and emits "ERROR: could not
receive data...has been removed" instead of being suddenly disconnected as
in this case. Considering this together with the possibility that the
segments were removed by someone else, I suspected virus scanners, but
that has been found to be false.. (I still don't know of any virus
scanner that kills processes holding a suspectedly-malicious file open.)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Oct 19, 2022 at 7:31 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> One point that bothers me is that it seems the walsender was
> killed. If the file is removed while the walsender is working, the
> logical replication worker receives the error and emits "ERROR: could
> not receive data...has been removed" instead of being suddenly
> disconnected as in this case. Considering this together with the
> possibility that the segments were removed by someone else, I
> suspected virus scanners, but that has been found to be false..
Yeah, as requested yesterday, maybe checking whether this can be reproduced
will help us narrow down the cause. If this doesn't happen due to
an application mistake, it appears that after the slot became inactive
while the walsender was down, the server removed the required
files, but there is nothing in the code that suggests that such a thing
can happen.
Are there any more logs on the publisher/subscriber that indicate any
sort of error?
--
With Regards,
Amit Kapila.
On Wed, Oct 19, 2022 at 03:16:59PM +0530, Amit Kapila wrote:
> Yeah, as requested yesterday, maybe checking whether this can be
> reproduced will help us narrow down the cause. If this doesn't happen
> due to an application mistake, it appears that after the slot became
> inactive while the walsender was down, the server removed the required
> files, but there is nothing in the code that suggests that such a
> thing can happen.
> Are there any more logs on the publisher/subscriber that indicate any
> sort of error?
This specific replication setup was torn down; we couldn't keep it
running. I can set it up once more, and with some luck it will break again.
It will take a day, though.
Is there anything I can do to make it more debuggable, but that will not
require a restart of pg12?
Best regards,
depesz
On Tue, Oct 18, 2022 at 04:57:52PM +0530, Amit Kapila wrote:
>> That would require changing pg sources, I think, recompiling, and
>> retrying?
> Yes. BTW, are you on the latest release of pg12? If not, you could
> check the release notes to see if there is any related bug fix in the
> later releases.
> Is this problem reproducible? If so, can you find out why there were
> connection issues between walsender and walreceiver multiple times?
I can try to redo it, but before I do - is there anything I could do to
either side of the replication to increase our chances of figuring out the
underlying problem? I can't restart pg12, though.
Best regards,
depesz
On Wed, Oct 19, 2022 at 3:50 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
On Tue, Oct 18, 2022 at 04:57:52PM +0530, Amit Kapila wrote:
On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
> [...]
> I can try to redo it, but before I do - anything I could do to either
> side of replication to increase our chances at figuring out the
> underlying problem? I can't restart pg12, though.
One idea is to change the log level to DEBUG2 so that we can print which
files are removed by the server via RemoveOldXlogFiles():

RemoveOldXlogFiles()
{
...
    elog(DEBUG2, "attempting to remove WAL segments older than log file %s",
         lastoff);
...
}

If we can do this then at the very least we can know whether the
required files are removed by the server or by some external
application.
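If changing the sources is not an option, this DEBUG2 output can, I believe,
be enabled without a restart, since log_min_messages only needs a config
reload (superuser required):

```sql
-- log_min_messages has "sighup" context, so a reload is enough:
ALTER SYSTEM SET log_min_messages = 'DEBUG2';
SELECT pg_reload_conf();
-- The "attempting to remove WAL segments older than log file ..." lines
-- should then show up in the server log.
-- Revert afterwards with:
--   ALTER SYSTEM RESET log_min_messages; SELECT pg_reload_conf();
```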
--
With Regards,
Amit Kapila.
On Wed, Oct 19, 2022 at 04:14:28PM +0530, Amit Kapila wrote:
> One idea is to change the log level to DEBUG2 so that we can print
> which files are removed by the server via RemoveOldXlogFiles():
>
>     elog(DEBUG2, "attempting to remove WAL segments older than log file %s",
>          lastoff);
>
> If we can do this then at the very least we can know whether the
> required files are removed by the server or by some external
> application.
That should be doable. It will take some time to set up a test env and run
it. Will get back to you on this.
Best regards,
depesz
On Tue, Oct 18, 2022 at 04:57:52PM +0530, Amit Kapila wrote:
On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
WAL file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visible on pg12/bionic.

This is quite strange and I am not able to see the reason why this can
happen. The only way to debug this problem that comes to mind is to
add more logging around the code that removes the WAL files. For example,
we can try to print the value of the minimum slot LSN (keep) and logSegNo in
KeepLogSeg().

That would require changing pg sources, I think, recompiling, and
retrying?

Yes. BTW, are you on the latest release of pg12? If not, you can
check the release notes to see if there are any related bug fixes in
the later releases.
We're not. We're on 12.9, and there seem to be some WAL-related changes
since then:
https://why-upgrade.depesz.com/show?from=12.9&to=12.12&keywords=wal*
I will talk with the responsible people about upgrading it, and then redo
the test.
Best regards,
depesz
On Wed, Oct 19, 2022 at 01:11:06PM +0200, hubert depesz lubaczewski wrote:
On Tue, Oct 18, 2022 at 04:57:52PM +0530, Amit Kapila wrote:
On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
WAL file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visible on pg12/bionic.

This is quite strange and I am not able to see the reason why this can
happen. The only way to debug this problem that comes to mind is to
add more logging around the code that removes the WAL files. For example,
we can try to print the value of the minimum slot LSN (keep) and logSegNo in
KeepLogSeg().

That would require changing pg sources, I think, recompiling, and
retrying?

Yes. BTW, are you on the latest release of pg12? If not, you can
check the release notes to see if there are any related bug fixes in
the later releases.

We're not. We're on 12.9, and there seem to be some WAL-related changes
since then:
https://why-upgrade.depesz.com/show?from=12.9&to=12.12&keywords=wal*

I will talk with the responsible people about upgrading it, and then redo
the test.
Hi,
just an update. Due to procedures we will upgrade to 12.12 at the
beginning of November. Then, I'll try to replicate the problem again,
and we'll see how this will pan out.
For the next test I will do my best to use the debug2 log level.
Best regards,
depesz
On Thu, Oct 20, 2022 at 01:55:16PM +0200, hubert depesz lubaczewski wrote:
Hi,
just an update. Due to procedures we will upgrade to 12.12 at the
beginning of November. Then, I'll try to replicate the problem again,
and we'll see how this will pan out.
For the next test I will do my best to use the debug2 log level.
So. I have an update.
Out of the 8 servers that we prepared upgrades for, one failed with the
same/similar problem.
Unfortunately I don't know if this is the same server as before. Sorry.
Specifically, I see now, in logs on focal/pg14 side:
#v+
2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
#v-
On source (bionic/pg12), I see:
#v+
=# select * from pg_replication_slots ;
slot_name │ plugin │ slot_type │ datoid │ database │ temporary │ active │ active_pid │ xmin │ catalog_xmin │ restart_lsn │ confirmed_flush_lsn
═══════════╪══════════╪═══════════╪════════╪══════════╪═══════════╪════════╪════════════╪════════╪══════════════╪════════════════╪═════════════════════
focal14 │ pgoutput │ logical │ 16607 │ canvas │ f │ f │ [null] │ [null] │ 3241443528 │ 1039D/83825958 │ 1039D/96453F38
(1 row)
#v-
but Pg, like previously, lost some WAL files, this time 8 of them:
#v+
=$ sudo ls -l ~postgres/12/main/pg_wal | head
total 224403644
-rw------- 1 postgres postgres 349 Feb 11 2022 000000010000BDEF000000FF.00045F68.backup
-rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008B
-rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008C
-rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008D
...
#v-
This time we had debug2 enabled on both sides.
Looks like we lost segments:
000000010001039D00000083, 000000010001039D00000084,
000000010001039D00000085, 000000010001039D00000086,
000000010001039D00000087, 000000010001039D00000088,
000000010001039D00000089, and 000000010001039D0000008A.
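As a cross-check (my own arithmetic, not from the thread): an LSN maps to its 16 MB WAL segment file as timeline, segno/256, and segno%256 in hex, and the slot's restart_lsn 1039D/83825958 lands exactly in the first lost file:

```shell
# Map an LSN to the 16MB WAL segment file that contains it.
# File name = timeline (8 hex digits) + segno/256 (8 hex) + segno%256 (8 hex).
lsn="1039D/83825958"   # restart_lsn of the focal14 slot
tli=1
hi=$(( 16#${lsn%/*} ))                  # high 32 bits of the LSN
lo=$(( 16#${lsn#*/} ))                  # low 32 bits of the LSN
segno=$(( hi * 256 + lo / 16777216 ))   # 2^32 / 16MB = 256 segments per hi unit
segname=$(printf '%08X%08X%08X' "$tli" $(( segno / 256 )) $(( segno % 256 )))
echo "$segname"   # 000000010001039D00000083 - the first removed segment
```

The same answer comes from `SELECT pg_walfile_name('1039D/83825958')` on a server with the default segment size.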
It looks like the first time the replica logged the error was:
#v+
2022-11-10 21:06:16.427 UTC,,,572364,,636d67c8.8bbcc,2,,2022-11-10 21:06:16 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
#v-
In Pg logs on source I see:
#v+
2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
2022-11-10 21:06:07.523 UTC,,,28925,,636378d6.70fd,4594903,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000083""",,,,,,,,,""
2022-11-10 21:06:07.524 UTC,,,28925,,636378d6.70fd,4594904,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000084""",,,,,,,,,""
2022-11-10 21:06:07.525 UTC,,,28925,,636378d6.70fd,4594905,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000085""",,,,,,,,,""
2022-11-10 21:06:07.526 UTC,,,28925,,636378d6.70fd,4594906,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000086""",,,,,,,,,""
2022-11-10 21:06:07.527 UTC,,,28925,,636378d6.70fd,4594907,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000087""",,,,,,,,,""
2022-11-10 21:06:07.528 UTC,,,28925,,636378d6.70fd,4594908,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000088""",,,,,,,,,""
2022-11-10 21:06:07.529 UTC,,,28925,,636378d6.70fd,4594909,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000089""",,,,,,,,,""
2022-11-10 21:06:07.530 UTC,,,28925,,636378d6.70fd,4594910,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000008A""",,,,,,,,,""
2022-11-10 21:06:16.426 UTC,"upgrayedd","canvas",28608,"10.1.238.101:45328",636d67c8.6fc0,8,"idle",2022-11-10 21:06:16 UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
2022-11-10 21:06:21.456 UTC,"upgrayedd","canvas",28815,"10.1.238.101:52316",636d67cd.708f,8,"idle",2022-11-10 21:06:21 UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
2022-11-10 21:06:26.488 UTC,"upgrayedd","canvas",28962,"10.1.238.101:52320",636d67d2.7122,8,"idle",2022-11-10 21:06:26 UTC,148/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
#v-
So, the whole set of missing files was recycled at the same time.
One more thing that I have is a copy of pg_replication_slots info. This data is taken periodically, around every minute. Each line, aside from the pg_replication_slots data, also has the output of pg_current_wal_lsn() and a timestamp. It looks like this:
#v+
timestamp pg_current_wal_lsn slot_name plugin slot_type datoid database temporary active active_pid xmin catalog_xmin restart_lsn confirmed_flush_lsn
2022-11-10 20:50:04 UTC 103A6/C49D36F8 focal14 pgoutput logical 16607 canvas f t 16794 \N 3241375739 1039D/65365F68 1039D/782908E8
2022-11-10 20:51:03 UTC 103A6/CC957148 focal14 pgoutput logical 16607 canvas f f \N \N 3241375739 1039D/65365F68 1039D/782908E8
2022-11-10 20:52:02 UTC 103A6/D7348A30 focal14 pgoutput logical 16607 canvas f t 21140 \N 3241422387 1039D/65365F68 1039D/78C94588
2022-11-10 20:53:01 UTC 103A6/EB24F408 focal14 pgoutput logical 16607 canvas f t 21140 \N 3241422387 1039D/65365F68 1039D/7C8E7D78
2022-11-10 20:54:01 UTC 103A7/27A4778 focal14 pgoutput logical 16607 canvas f f \N \N 3241422387 1039D/65365F68 1039D/7C8E7D78
2022-11-10 20:55:00 UTC 103A7/1A635CD0 focal14 pgoutput logical 16607 canvas f t 27607 \N 3241422387 1039D/78833568 1039D/7F8BCD50
2022-11-10 20:55:59 UTC 103A7/2D5E6EE8 focal14 pgoutput logical 16607 canvas f t 27607 \N 3241422387 1039D/78833568 1039D/818B6990
2022-11-10 20:56:58 UTC 103A7/44548000 focal14 pgoutput logical 16607 canvas f t 27607 \N 3241422387 1039D/7E4CDF90 1039D/89E93620
2022-11-10 20:57:57 UTC 103A7/579F8000 focal14 pgoutput logical 16607 canvas f f \N \N 3241422387 1039D/7E4CDF90 1039D/89E93620
2022-11-10 20:58:56 UTC 103A7/6B6C96D0 focal14 pgoutput logical 16607 canvas f f \N \N 3241422387 1039D/7E4CDF90 1039D/89E93620
2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18
2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0
2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0
2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-10 21:09:46 UTC 103A8/233202E0 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-10 21:10:45 UTC 103A8/2FBC9720 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
#v-
Now, I can keep this situation up for some time, if more debugging would be helpful,
but I don't know for how long (disk space issues).
Is there anything in here that helps? Any other info I can dig out?
Best regards,
depesz
On Fri, Nov 11, 2022 at 03:50:40PM +0100, hubert depesz lubaczewski wrote:
#v+
2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
#v-
Sooo... plot thickens.
Without any changes, manual rebuild, or anything, yesterday the problem
seems to have solved itself?!
In logs on focal/pg14 I see:
#v+
2022-11-12 20:55:39.190 UTC,,,1897563,,6370084b.1cf45b,2,,2022-11-12 20:55:39 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
#v-
And this is *the last* such message.
On bionic/pg12 we have in logs from pg_replication_slots:
#v+
timestamp pg_current_wal_lsn slot_name plugin slot_type datoid database temporary active active_pid xmin catalog_xmin restart_lsn confirmed_flush_lsn
2022-11-12 20:51:00 UTC 1041E/D3A0E540 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-12 20:51:59 UTC 1041E/D89B6000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-12 20:52:58 UTC 1041E/E0547450 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-12 20:53:58 UTC 1041E/E59634F0 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-12 20:54:57 UTC 1041E/EBB50DE8 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-12 20:55:55 UTC 1041E/FBBC3160 focal14 pgoutput logical 16607 canvas f t 18626 \N 3241450490 1039D/9170B010 1039D/9B86EAF0
2022-11-12 20:56:55 UTC 1041F/FF34000 focal14 pgoutput logical 16607 canvas f t 18626 \N 3241469432 1039D/A21B4598 1039D/A928A6D0
2022-11-12 20:57:54 UTC 1041F/277FAE30 focal14 pgoutput logical 16607 canvas f t 18626 \N 3241480448 1039D/AD4C7A00 1039D/BA3FC840
2022-11-12 20:58:53 UTC 1041F/319A0000 focal14 pgoutput logical 16607 canvas f t 18626 \N 3241505462 1039D/C5C32398 1039D/DF899428
2022-11-12 20:59:52 UTC 1041F/3A399688 focal14 pgoutput logical 16607 canvas f t 18626 \N 3241527824 1039D/F3393280 1039E/AD9A740
...
2022-11-13 08:00:44 UTC 1043F/E464B738 focal14 pgoutput logical 16607 canvas f t 18626 \N 3292625412 1043F/E0F38E78 1043F/E4609628
#v-
I have no idea what has changed or why.
If it helps I can provide logs, but would prefer to do it off list.
Also, while the problem with this cluster is "solved", I still have
around 300 other clusters to upgrade, and at least one has hit the same
problem today.
Best regards,
depesz
On Fri, Nov 11, 2022 at 8:20 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
On Thu, Oct 20, 2022 at 01:55:16PM +0200, hubert depesz lubaczewski wrote:
Hi,
just an update. Due to procedures we will upgrade to 12.12 at the
beginning of November. Then, I'll try to replicate the problem again,
and we'll see how this will pan out.
For the next test I will do my best to use the debug2 log level.

So. I have an update.
Out of the 8 servers that we prepared upgrades for, one failed with the
same/similar problem.
Unfortunately I don't know if this is the same server as before. Sorry.

Specifically, I see now, in logs on focal/pg14 side:
#v+
2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
#v-

On source (bionic/pg12), I see:
#v+
=# select * from pg_replication_slots ;
slot_name │ plugin │ slot_type │ datoid │ database │ temporary │ active │ active_pid │ xmin │ catalog_xmin │ restart_lsn │ confirmed_flush_lsn
═══════════╪══════════╪═══════════╪════════╪══════════╪═══════════╪════════╪════════════╪════════╪══════════════╪════════════════╪═════════════════════
focal14 │ pgoutput │ logical │ 16607 │ canvas │ f │ f │ [null] │ [null] │ 3241443528 │ 1039D/83825958 │ 1039D/96453F38
(1 row)
#v-

but Pg, like previously, lost some WAL files, this time 8 of them:
#v+
=$ sudo ls -l ~postgres/12/main/pg_wal | head
total 224403644
-rw------- 1 postgres postgres 349 Feb 11 2022 000000010000BDEF000000FF.00045F68.backup
-rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008B
-rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008C
-rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008D
...
#v-

This time we had debug2 enabled on both sides.
Looks like we lost segments:
000000010001039D00000083, 000000010001039D00000084,
000000010001039D00000085, 000000010001039D00000086,
000000010001039D00000087, 000000010001039D00000088,
000000010001039D00000089, and 000000010001039D0000008A.

It looks like the first time the replica logged the error was:
#v+
2022-11-10 21:06:16.427 UTC,,,572364,,636d67c8.8bbcc,2,,2022-11-10 21:06:16 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
#v-

In Pg logs on source I see:
#v+
2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
2022-11-10 21:06:07.523 UTC,,,28925,,636378d6.70fd,4594903,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000083""",,,,,,,,,""
2022-11-10 21:06:07.524 UTC,,,28925,,636378d6.70fd,4594904,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000084""",,,,,,,,,""
2022-11-10 21:06:07.525 UTC,,,28925,,636378d6.70fd,4594905,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000085""",,,,,,,,,""
2022-11-10 21:06:07.526 UTC,,,28925,,636378d6.70fd,4594906,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000086""",,,,,,,,,""
2022-11-10 21:06:07.527 UTC,,,28925,,636378d6.70fd,4594907,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000087""",,,,,,,,,""
2022-11-10 21:06:07.528 UTC,,,28925,,636378d6.70fd,4594908,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000088""",,,,,,,,,""
2022-11-10 21:06:07.529 UTC,,,28925,,636378d6.70fd,4594909,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000089""",,,,,,,,,""
2022-11-10 21:06:07.530 UTC,,,28925,,636378d6.70fd,4594910,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000008A""",,,,,,,,,""
2022-11-10 21:06:16.426 UTC,"upgrayedd","canvas",28608,"10.1.238.101:45328",636d67c8.6fc0,8,"idle",2022-11-10 21:06:16 UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
2022-11-10 21:06:21.456 UTC,"upgrayedd","canvas",28815,"10.1.238.101:52316",636d67cd.708f,8,"idle",2022-11-10 21:06:21 UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
2022-11-10 21:06:26.488 UTC,"upgrayedd","canvas",28962,"10.1.238.101:52320",636d67d2.7122,8,"idle",2022-11-10 21:06:26 UTC,148/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
#v-

So, the whole set of missing files was recycled at the same time.
One more thing that I have is a copy of pg_replication_slots info. This data is taken periodically, around every minute. Each line, aside from the pg_replication_slots data, also has the output of pg_current_wal_lsn() and a timestamp. It looks like this:
#v+
timestamp pg_current_wal_lsn slot_name plugin slot_type datoid database temporary active active_pid xmin catalog_xmin restart_lsn confirmed_flush_lsn
...
...
2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0
...
I think from the above two it is clear that the slot with restart_lsn
1039D/83825958 is active and we seem to have recycled the corresponding
segment.
Now, I can keep this situation up for some time, if more debugging would be helpful,
but I don't know for how long (disk space issues).

Is there anything in here that helps? Any other info I can dig out?
You may have the LOG for "attempting to remove WAL segments older than
log file %s"; if so, please share it. Also on similar lines, I think we
need to add some logging before calling RemoveOldXlogFiles() to know
where our computation to remove files goes wrong. Something like the
attached would be helpful, but note that I have added this quickly on my
working branch to show you what I have in mind. You may need small
changes based on the branch you are working on.
--
With Regards,
Amit Kapila.
Attachments:
additional_logs_1.patch
On Mon, Nov 14, 2022 at 05:08:15PM +0530, Amit Kapila wrote:
#v+
2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
...
#v-
So, the whole set of missing files was recycled at the same time.
One more thing that I have is a copy of pg_replication_slots info. This data is taken periodically, around every minute. Each line, aside from the pg_replication_slots data, also has the output of pg_current_wal_lsn() and a timestamp. It looks like this:
#v+
timestamp pg_current_wal_lsn slot_name plugin slot_type datoid database temporary active active_pid xmin catalog_xmin restart_lsn confirmed_flush_lsn
...
2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0
...
I think from the above two it is clear that the slot with restart_lsn
1039D/83825958 is active and we seem to have recycled the corresponding
segment.

You may have the LOG for "attempting to remove WAL segments older than
log file %s", if so, please share. Also on similar lines, I think we

Absolutely.
There is something weird happening:
#v+
2022-11-10 20:01:07.520 UTC,,,28925,,636378d6.70fd,4488667,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039B000000D9",,,,,,,,,""
2022-11-10 20:06:07.359 UTC,,,28925,,636378d6.70fd,4496840,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039B000000F8",,,,,,,,,""
2022-11-10 20:11:07.522 UTC,,,28925,,636378d6.70fd,4505099,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039C00000027",,,,,,,,,""
2022-11-10 20:16:07.466 UTC,,,28925,,636378d6.70fd,4513331,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039C00000054",,,,,,,,,""
2022-11-10 20:21:07.572 UTC,,,28925,,636378d6.70fd,4521466,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039C0000008F",,,,,,,,,""
2022-11-10 20:26:07.384 UTC,,,28925,,636378d6.70fd,4529518,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039C000000B9",,,,,,,,,""
2022-11-10 20:31:07.619 UTC,,,28925,,636378d6.70fd,4537868,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039C000000F8",,,,,,,,,""
2022-11-10 20:36:07.532 UTC,,,28925,,636378d6.70fd,4546079,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000002B",,,,,,,,,""
2022-11-10 20:41:07.603 UTC,,,28925,,636378d6.70fd,4554468,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D00000052",,,,,,,,,""
2022-11-10 20:46:07.609 UTC,,,28925,,636378d6.70fd,4562376,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000005F",,,,,,,,,""
2022-11-10 20:51:07.513 UTC,,,28925,,636378d6.70fd,4570523,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000005F",,,,,,,,,""
2022-11-10 20:56:07.727 UTC,,,28925,,636378d6.70fd,4578553,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D00000070",,,,,,,,,""
2022-11-10 21:01:07.509 UTC,,,28925,,636378d6.70fd,4586721,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D00000082",,,,,,,,,""
2022-11-10 21:06:07.521 UTC,,,28925,,636378d6.70fd,4594902,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:11:07.402 UTC,,,28925,,636378d6.70fd,4602683,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:16:07.837 UTC,,,28925,,636378d6.70fd,4610577,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:21:07.571 UTC,,,28925,,636378d6.70fd,4618490,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:26:07.559 UTC,,,28925,,636378d6.70fd,4626365,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:31:07.649 UTC,,,28925,,636378d6.70fd,4634592,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:36:07.737 UTC,,,28925,,636378d6.70fd,4642829,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:41:07.278 UTC,,,28925,,636378d6.70fd,4650983,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:46:07.670 UTC,,,28925,,636378d6.70fd,4658973,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:51:07.579 UTC,,,28925,,636378d6.70fd,4666946,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:56:07.618 UTC,,,28925,,636378d6.70fd,4674922,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
#v-
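To make the weirdness explicit (my arithmetic, using only values from the log lines above): between 21:01 and 21:06 the cutoff jumped from ...82 to ...8A, while restart_lsn stayed at 1039D/83825958, i.e. inside segment ...83. A rough check:

```shell
# Segment numbers implied by the two cutoff log lines vs. the slot position.
cutoff_2101=$(( 16#1039D * 256 + 16#82 ))   # "older than log file ...039D00000082"
cutoff_2106=$(( 16#1039D * 256 + 16#8A ))   # "older than log file ...039D0000008A"
restart=$((     16#1039D * 256 + 16#83 ))   # segment holding restart_lsn 1039D/83825958
jump=$(( cutoff_2106 - cutoff_2101 ))
past_slot=$(( cutoff_2106 - restart ))
echo "$jump $past_slot"   # 8 7: the cutoff moved 8 segments, ending 7 past the slot's segment
```

So whatever computed the 21:06 cutoff ignored (or saw a stale value of) the slot's restart_lsn.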
If I add the "recycled" lines, we get:
#v+
...
2022-11-10 21:01:07.509 UTC,,,28925,,636378d6.70fd,4586721,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D00000082",,,,,,,,,""
2022-11-10 21:01:07.512 UTC,,,28925,,636378d6.70fd,4586722,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000071""",,,,,,,,,""
2022-11-10 21:01:07.514 UTC,,,28925,,636378d6.70fd,4586723,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000072""",,,,,,,,,""
2022-11-10 21:01:07.515 UTC,,,28925,,636378d6.70fd,4586724,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000073""",,,,,,,,,""
2022-11-10 21:01:07.517 UTC,,,28925,,636378d6.70fd,4586725,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000074""",,,,,,,,,""
2022-11-10 21:01:07.517 UTC,,,28925,,636378d6.70fd,4586726,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000075""",,,,,,,,,""
2022-11-10 21:01:07.519 UTC,,,28925,,636378d6.70fd,4586727,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000076""",,,,,,,,,""
2022-11-10 21:01:07.520 UTC,,,28925,,636378d6.70fd,4586728,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000077""",,,,,,,,,""
2022-11-10 21:01:07.521 UTC,,,28925,,636378d6.70fd,4586729,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000078""",,,,,,,,,""
2022-11-10 21:01:07.522 UTC,,,28925,,636378d6.70fd,4586730,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000079""",,,,,,,,,""
2022-11-10 21:01:07.523 UTC,,,28925,,636378d6.70fd,4586731,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007A""",,,,,,,,,""
2022-11-10 21:01:07.524 UTC,,,28925,,636378d6.70fd,4586732,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007B""",,,,,,,,,""
2022-11-10 21:01:07.524 UTC,,,28925,,636378d6.70fd,4586733,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007C""",,,,,,,,,""
2022-11-10 21:01:07.525 UTC,,,28925,,636378d6.70fd,4586734,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007D""",,,,,,,,,""
2022-11-10 21:01:07.527 UTC,,,28925,,636378d6.70fd,4586735,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007E""",,,,,,,,,""
2022-11-10 21:01:07.528 UTC,,,28925,,636378d6.70fd,4586736,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007F""",,,,,,,,,""
2022-11-10 21:01:07.529 UTC,,,28925,,636378d6.70fd,4586737,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000080""",,,,,,,,,""
2022-11-10 21:01:07.530 UTC,,,28925,,636378d6.70fd,4586738,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000081""",,,,,,,,,""
2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
2022-11-10 21:06:07.521 UTC,,,28925,,636378d6.70fd,4594902,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:06:07.523 UTC,,,28925,,636378d6.70fd,4594903,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000083""",,,,,,,,,""
2022-11-10 21:06:07.524 UTC,,,28925,,636378d6.70fd,4594904,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000084""",,,,,,,,,""
2022-11-10 21:06:07.525 UTC,,,28925,,636378d6.70fd,4594905,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000085""",,,,,,,,,""
2022-11-10 21:06:07.526 UTC,,,28925,,636378d6.70fd,4594906,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000086""",,,,,,,,,""
2022-11-10 21:06:07.527 UTC,,,28925,,636378d6.70fd,4594907,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000087""",,,,,,,,,""
2022-11-10 21:06:07.528 UTC,,,28925,,636378d6.70fd,4594908,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000088""",,,,,,,,,""
2022-11-10 21:06:07.529 UTC,,,28925,,636378d6.70fd,4594909,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000089""",,,,,,,,,""
2022-11-10 21:06:07.530 UTC,,,28925,,636378d6.70fd,4594910,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000008A""",,,,,,,,,""
2022-11-10 21:11:07.402 UTC,,,28925,,636378d6.70fd,4602683,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:16:07.837 UTC,,,28925,,636378d6.70fd,4610577,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:21:07.571 UTC,,,28925,,636378d6.70fd,4618490,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:26:07.559 UTC,,,28925,,636378d6.70fd,4626365,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:31:07.649 UTC,,,28925,,636378d6.70fd,4634592,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:36:07.737 UTC,,,28925,,636378d6.70fd,4642829,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:41:07.278 UTC,,,28925,,636378d6.70fd,4650983,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:46:07.670 UTC,,,28925,,636378d6.70fd,4658973,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:51:07.579 UTC,,,28925,,636378d6.70fd,4666946,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:56:07.618 UTC,,,28925,,636378d6.70fd,4674922,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
#v-
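As an aside, the cutoff name in those "attempting to remove WAL segments older than log file" messages can be decoded back into a segment number and the first LSN that segment covers, which makes it easier to compare against a slot's restart_lsn. A minimal sketch, assuming the default 16MB wal_segment_size; note the message apparently prints the name with timeline 0, which is why it begins with 00000000:

```python
# Decode a WAL segment file name into its segment number and the first
# LSN it covers. Names are TTTTTTTTXXXXXXXXYYYYYYYY: 8 hex digits of
# timeline, then the segment number split as segno // 0x100 and
# segno % 0x100 (mirroring PostgreSQL's XLogFileName convention).

WAL_SEG_SIZE = 16 * 1024 * 1024  # default 16MB wal_segment_size

def segno_from_filename(name: str) -> int:
    hi = int(name[8:16], 16)   # high part of the segment number
    lo = int(name[16:24], 16)  # low part of the segment number
    return hi * 0x100 + lo

def first_lsn_of_segment(segno: int) -> str:
    pos = segno * WAL_SEG_SIZE
    return f"{pos >> 32:X}/{pos & 0xFFFFFFFF:X}"

cutoff = segno_from_filename("000000000001039D0000008A")
print(hex(cutoff))                   # 0x1039d8a
print(first_lsn_of_segment(cutoff))  # 1039D/8A000000
```

So the repeated cutoff above corresponds to LSN 1039D/8A000000, well past the slot's restart_lsn of 1039D/83825958.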
need to add some LOGs before calling RemoveOldXlogFiles() to know
where our computation to remove files goes off. Something like the
attached would be helpful, but note I have added this quickly on my
working branch to show you what I have in mind. You may need small
changes based on the branch you are working on.
That will be complicated.
Changing Pg (recompiling and rerunning) requires careful preparation with
our customers, which takes time.
Additionally, we are using precompiled binaries from pgdg, which makes the
process slightly more involved.
And to top it off, as I mentioned in the other message, the problem doesn't
always happen. Of the 8 clusters I tried to prepare, it happened on two,
and on one of them (this one, actually) it kind of resolved on its own?!
I know I'm asking a lot, but is there any chance we could set up a
conference/shared-screen session to try to get as much data as possible
out of the situations we have now? Of course, this would be paid
consulting. If you're OK with it, please mail me off-list so we can
arrange it.
Best regards,
depesz
On Mon, Nov 14, 2022 at 5:47 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
On Mon, Nov 14, 2022 at 05:08:15PM +0530, Amit Kapila wrote:
#v+
2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""...
#v-
So, the whole set of missing files was recycled at the same time.
One more thing that I have is a copy of the pg_replication_slots info. This data is captured periodically, around every minute. Each line, aside from the pg_replication_slots data, also has the output of pg_current_wal_lsn() and a timestamp. It looks like this:
#v+
timestamp pg_current_wal_lsn slot_name plugin slot_type datoid database temporary active active_pid xmin catalog_xmin restart_lsn confirmed_flush_lsn...
...2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0...
I think from the above two it is clear that the slot with restart_lsn
1039D/83825958 is active and we seem to have recycled the corresponding
segment.

You may have the LOG for "attempting to remove WAL segments older than
log file %s", if so, please share. Also on similar lines, I think we

Absolutely.
There is something weird happening:
What exactly is the weird part you are seeing here? To me, it appears
as if the system, for some reason, ignores an existing slot that has
restart_lsn 1039D/83825958.
If I add the "recycled" lines, we get:
#v+
...
2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
2022-11-10 21:06:07.521 UTC,,,28925,,636378d6.70fd,4594902,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
Actually, the last LOG line is what I was not expecting, given the
existence of a corresponding slot. We have to find out how it
computed the value "000000000001039D0000008A" for the file cutoff.
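For reference, mapping an LSN to the segment file that contains it follows the scheme used by PostgreSQL's XLogFileName macro (the pg_walfile_name() SQL function does the same server-side). A minimal sketch, assuming the default 16MB wal_segment_size and timeline 1:

```python
# Map an LSN string (as shown in restart_lsn) to the WAL segment file
# name that contains it. Assumes default 16MB segments and timeline 1.

WAL_SEG_SIZE = 16 * 1024 * 1024

def lsn_to_bytes(lsn: str) -> int:
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def segment_filename(lsn: str, tli: int = 1) -> str:
    segno = lsn_to_bytes(lsn) // WAL_SEG_SIZE
    # 8 hex digits each: timeline, segno high part, segno low part
    return f"{tli:08X}{segno // 0x100:08X}{segno % 0x100:08X}"

print(segment_filename("1039D/83825958"))  # 000000010001039D00000083
```

By this mapping, restart_lsn 1039D/83825958 falls in segment 000000010001039D00000083, seven segments older than the 0000008A cutoff, which is exactly the contradiction being discussed: that segment was among the recycled files even though the slot still needed it.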
need to add some LOGs before calling RemoveOldXlogFiles() to know
where our computation to remove files goes off. Something like the
attached would be helpful, but note I have added this quickly on my
working branch to show you what I have in mind. You may need small
changes based on the branch you are working on.

That will be complicated.
Changing Pg (recompiling and rerunning) requires careful preparation with
our customers, which takes time.

Additionally, we are using precompiled binaries from pgdg, which makes the
process slightly more involved.
I can understand, but OTOH it is not easy to proceed without more information.
--
With Regards,
Amit Kapila.