Broken logical replication

Started by Игорь Выскоркоabout 4 years ago2 messagesgeneral

vyskorko.igor@yandex.ru

about 4 years ago

<div><div><div>Giving 2 postgres servers:</div><div>1. Master - PostgreSQL 10.16</div><div>2. Slave - PostgreSQL 13.5</div><div> </div><div>Logical replication was configured and worked fine between them. </div><div>At some point software (russian accounting soft: 1C) which is using master DB, decided to do some db-related tasks. I know only about full reindexing - all indexes are built again - probably there were more DDL...</div><div>Ok, when it's finished I found that replication was broken:</div><div>1. Publication object exists but all publication tables were dropped. I mean not tables itself but: </div><div>select * from pg_publication_tables gave 0 rows. Ok, I've added them again - didn't help</div><div>2. In log I've found repeated bunch of records:</div><div>янв 20 23:33:29 testpg postgres[6291]: [9-1] 2022-01-20 16:33:29.087 UTC [6291] ERROR: replication slot "subscr" is active for PID 3984</div><div>янв 20 23:33:34 testpg postgres[6295]: [7-1] 2022-01-20 16:33:34.101 UTC [6295] LOG: connection received: host=192.168.7.225 port=43428</div><div>янв 20 23:33:34 testpg postgres[6295]: [8-1] 2022-01-20 16:33:34.102 UTC [6295] LOG: replication connection authorized: user=rep_user</div><div>янв 20 23:33:34 testpg postgres[6295]: [9-1] 2022-01-20 16:33:34.104 UTC [6295] ERROR: replication slot "subscr" is active for PID 3984</div><div>янв 20 23:33:39 testpg postgres[6298]: [7-1] 2022-01-20 16:33:39.117 UTC [6298] LOG: connection received: host=192.168.7.225 port=43470</div><div>янв 20 23:33:39 testpg postgres[6298]: [8-1] 2022-01-20 16:33:39.118 UTC [6298] LOG: replication connection authorized: user=rep_user</div><div>янв 20 23:33:39 testpg postgres[6298]: [9-1] 2022-01-20 16:33:39.537 UTC [6298] LOG: starting logical decoding for slot "subscr"</div><div>янв 20 23:33:39 testpg postgres[6298]: [9-2] 2022-01-20 16:33:39.537 UTC [6298] DETAIL: streaming transactions committing after 1E0/3449C020, reading WAL from 1D9/EEAD19E8</div><div>янв 20 23:33:39 testpg postgres[6298]: [10-1] 2022-01-20 16:33:39.538 UTC [6298] LOG: logical decoding found consistent point at 1D9/EEAD19E8</div><div>янв 20 23:33:39 testpg postgres[6298]: [10-2] 2022-01-20 16:33:39.538 UTC [6298] DETAIL: There are no running transactions.</div><div> </div><div>3. Replication status:</div><div>db=# select * from pg_stat_replication ;</div><div>-[ RECORD 1 ]----+------------------------------</div><div>pid | 6298</div><div>usesysid | 16384</div><div>usename | rep_user</div><div>application_name | subscr</div><div>client_addr | 192.168.7.225</div><div>client_hostname |</div><div>client_port | 43470</div><div>backend_start | 2022-01-20 16:33:39.117019+00</div><div>backend_xmin |</div><div>state | catchup</div><div>sent_lsn | 1DE/E849BFD8</div><div>write_lsn | 1E0/3449C020</div><div>flush_lsn | 1E0/3449C020</div><div>replay_lsn | 1E0/3449C020</div><div>write_lag |</div><div>flush_lag |</div><div>replay_lag |</div><div>sync_priority | 0</div><div>sync_state | async</div><div> </div><div>sent_lsn is lower than write_lsn, what? Is it legal?)</div><div> </div><div>4. pg_wal directory was bloated - it's ok I suppose because of opened replication slot. </div><div> </div><div>So, my question here: is it possible to fix replication without full restart (truncating tables in slave and copy all data)?</div></div></div>

Игорь Выскорко

vyskorko.igor@yandex.ru

about 4 years ago

In reply to: Игорь Выскорко (#1)

Re: Broken logical replication

Hi experts!
I didn't notice html tags in previous message. Sorry for that. Hope that
was the only reason of no answers :)

Original message was:
Giving 2 postgres servers:
1. Master - PostgreSQL 10.16
2. Slave - PostgreSQL 13.5
Logical replication was configured and worked fine between them.
At some point software (russian accounting soft: 1C) which is using
master DB, decided to do some db-related tasks. I know only about full
reindexing - all indexes are built again - probably there were more DDL...
Ok, when it's finished I found that replication was broken:
1. Publication object exists but all publication tables were dropped. I
mean not tables itself but:
select * from pg_publication_tables gave 0 rows. Ok, I've added them
again - didn't help
2. In log I've found repeated bunch of records:
янв 20 23:33:29 testpg postgres[6291]ERROR: replication slot "subscr" is active for PID 3984 янв 20 23:33:34 testpg postgres[6295]: [7-1] 2022-01-20 16:33:34.101 UTC: [9-1] 2022-01-20 16:33:29.087 UTC
[6291]: ERROR: replication slot "subscr" is active for PID 3984 янв 20 23:33:34 testpg postgres[6295]: [7-1] 2022-01-20 16:33:34.101 UTC
янв 20 23:33:34 testpg postgres[6295]ERROR: replication slot "subscr" is active for PID 3984 янв 20 23:33:39 testpg postgres[6298]: [7-1] 2022-01-20 16:33:39.117 UTC: [7-1] 2022-01-20 16:33:34.101 UTC
[6295]: ERROR: replication slot "subscr" is active for PID 3984 янв 20 23:33:39 testpg postgres[6298]: [7-1] 2022-01-20 16:33:39.117 UTC
янв 20 23:33:34 testpg postgres[6295]ERROR: replication slot "subscr" is active for PID 3984 янв 20 23:33:39 testpg postgres[6298]: [7-1] 2022-01-20 16:33:39.117 UTC: [8-1] 2022-01-20 16:33:34.102 UTC
[6295]: ERROR: replication slot "subscr" is active for PID 3984 янв 20 23:33:39 testpg postgres[6298]: [7-1] 2022-01-20 16:33:39.117 UTC
янв 20 23:33:34 testpg postgres[6295]ERROR: replication slot "subscr" is active for PID 3984 янв 20 23:33:39 testpg postgres[6298]: [7-1] 2022-01-20 16:33:39.117 UTC: [9-1] 2022-01-20 16:33:34.104 UTC
[6295]: ERROR: replication slot "subscr" is active for PID 3984 янв 20 23:33:39 testpg postgres[6298]: [7-1] 2022-01-20 16:33:39.117 UTC
янв 20 23:33:39 testpg postgres[6298]DETAIL: streaming transactions committing after 1E0/3449C020, reading WAL from 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-1] 2022-01-20 16:33:39.538 UTC [6298] LOG: logical decoding found consistent point at 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-2] 2022-01-20 16:33:39.538 UTC [6298] DETAIL: There are no running transactions. 3. Replication status: db=# select * from pg_stat_replication ; -[ RECORD 1 ]----+------------------------------ pid | 6298 usesysid | 16384 usename | rep_user application_name | subscr client_addr | 192.168.7.225 client_hostname | client_port | 43470 backend_start | 2022-01-20 16:33:39.117019+00 backend_xmin | state | catchup sent_lsn | 1DE/E849BFD8 write_lsn | 1E0/3449C020 flush_lsn | 1E0/3449C020 replay_lsn | 1E0/3449C020 write_lag | flush_lag | replay_lag | sync_priority | 0 sync_state | async sent_lsn is lower than write_lsn, what? Is it legal?) 4. pg_wal directory was bloated - it's ok I suppose because of opened replication slot. So, my question here: is it possible to fix replication without full restart (truncating tables in slave and copy all data)?: [7-1] 2022-01-20 16:33:39.117 UTC
[6298]: DETAIL: streaming transactions committing after 1E0/3449C020, reading WAL from 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-1] 2022-01-20 16:33:39.538 UTC [6298] LOG: logical decoding found consistent point at 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-2] 2022-01-20 16:33:39.538 UTC [6298] DETAIL: There are no running transactions. 3. Replication status: db=# select * from pg_stat_replication ; -[ RECORD 1 ]----+------------------------------ pid | 6298 usesysid | 16384 usename | rep_user application_name | subscr client_addr | 192.168.7.225 client_hostname | client_port | 43470 backend_start | 2022-01-20 16:33:39.117019+00 backend_xmin | state | catchup sent_lsn | 1DE/E849BFD8 write_lsn | 1E0/3449C020 flush_lsn | 1E0/3449C020 replay_lsn | 1E0/3449C020 write_lag | flush_lag | replay_lag | sync_priority | 0 sync_state | async sent_lsn is lower than write_lsn, what? Is it legal?) 4. pg_wal directory was bloated - it's ok I suppose because of opened replication slot. So, my question here: is it possible to fix replication without full restart (truncating tables in slave and copy all data)?
янв 20 23:33:39 testpg postgres[6298]DETAIL: streaming transactions committing after 1E0/3449C020, reading WAL from 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-1] 2022-01-20 16:33:39.538 UTC [6298] LOG: logical decoding found consistent point at 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-2] 2022-01-20 16:33:39.538 UTC [6298] DETAIL: There are no running transactions. 3. Replication status: db=# select * from pg_stat_replication ; -[ RECORD 1 ]----+------------------------------ pid | 6298 usesysid | 16384 usename | rep_user application_name | subscr client_addr | 192.168.7.225 client_hostname | client_port | 43470 backend_start | 2022-01-20 16:33:39.117019+00 backend_xmin | state | catchup sent_lsn | 1DE/E849BFD8 write_lsn | 1E0/3449C020 flush_lsn | 1E0/3449C020 replay_lsn | 1E0/3449C020 write_lag | flush_lag | replay_lag | sync_priority | 0 sync_state | async sent_lsn is lower than write_lsn, what? Is it legal?) 4. pg_wal directory was bloated - it's ok I suppose because of opened replication slot. So, my question here: is it possible to fix replication without full restart (truncating tables in slave and copy all data)?: [8-1] 2022-01-20 16:33:39.118 UTC
[6298]: DETAIL: streaming transactions committing after 1E0/3449C020, reading WAL from 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-1] 2022-01-20 16:33:39.538 UTC [6298] LOG: logical decoding found consistent point at 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-2] 2022-01-20 16:33:39.538 UTC [6298] DETAIL: There are no running transactions. 3. Replication status: db=# select * from pg_stat_replication ; -[ RECORD 1 ]----+------------------------------ pid | 6298 usesysid | 16384 usename | rep_user application_name | subscr client_addr | 192.168.7.225 client_hostname | client_port | 43470 backend_start | 2022-01-20 16:33:39.117019+00 backend_xmin | state | catchup sent_lsn | 1DE/E849BFD8 write_lsn | 1E0/3449C020 flush_lsn | 1E0/3449C020 replay_lsn | 1E0/3449C020 write_lag | flush_lag | replay_lag | sync_priority | 0 sync_state | async sent_lsn is lower than write_lsn, what? Is it legal?) 4. pg_wal directory was bloated - it's ok I suppose because of opened replication slot. So, my question here: is it possible to fix replication without full restart (truncating tables in slave and copy all data)?
янв 20 23:33:39 testpg postgres[6298]DETAIL: streaming transactions committing after 1E0/3449C020, reading WAL from 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-1] 2022-01-20 16:33:39.538 UTC [6298] LOG: logical decoding found consistent point at 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-2] 2022-01-20 16:33:39.538 UTC [6298] DETAIL: There are no running transactions. 3. Replication status: db=# select * from pg_stat_replication ; -[ RECORD 1 ]----+------------------------------ pid | 6298 usesysid | 16384 usename | rep_user application_name | subscr client_addr | 192.168.7.225 client_hostname | client_port | 43470 backend_start | 2022-01-20 16:33:39.117019+00 backend_xmin | state | catchup sent_lsn | 1DE/E849BFD8 write_lsn | 1E0/3449C020 flush_lsn | 1E0/3449C020 replay_lsn | 1E0/3449C020 write_lag | flush_lag | replay_lag | sync_priority | 0 sync_state | async sent_lsn is lower than write_lsn, what? Is it legal?) 4. pg_wal directory was bloated - it's ok I suppose because of opened replication slot. So, my question here: is it possible to fix replication without full restart (truncating tables in slave and copy all data)?: [9-1] 2022-01-20 16:33:39.537 UTC
[6298]: DETAIL: streaming transactions committing after 1E0/3449C020, reading WAL from 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-1] 2022-01-20 16:33:39.538 UTC [6298] LOG: logical decoding found consistent point at 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-2] 2022-01-20 16:33:39.538 UTC [6298] DETAIL: There are no running transactions. 3. Replication status: db=# select * from pg_stat_replication ; -[ RECORD 1 ]----+------------------------------ pid | 6298 usesysid | 16384 usename | rep_user application_name | subscr client_addr | 192.168.7.225 client_hostname | client_port | 43470 backend_start | 2022-01-20 16:33:39.117019+00 backend_xmin | state | catchup sent_lsn | 1DE/E849BFD8 write_lsn | 1E0/3449C020 flush_lsn | 1E0/3449C020 replay_lsn | 1E0/3449C020 write_lag | flush_lag | replay_lag | sync_priority | 0 sync_state | async sent_lsn is lower than write_lsn, what? Is it legal?) 4. pg_wal directory was bloated - it's ok I suppose because of opened replication slot. So, my question here: is it possible to fix replication without full restart (truncating tables in slave and copy all data)?
янв 20 23:33:39 testpg postgres[6298]DETAIL: streaming transactions committing after 1E0/3449C020, reading WAL from 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-1] 2022-01-20 16:33:39.538 UTC [6298] LOG: logical decoding found consistent point at 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-2] 2022-01-20 16:33:39.538 UTC [6298] DETAIL: There are no running transactions. 3. Replication status: db=# select * from pg_stat_replication ; -[ RECORD 1 ]----+------------------------------ pid | 6298 usesysid | 16384 usename | rep_user application_name | subscr client_addr | 192.168.7.225 client_hostname | client_port | 43470 backend_start | 2022-01-20 16:33:39.117019+00 backend_xmin | state | catchup sent_lsn | 1DE/E849BFD8 write_lsn | 1E0/3449C020 flush_lsn | 1E0/3449C020 replay_lsn | 1E0/3449C020 write_lag | flush_lag | replay_lag | sync_priority | 0 sync_state | async sent_lsn is lower than write_lsn, what? Is it legal?) 4. pg_wal directory was bloated - it's ok I suppose because of opened replication slot. So, my question here: is it possible to fix replication without full restart (truncating tables in slave and copy all data)?: [9-2] 2022-01-20 16:33:39.537 UTC
[6298]: DETAIL: streaming transactions committing after 1E0/3449C020, reading WAL from 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-1] 2022-01-20 16:33:39.538 UTC [6298] LOG: logical decoding found consistent point at 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-2] 2022-01-20 16:33:39.538 UTC [6298] DETAIL: There are no running transactions. 3. Replication status: db=# select * from pg_stat_replication ; -[ RECORD 1 ]----+------------------------------ pid | 6298 usesysid | 16384 usename | rep_user application_name | subscr client_addr | 192.168.7.225 client_hostname | client_port | 43470 backend_start | 2022-01-20 16:33:39.117019+00 backend_xmin | state | catchup sent_lsn | 1DE/E849BFD8 write_lsn | 1E0/3449C020 flush_lsn | 1E0/3449C020 replay_lsn | 1E0/3449C020 write_lag | flush_lag | replay_lag | sync_priority | 0 sync_state | async sent_lsn is lower than write_lsn, what? Is it legal?) 4. pg_wal directory was bloated - it's ok I suppose because of opened replication slot. So, my question here: is it possible to fix replication without full restart (truncating tables in slave and copy all data)?
reading WAL from 1D9/EEAD19E8
янв 20 23:33:39 testpg postgres[6298]DETAIL: streaming transactions committing after 1E0/3449C020, reading WAL from 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-1] 2022-01-20 16:33:39.538 UTC [6298] LOG: logical decoding found consistent point at 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-2] 2022-01-20 16:33:39.538 UTC [6298] DETAIL: There are no running transactions. 3. Replication status: db=# select * from pg_stat_replication ; -[ RECORD 1 ]----+------------------------------ pid | 6298 usesysid | 16384 usename | rep_user application_name | subscr client_addr | 192.168.7.225 client_hostname | client_port | 43470 backend_start | 2022-01-20 16:33:39.117019+00 backend_xmin | state | catchup sent_lsn | 1DE/E849BFD8 write_lsn | 1E0/3449C020 flush_lsn | 1E0/3449C020 replay_lsn | 1E0/3449C020 write_lag | flush_lag | replay_lag | sync_priority | 0 sync_state | async sent_lsn is lower than write_lsn, what? Is it legal?) 4. pg_wal directory was bloated - it's ok I suppose because of opened replication slot. So, my question here: is it possible to fix replication without full restart (truncating tables in slave and copy all data)?: [10-1] 2022-01-20 16:33:39.538
UTC [6298]DETAIL: streaming transactions committing after 1E0/3449C020, reading WAL from 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-1] 2022-01-20 16:33:39.538 UTC [6298] LOG: logical decoding found consistent point at 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-2] 2022-01-20 16:33:39.538 UTC [6298] DETAIL: There are no running transactions. 3. Replication status: db=# select * from pg_stat_replication ; -[ RECORD 1 ]----+------------------------------ pid | 6298 usesysid | 16384 usename | rep_user application_name | subscr client_addr | 192.168.7.225 client_hostname | client_port | 43470 backend_start | 2022-01-20 16:33:39.117019+00 backend_xmin | state | catchup sent_lsn | 1DE/E849BFD8 write_lsn | 1E0/3449C020 flush_lsn | 1E0/3449C020 replay_lsn | 1E0/3449C020 write_lag | flush_lag | replay_lag | sync_priority | 0 sync_state | async sent_lsn is lower than write_lsn, what? Is it legal?) 4. pg_wal directory was bloated - it's ok I suppose because of opened replication slot. So, my question here: is it possible to fix replication without full restart (truncating tables in slave and copy all data)? LOG: logical decoding found consistent point at 1D9/EEAD19E8
янв 20 23:33:39 testpg postgres[6298]DETAIL: streaming transactions committing after 1E0/3449C020, reading WAL from 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-1] 2022-01-20 16:33:39.538 UTC [6298] LOG: logical decoding found consistent point at 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-2] 2022-01-20 16:33:39.538 UTC [6298] DETAIL: There are no running transactions. 3. Replication status: db=# select * from pg_stat_replication ; -[ RECORD 1 ]----+------------------------------ pid | 6298 usesysid | 16384 usename | rep_user application_name | subscr client_addr | 192.168.7.225 client_hostname | client_port | 43470 backend_start | 2022-01-20 16:33:39.117019+00 backend_xmin | state | catchup sent_lsn | 1DE/E849BFD8 write_lsn | 1E0/3449C020 flush_lsn | 1E0/3449C020 replay_lsn | 1E0/3449C020 write_lag | flush_lag | replay_lag | sync_priority | 0 sync_state | async sent_lsn is lower than write_lsn, what? Is it legal?) 4. pg_wal directory was bloated - it's ok I suppose because of opened replication slot. So, my question here: is it possible to fix replication without full restart (truncating tables in slave and copy all data)?: [10-2] 2022-01-20 16:33:39.538
UTC [6298]DETAIL: streaming transactions committing after 1E0/3449C020, reading WAL from 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-1] 2022-01-20 16:33:39.538 UTC [6298] LOG: logical decoding found consistent point at 1D9/EEAD19E8 янв 20 23:33:39 testpg postgres[6298]: [10-2] 2022-01-20 16:33:39.538 UTC [6298] DETAIL: There are no running transactions. 3. Replication status: db=# select * from pg_stat_replication ; -[ RECORD 1 ]----+------------------------------ pid | 6298 usesysid | 16384 usename | rep_user application_name | subscr client_addr | 192.168.7.225 client_hostname | client_port | 43470 backend_start | 2022-01-20 16:33:39.117019+00 backend_xmin | state | catchup sent_lsn | 1DE/E849BFD8 write_lsn | 1E0/3449C020 flush_lsn | 1E0/3449C020 replay_lsn | 1E0/3449C020 write_lag | flush_lag | replay_lag | sync_priority | 0 sync_state | async sent_lsn is lower than write_lsn, what? Is it legal?) 4. pg_wal directory was bloated - it's ok I suppose because of opened replication slot. So, my question here: is it possible to fix replication without full restart (truncating tables in slave and copy all data)? DETAIL: There are no running transactions.
3. Replication status:
db=# select * from pg_stat_replication ;
-[ RECORD 1 ]----+------------------------------
pid | 6298
usesysid | 16384
usename | rep_user
application_name | subscr
client_addr | 192.168.7.225
client_hostname |
client_port | 43470
backend_start | 2022-01-20 16:33:39.117019+00
backend_xmin |
state | catchup
sent_lsn | 1DE/E849BFD8
write_lsn | 1E0/3449C020
flush_lsn | 1E0/3449C020
replay_lsn | 1E0/3449C020
write_lag |
flush_lag |
replay_lag |
sync_priority | 0
sync_state | async
sent_lsn is lower than write_lsn, what? Is it legal?)
4. pg_wal directory was bloated - it's ok I suppose because of opened
replication slot.
So, my question here: is it possible to fix replication without full
restart (truncating tables in slave and copy all data)?