Logical replication timeout problem
Hi,
Logical replication is configured on one instance running version 10.18. Timeout errors occur regularly and the worker process exits with exit code 1:
2021-09-16 12:06:50 CEST [24881]: [1-1] user=postgres,db=foo,client=[local] LOG: duration: 1281408.171 ms statement: COPY schem.tab (col1, col2) FROM stdin;
2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG: automatic analyze of table "foo.schem.tab" system usage: CPU: user: 4.13 s, system: 0.55 s, elapsed: 9.58 s
2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR: terminating logical replication worker due to timeout
2021-09-16 12:07:50 CEST [12546]: [11-1] user=,db=,client= LOG: worker process: logical replication worker for subscription 24106654 (PID 3770) exited with exit code 1
2021-09-16 12:07:50 CEST [13872]: [1-1] user=,db=,client= LOG: logical replication apply worker for subscription "subxxxx" has started
2021-09-16 12:07:50 CEST [13873]: [1-1] user=repuser,db=foo,client=127.0.0.1 LOG: received replication command: IDENTIFY_SYSTEM
Why does this happen?
Thanks a lot for your help
Fabrice
On Fri, Sep 17, 2021 at 3:29 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
Hi,
Logical replication is configured on one instance in version 10.18. Timeout errors occur regularly and the worker process exit with an exit code 1
2021-09-16 12:06:50 CEST [24881]: [1-1] user=postgres,db=foo,client=[local] LOG: duration: 1281408.171 ms statement: COPY schem.tab (col1, col2) FROM stdin;
2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG: automatic analyze of table "foo.schem.tab" system usage: CPU: user: 4.13 s, system: 0.55 s, elapsed: 9.58 s
2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR: terminating logical replication worker due to timeout
2021-09-16 12:07:50 CEST [12546]: [11-1] user=,db=,client= LOG: worker process: logical replication worker for subscription 24106654 (PID 3770) exited with exit code 1
2021-09-16 12:07:50 CEST [13872]: [1-1] user=,db=,client= LOG: logical replication apply worker for subscription "subxxxx" has started
2021-09-16 12:07:50 CEST [13873]: [1-1] user=repuser,db=foo,client=127.0.0.1 LOG: received replication command: IDENTIFY_SYSTEM
Can you share the publisher-side log as well?
--
With Regards,
Amit Kapila.
The publisher and the subscriber run on the same Postgres instance.
Regards,
Fabrice
On Fri, Sep 17, 2021 at 8:08 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
the publisher and the subscriber run on the same postgres instance.
Okay, but there is no log corresponding to the operations being performed by the publisher. From the current logs it is not clear to me what might have caused this. Did you try increasing wal_sender_timeout and wal_receiver_timeout?
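For reference, both settings default to 1 minute and can be changed with a reload; a minimal sketch (the 5-minute value is only an example, not a recommendation):

ALTER SYSTEM SET wal_sender_timeout = '5min';
ALTER SYSTEM SET wal_receiver_timeout = '5min';
SELECT pg_reload_conf();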
--
With Regards,
Amit Kapila.
Hi Amit,
We can reproduce the problem: we load a table of several GB into the schema of the publisher, and this triggers the worker's timeout one minute after the end of the load. The table on which this load is executed is not replicated.
2021-09-16 12:06:50 CEST [24881]: [1-1] user=postgres,db=db012a00,client=[local] LOG: duration: 1281408.171 ms statement: COPY db.table (col1, col2) FROM stdin;
2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG: automatic analyze of table "db.table" system usage: CPU: user: 4.13 s, system: 0.55 s, elapsed: 9.58 s
2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR: terminating logical replication worker due to timeout
Before increasing the values of wal_sender_timeout and wal_receiver_timeout, I thought I would further investigate the mechanisms leading to this timeout.
Thanks for your help
Fabrice
On Mon, Sep 20, 2021 at 4:10 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
Hi Amit,
We can replay the problem: we load a table of several Gb in the schema of the publisher, this generates the worker's timeout after one minute from the end of this load. The table on which this load is executed is not replicated.
2021-09-16 12:06:50 CEST [24881]: [1-1] user=postgres,db=db012a00,client=[local] LOG: duration: 1281408.171 ms statement: COPY db.table (col1, col2) FROM stdin;
2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG: automatic analyze of table "db.table " system usage: CPU: user: 4.13 s, system: 0.55 s, elapsed: 9.58 s
2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR: terminating logical replication worker due to timeout
Before increasing value for wal_sender_timeout and wal_receiver_timeout I thought to further investigate the mechanisms leading to this timeout.
The basic problem here seems to be that the WAL sender is not able to send a keepalive or any other message within the configured wal_receiver_timeout. I am not sure how that can happen, but can you try once with autovacuum = off? I want to ensure that the WAL sender is not blocked by the background autovacuum process.
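For example, a minimal sketch for the test (switching autovacuum off is only meant for the duration of this experiment):

ALTER SYSTEM SET autovacuum = off;
SELECT pg_reload_conf();
-- and to restore the default afterwards:
ALTER SYSTEM RESET autovacuum;
SELECT pg_reload_conf();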
--
With Regards,
Amit Kapila.
On Mon, Sep 20, 2021 at 5:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
The other thing we can try is to check the data in pg_locks on the publisher during the minute after the large copy has finished. We can try this both with and without autovacuum.
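Something along these lines could be run on the publisher during that minute (a sketch; the 'walsender' filter relies on the pg_stat_activity.backend_type column available in v10):

SELECT l.pid, a.backend_type, l.locktype, l.mode, l.granted, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted OR a.backend_type = 'walsender';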
--
With Regards,
Amit Kapila.
With the autovacuum parameter set to off, the problem did not occur right after loading the table as in our previous tests; however, the timeout occurred later. We have seen several GB of .snap files accumulate.
...
-rw-------. 1 postgres postgres 16791226 Sep 20 15:26 xid-1238444701-lsn-2D2B-F5000000.snap
-rw-------. 1 postgres postgres 16973268 Sep 20 15:26 xid-1238444701-lsn-2D2B-F6000000.snap
-rw-------. 1 postgres postgres 16790984 Sep 20 15:26 xid-1238444701-lsn-2D2B-F7000000.snap
-rw-------. 1 postgres postgres 16988112 Sep 20 15:26 xid-1238444701-lsn-2D2B-F8000000.snap
-rw-------. 1 postgres postgres 16864593 Sep 20 15:26 xid-1238444701-lsn-2D2B-F9000000.snap
-rw-------. 1 postgres postgres 16902167 Sep 20 15:26 xid-1238444701-lsn-2D2B-FA000000.snap
-rw-------. 1 postgres postgres 16914638 Sep 20 15:26 xid-1238444701-lsn-2D2B-FB000000.snap
-rw-------. 1 postgres postgres 16782471 Sep 20 15:26 xid-1238444701-lsn-2D2B-FC000000.snap
-rw-------. 1 postgres postgres 16963667 Sep 20 15:27 xid-1238444701-lsn-2D2B-FD000000.snap
...
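A rough way to watch this accumulation from SQL (superuser only; the slot name 'subxxxx' below is a placeholder for the real slot):

SELECT count(*) AS snap_files,
       pg_size_pretty(sum((pg_stat_file('pg_replslot/subxxxx/' || f)).size)) AS total_size
FROM pg_ls_dir('pg_replslot/subxxxx') AS f
WHERE f LIKE '%.snap';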
2021-09-20 17:11:29 CEST [12687]: [1283-1] user=,db=,client= LOG: checkpoint starting: time
2021-09-20 17:11:31 CEST [12687]: [1284-1] user=,db=,client= LOG: checkpoint complete: wrote 13 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=1.713 s, sync=0.001 s, total=1.718 s; sync files=12, longest=0.001 s, average=0.001 s; distance=29 kB, estimate=352191 kB
2021-09-20 17:12:43 CEST [59986]: [2-1] user=,db=,client= ERROR: terminating logical replication worker due to timeout
2021-09-20 17:12:43 CEST [12546]: [1068-1] user=,db=,client= LOG: worker process: logical replication worker for subscription 24215702 (PID 59986) exited with exit code 1
2021-09-20 17:12:43 CEST [39945]: [1-1] user=,db=,client= LOG: logical replication apply worker for subscription "sub" has started
2021-09-20 17:12:43 CEST [39946]: [1-1] user=repuser,db=db,client=127.0.0.1 LOG: received replication command: IDENTIFY_SYSTEM
Regards,
Fabrice
On Mon, Sep 20, 2021 at 9:43 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
By passing the autovacuum parameter to off the problem did not occur right after loading the table as in our previous tests. However, the timeout occurred later. We have seen the accumulation of .snap files for several Gb.
...
-rw-------. 1 postgres postgres 16791226 Sep 20 15:26 xid-1238444701-lsn-2D2B-F5000000.snap
-rw-------. 1 postgres postgres 16973268 Sep 20 15:26 xid-1238444701-lsn-2D2B-F6000000.snap
-rw-------. 1 postgres postgres 16790984 Sep 20 15:26 xid-1238444701-lsn-2D2B-F7000000.snap
-rw-------. 1 postgres postgres 16988112 Sep 20 15:26 xid-1238444701-lsn-2D2B-F8000000.snap
-rw-------. 1 postgres postgres 16864593 Sep 20 15:26 xid-1238444701-lsn-2D2B-F9000000.snap
-rw-------. 1 postgres postgres 16902167 Sep 20 15:26 xid-1238444701-lsn-2D2B-FA000000.snap
-rw-------. 1 postgres postgres 16914638 Sep 20 15:26 xid-1238444701-lsn-2D2B-FB000000.snap
-rw-------. 1 postgres postgres 16782471 Sep 20 15:26 xid-1238444701-lsn-2D2B-FC000000.snap
-rw-------. 1 postgres postgres 16963667 Sep 20 15:27 xid-1238444701-lsn-2D2B-FD000000.snap
...
Okay, I am still not sure why the publisher is not sending keepalive messages while spilling such a big transaction. If you see, we have logic in WalSndLoop() wherein each time after sending data we check whether we need to send a keepalive message via the function WalSndKeepaliveIfNecessary(). I think to debug this problem further you need to add some logs in WalSndKeepaliveIfNecessary() to see why it is not sending keepalive messages while all these files are being created.
Did you change the default value of
wal_sender_timeout/wal_receiver_timeout? What is the value of those
variables in your environment? Did you see the message "terminating
walsender process due to replication timeout" in your server logs?
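For reference, the current values and where they were set can be checked with:

SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('wal_sender_timeout', 'wal_receiver_timeout');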
--
With Regards,
Amit Kapila.
If I understand correctly, the instruction for the WAL sender to send a keepalive is never reached in the for loop; for what reason?
...
/* Check for replication timeout. */
WalSndCheckTimeOut();

/* Send keepalive if the time has come */
WalSndKeepaliveIfNecessary();
...
The data load is performed on a table which is not replicated; I do not understand why the whole transaction linked to the insert is copied to snap files, given that the table does not take part in the logical replication.
We are going to run a test with the wal_sender_timeout/wal_receiver_timeout parameters changed from 1 minute to 5 minutes. The problem is that these parameters are global, and changing them will also impact physical replication.
Concerning the walsender timeout: when the worker is started again after a timeout, it triggers a new walsender associated with it.
postgres 55680 12546 0 Sep20 ? 00:00:02 postgres: aq: bgworker: logical replication worker for subscription 24651602
postgres 55681 12546 0 Sep20 ? 00:00:00 postgres: aq: wal sender process repuser 127.0.0.1(57930) idle
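The pairing can also be checked from SQL; for example (a sketch):

-- on the publisher: one row per walsender
SELECT pid, application_name, state, backend_start FROM pg_stat_replication;
-- on the subscriber: one row per apply/sync worker
SELECT subname, pid, received_lsn, latest_end_time FROM pg_stat_subscription;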
Kind Regards
Fabrice
On Tue, Sep 21, 2021 at 1:52 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
If I understand, the instruction to send keep alive by the wal sender has not been reached in the for loop, for what reason?
...
/* Check for replication timeout. */
WalSndCheckTimeOut();

/* Send keepalive if the time has come */
WalSndKeepaliveIfNecessary();
...
Are you sure that these functions have not been called? Or is it the case that they are called but for some reason the keepalive is not sent? IIUC, these are called after processing each WAL record, so I am not sure how it is possible in your case that they are not reached.
The data load is performed on a table which is not replicated, I do not understand why the whole transaction linked to an insert is copied to snap files given that table does not take part of the logical replication.
It is because we don't know till the end of the transaction (where we start sending the data) whether the table will be replicated or not. I think specifically for this purpose the new 'streaming' feature introduced in PG-14 will help us to avoid writing the data of such tables to snap/spill files. See the 'streaming' option in the Create Subscription docs [1].
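For illustration, on PG 14+ it could be enabled like this (subscription, publication, and connection string are placeholders):

CREATE SUBSCRIPTION subxxxx
    CONNECTION 'host=127.0.0.1 dbname=foo user=repuser'
    PUBLICATION pubxxxx
    WITH (streaming = on);
-- or, for an existing subscription:
ALTER SUBSCRIPTION subxxxx SET (streaming = on);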
We are going to do a test by modifying parameters wal_sender_timeout/wal_receiver_timeout from 1' to 5'. The problem is that these parameters are global and changing them will also impact the physical replication.
Do you mean you are planning to change from 1 minute to 5 minutes? I agree about the global nature of the parameters, and I think your approach of finding out the root cause first is good here, because otherwise, under a similar or heavier workload, it might lead to the same situation.
Concerning the walsender timeout, when the worker is started again after a timeout, it will trigger a new walsender associated with it.
Right, I know that but I was curious to know if the walsender has
exited before walreceiver.
[1]: https://www.postgresql.org/docs/devel/sql-createsubscription.html
--
With Regards,
Amit Kapila.
IIUC, these are called after processing each WAL record so not sure how is it possible in your case that these are not reached?
I don't know; as you say, to pinpoint the problem we would have to debug the WalSndKeepaliveIfNecessary function.
I was curious to know if the walsender has exited before walreceiver
During the last tests we ran, we did not observe any timeout of the WAL sender process.
Do you mean you are planning to change from 1 minute to 5 minutes?
We set wal_sender_timeout/wal_receiver_timeout to 5 minutes and launched a new test. The result is surprising and rather positive: there is no timeout anymore in the log, and the 20 GB of snap files are removed in less than 5 minutes. How can this behaviour be explained? Why are the snap files suddenly consumed so quickly?
I chose the values for the wal_sender_timeout/wal_receiver_timeout parameters arbitrarily; are these values appropriate from your point of view?
Best Regards
Fabrice
On Tue, Sep 21, 2021 at 9:12 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
IIUC, these are called after processing each WAL record so not
sure how is it possible in your case that these are not reached?
I don't know, as you say, to highlight the problem we would have to debug the WalSndKeepaliveIfNecessary function
I was curious to know if the walsender has exited before walreceiver
During the last tests we made we didn't observe any timeout of the wal sender process.
Do you mean you are planning to change from 1 minute to 5 minutes?
We set wal_sender_timeout/wal_receiver_timeout to 5' and launch new test. The result is surprising and rather positive there is no timeout any more in the log and the 20Gb of snap files are removed in less than 5 minutes.
How to explain that behaviour, why the snap files are consumed suddenly so quickly.
I think it is because we decide that the data in those snap files
doesn't need to be sent at xact end, so we remove them.
I choose the value arbitrarily for wal_sender_timeout/wal_receiver_timeout parameters, are theses values appropriate from your point of view?
It is difficult to say what the appropriate value for these parameters is unless we debug WalSndKeepaliveIfNecessary() in some way to find out why it didn't send a keepalive when expected. Would you be able to make code changes and test, or, if you want, I can make the changes and send a patch for you to test? If not, is it possible for you to send a reproducible test in some way?
--
With Regards,
Amit Kapila.
If you like, I can test the patch you send me.
Regards
Fabrice
On Wed, Sep 22, 2021 at 9:46 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
If you would like I can test the patch you send to me.
Okay, please find attached a patch that adds some logging. I would like to see the logs during the time when the walsender appears to be writing to files. We might need to add more logs to find the exact problem, but let's start with this.
--
With Regards,
Amit Kapila.
Attachments:
log_keep_alive_1.patch (application/octet-stream; +4 −0)
Thanks for your patch; we are going to set up a lab in order to debug the function.
Regards
Fabrice
On Friday, September 24, 2021 12:04 AM, Fabrice Chapuis <fabrice636861@gmail.com> wrote:
Thanks for your patch, we are going to set up a lab in order to debug the function.
Hi
I tried to reproduce this timeout problem on version 10.18 but failed.
In my trial, I inserted large amounts of data at the publisher, which took more than 1 minute to replicate.
And with the patch provided by Amit, I saw that the frequency of invoking the WalSndKeepaliveIfNecessary function increased after I inserted data.
The test script is attached. Maybe you can try it on your machine and check whether this problem happens. If I missed something in the script, please let me know.
Of course, it would be better if you could provide your script to reproduce the problem.
Regards
Tang
Thanks, Tang, for your script.
Our debugging environment will be ready soon. I will test your script, and we will try to reproduce the problem by integrating the patch provided by Amit. As soon as I have results, I will let you know.
Regards
Fabrice
Hello,
Our lab is ready now. Amit, I compiled Postgres 10.18 with your patch. Tang, I used your script to configure logical replication between 2 databases and to generate 10 million entries in an unreplicated foo table. On a standalone instance, no error message appears in the log.
I activated physical replication between 2 nodes, and I got the following error:
2021-11-10 10:49:12.297 CET [12126] LOG: attempt to send keep alive message
2021-11-10 10:49:12.297 CET [12126] STATEMENT: START_REPLICATION 0/3000000 TIMELINE 1
2021-11-10 10:49:15.127 CET [12064] FATAL: terminating logical replication worker due to administrator command
2021-11-10 10:49:15.127 CET [12036] LOG: worker process: logical replication worker for subscription 16413 (PID 12064) exited with exit code 1
2021-11-10 10:49:15.155 CET [12126] LOG: attempt to send keep alive message
This message looks strange because no admin command was executed during the data load.
I did not find any error related to the timeout.
The message coming from the modification made with the patch appears all the time: "attempt to send keep alive message". But there is never a "sent keep alive message".
Why does the logical replication worker exit when physical replication is configured?
Thanks for your help
Fabrice
On Thu, Nov 11, 2021 at 11:15 PM Fabrice Chapuis
<fabrice636861@gmail.com> wrote:
Hello,
Our lab is ready now. Amit, I compile Postgres 10.18 with your patch. Tang, I used your script to configure logical replication between 2 databases and to generate 10 million entries in an unreplicated foo table. On a standalone instance no error message appears in log.
I activate the physical replication between 2 nodes, and I got following error:
2021-11-10 10:49:12.297 CET [12126] LOG: attempt to send keep alive message
2021-11-10 10:49:12.297 CET [12126] STATEMENT: START_REPLICATION 0/3000000 TIMELINE 1
2021-11-10 10:49:15.127 CET [12064] FATAL: terminating logical replication worker due to administrator command
2021-11-10 10:49:15.127 CET [12036] LOG: worker process: logical replication worker for subscription 16413 (PID 12064) exited with exit code 1
2021-11-10 10:49:15.155 CET [12126] LOG: attempt to send keep alive message
This message look like strange because no admin command have been executed during data load.
I did not find any error related to the timeout.
The message coming from the modification made with the patch comes back all the time: attempt to send keep alive message. But there is no "sent keep alive message".
Why logical replication worker exit when physical replication is configured?
I am also not sure why that happened; it may be due to max_worker_processes reaching its limit. This can happen because it seems you have configured both the publisher and the subscriber in the same cluster. Tang, did you also see the same problem?
BTW, why are you bringing a physical standby configuration into the test? In your original setup, where you observed the problem, were there physical standbys?
--
With Regards,
Amit Kapila.