Dealing with latency to a replication slave; what to do?
Hi,
Looking for any tips on how best to maintain a replication slave which
is operating under some latency between networks - around 230ms. On a good
day/week, replication will keep up for a number of days; however, when
the link is under higher-than-average usage, replication can stay caught up
for merely minutes before falling behind again.
2018-07-24 18:46:14 GMT LOG:  database system is ready to accept read only connections
2018-07-24 18:46:15 GMT LOG:  started streaming WAL from primary at 2B/93000000 on timeline 1
2018-07-24 18:59:28 GMT LOG:  incomplete startup packet
2018-07-24 19:15:36 GMT LOG:  incomplete startup packet
2018-07-24 19:15:36 GMT LOG:  incomplete startup packet
2018-07-24 19:15:37 GMT LOG:  incomplete startup packet
As you can see above, it lasted about half an hour before falling out of
sync.
On the master, I have wal_keep_segments=128. What is happening when I see
"incomplete startup packet" - is it simply that the slave has fallen behind
and cannot catch up using the WAL segments quickly enough? I assume the
slave is using the WAL segments to replay changes, and that as long as there
are enough WAL segments to cover the period when it cannot stream properly,
it will eventually recover?
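(For a rough sense of scale - a back-of-the-envelope sketch only, assuming the default 16 MB WAL segment size; the 10 MB/s figure is purely illustrative:

```
# postgresql.conf on the master
wal_keep_segments = 128   # 128 x 16 MB = 2 GB of WAL retained for standbys

# If the master generates, say, 10 MB/s of WAL during a busy period,
# 2 GB covers only ~3.5 minutes before old segments are recycled and
# the standby can no longer catch up via streaming.
```

So whether the standby recovers depends on how much WAL is written while streaming is stalled, not just on elapsed time.)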
Hi,
On 2018-07-24 15:39:32 -0400, Rory Falloon wrote:
Looking for any tips on how best to maintain a replication slave which
is operating under some latency between networks - around 230ms. On a good
day/week, replication will keep up for a number of days; however, when
the link is under higher-than-average usage, replication can stay caught up
for merely minutes before falling behind again.

2018-07-24 18:46:14 GMT LOG:  database system is ready to accept read only connections
2018-07-24 18:46:15 GMT LOG:  started streaming WAL from primary at 2B/93000000 on timeline 1
2018-07-24 18:59:28 GMT LOG:  incomplete startup packet
2018-07-24 19:15:36 GMT LOG:  incomplete startup packet
2018-07-24 19:15:36 GMT LOG:  incomplete startup packet
2018-07-24 19:15:37 GMT LOG:  incomplete startup packet

As you can see above, it lasted about half an hour before falling out of sync.
How can we see that from the above? The "incomplete startup packet" messages
are independent of streaming replication. I think you need to show us more logs.
On the master, I have wal_keep_segments=128. What is happening when I see
"incomplete startup packet" - is it simply that the slave has fallen behind
and cannot catch up using the WAL segments quickly enough? I assume the
slave is using the WAL segments to replay changes, and that as long as there
are enough WAL segments to cover the period when it cannot stream properly,
it will eventually recover?
You might want to look into replication slots to ensure the primary
keeps the necessary segments, but not more, around. You might also want
to look at wal_compression, to reduce the bandwidth usage.
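(Both suggestions sketched out - the slot name `standby1_slot` is just a placeholder; this assumes a PostgreSQL version from the 9.4-11 era, where standby settings live in recovery.conf:

```sql
-- On the master: create a physical replication slot so WAL is retained
-- until the standby has consumed it, independent of wal_keep_segments.
SELECT pg_create_physical_replication_slot('standby1_slot');

-- In the standby's recovery.conf:
--   primary_slot_name = 'standby1_slot'

-- In postgresql.conf on the master, to shrink WAL volume over the slow link:
--   wal_compression = on    -- compresses full-page images written to WAL
```

One caveat: a slot retains WAL without bound while the standby is disconnected, so keep an eye on disk usage on the master.)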
Greetings,
Andres Freund
Hi Andres,
Regarding your first reply: I was inferring that from the fact that I saw
those messages at the same time the replication stream fell behind. What
other logs would be more pertinent to this situation?
Please don't top-post, it is not the custom on this list.
This is circular. You think it lost sync because you saw some message you
didn't recognize, and then you think the error message was related to it
losing sync because they occurred at the same time. What evidence do you
have that it has lost sync at all? From the log file you posted, it seems
the server is running fine and is just getting probed by a port scanner, or
perhaps by a monitoring tool.
If it had lost sync, you would be getting log messages about "requested WAL
segment has already been removed".
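(A quick way to measure actual lag rather than guessing from log noise - a sketch using the PostgreSQL 10+ function and column names; on 9.x the equivalents are pg_current_xlog_location()/replay_location:

```sql
-- On the master: how far behind is each standby's replay, in bytes?
SELECT application_name,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- On the standby: how stale is the last replayed transaction?
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
```

If the standby has truly lost sync, it will also disappear from pg_stat_replication on the master.)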
Cheers,
Jeff