Walsender timeouts and large transactions

Started by Petr Jelinekalmost 9 years ago30 messageshackers

petr@2ndquadrant.com

almost 9 years ago

Hi,

We have had issue with walsender timeout when used with logical decoding
and the transaction is taking long time to be decoded (because it
contains many changes)

I was looking today at the walsender code and realized that it's because
if the network and downstream are fast enough, we'll always take fast
path in WalSndWriteData which does not do reply or keepalive processing
and is only reached once the transaction has finished by other code. So
paradoxically we die of timeout because everything was fast enough to
never fall back to slow code path.

I propose we only use fast path if the last processed reply is not older
than half of walsender timeout, if it is then we'll force the slow code
path to process the replies again. This is similar logic that we use to
determine if to send keepalive message. I also added CHECK_INTERRUPRS
call to fast code path because otherwise walsender might ignore them for
too long on large transactions.

Thoughts?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 9 years ago

In reply to: Petr Jelinek (#1)

Re: Walsender timeouts and large transactions

At Thu, 25 May 2017 17:52:50 +0200, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote in <e082a56a-fd95-a250-3bae-0fff93832510@2ndquadrant.com>

Hi,

We have had issue with walsender timeout when used with logical decoding
and the transaction is taking long time to be decoded (because it
contains many changes)

I was looking today at the walsender code and realized that it's because
if the network and downstream are fast enough, we'll always take fast
path in WalSndWriteData which does not do reply or keepalive processing
and is only reached once the transaction has finished by other code. So
paradoxically we die of timeout because everything was fast enough to
never fall back to slow code path.

I propose we only use fast path if the last processed reply is not older
than half of walsender timeout, if it is then we'll force the slow code
path to process the replies again. This is similar logic that we use to
determine if to send keepalive message. I also added CHECK_INTERRUPRS
call to fast code path because otherwise walsender might ignore them for
too long on large transactions.

Thoughts?

+ TimestampTz now = GetCurrentTimestamp();

I think it is not recommended to read the current time too
frequently, especially within a loop that hates slowness. (I
suppose that a loop that can fill up a send queue falls into that
category.) If you don't mind a certain amount of additional
complexity for eliminating the possible slowdown by the check,
timeout would be usable. Attached patch does almost the same
thing with your patch but without busy time check.

What do you think about this?

# I saw that SIGQUIT doens't work for active publisher, which I
# think mention in another thread.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Petr Jelinek

petr@2ndquadrant.com

almost 9 years ago

In reply to: Kyotaro Horiguchi (#2)

Re: Walsender timeouts and large transactions

On 30/05/17 11:02, Kyotaro HORIGUCHI wrote:

At Thu, 25 May 2017 17:52:50 +0200, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote in <e082a56a-fd95-a250-3bae-0fff93832510@2ndquadrant.com>

Hi,

We have had issue with walsender timeout when used with logical decoding
and the transaction is taking long time to be decoded (because it
contains many changes)

I was looking today at the walsender code and realized that it's because
if the network and downstream are fast enough, we'll always take fast
path in WalSndWriteData which does not do reply or keepalive processing
and is only reached once the transaction has finished by other code. So
paradoxically we die of timeout because everything was fast enough to
never fall back to slow code path.

I propose we only use fast path if the last processed reply is not older
than half of walsender timeout, if it is then we'll force the slow code
path to process the replies again. This is similar logic that we use to
determine if to send keepalive message. I also added CHECK_INTERRUPRS
call to fast code path because otherwise walsender might ignore them for
too long on large transactions.

Thoughts?

+ TimestampTz now = GetCurrentTimestamp();

I think it is not recommended to read the current time too
frequently, especially within a loop that hates slowness. (I
suppose that a loop that can fill up a send queue falls into that

Yeah that was my main worry for the patch as well, although given that
the loop does tuple processing it might not be very noticeable.

category.) If you don't mind a certain amount of additional
complexity for eliminating the possible slowdown by the check,
timeout would be usable. Attached patch does almost the same
thing with your patch but without busy time check.

What do you think about this?

I think we could do it that way.

# I saw that SIGQUIT doens't work for active publisher, which I
# think mention in another thread.

Ah missed that email I guess, but that's the missing CHECK_INTERRUPTS();
in the fast-path which btw your updated patch is missing as well.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Petr Jelinek

petr@2ndquadrant.com

almost 9 years ago

In reply to: Petr Jelinek (#3)

Re: Walsender timeouts and large transactions

On 30/05/17 15:44, Petr Jelinek wrote:

On 30/05/17 11:02, Kyotaro HORIGUCHI wrote:

+ TimestampTz now = GetCurrentTimestamp();

I think it is not recommended to read the current time too
frequently, especially within a loop that hates slowness. (I
suppose that a loop that can fill up a send queue falls into that

Yeah that was my main worry for the patch as well, although given that
the loop does tuple processing it might not be very noticeable.

I realized we actually call GetCurrentTimestamp() there anyway (for the
pq_sendint64). So I just modified the patch to use the now variable
there instead which means there are no additional GetCurrentTimestamp()
calls compared to state before patch now.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Sokolov Yura

funny.falcon@postgrespro.ru

over 8 years ago

In reply to: Petr Jelinek (#4)

Re: Walsender timeouts and large transactions

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

There is no check for (last_reply_timestamp <= 0 || wal_sender_timeout <= 0) as in other places
(in WalSndKeepaliveIfNecessary for example).

I don't think moving update of 'now' down to end of loop body is correct:
there are calls to ProcessConfigFile with SyncRepInitConfig, ProcessRepliesIfAny that can
last non-negligible time. It could lead to over sleeping due to larger computed sleeptime.
Though I could be mistaken.

I'm not sure about moving `if (!pg_is_send_pending())` in a body loop after WalSndKeepaliveIfNecessary.
Is it necessary? But it looks harmless at least.

Could patch be reduced to check after first `if (!pg_is_sendpending())` ? like:

	if (!pq_is_send_pending())
-		return;
+	{
+		if (last_reply_timestamp <= 0 || wal_sender_timeout <= 0)
+		{
+			CHECK_FOR_INTERRUPTS();
+			return;
+		}
+		if (now <= TimestampTzPlusMilliseconds(last_reply_timestamp, wal_sender_timeout / 2))
+			return;
+	}

If not, what problem prevents?

The new status of this patch is: Waiting on Author

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Petr Jelinek

petr@2ndquadrant.com

over 8 years ago

In reply to: Sokolov Yura (#5)

Re: Walsender timeouts and large transactions

On 02/08/17 19:35, Yura Sokolov wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

There is no check for (last_reply_timestamp <= 0 || wal_sender_timeout <= 0) as in other places
(in WalSndKeepaliveIfNecessary for example).

I don't think moving update of 'now' down to end of loop body is correct:
there are calls to ProcessConfigFile with SyncRepInitConfig, ProcessRepliesIfAny that can
last non-negligible time. It could lead to over sleeping due to larger computed sleeptime.
Though I could be mistaken.

I'm not sure about moving `if (!pg_is_send_pending())` in a body loop after WalSndKeepaliveIfNecessary.
Is it necessary? But it looks harmless at least.

We also need to do actual timeout handing so that the timeout is not
deferred to the end of the transaction (Which is why I moved `if
(!pg_is_send_pending())` under WalSndCheckTimeOut() and
WalSndKeepaliveIfNecessary() calls).

Could patch be reduced to check after first `if (!pg_is_sendpending())` ? like:

if (!pq_is_send_pending())
-		return;
+	{
+		if (last_reply_timestamp <= 0 || wal_sender_timeout <= 0)
+		{
+			CHECK_FOR_INTERRUPTS();
+			return;
+		}
+		if (now <= TimestampTzPlusMilliseconds(last_reply_timestamp, wal_sender_timeout / 2))
+			return;
+	}

If not, what problem prevents?

We should do CHECK_FOR_INTERRUPTS() independently of pq_is_send_pending
so that it's possible to stop walsender while it's processing large
transaction.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Craig Ringer

craig@2ndquadrant.com

over 8 years ago

In reply to: Petr Jelinek (#6)

Re: Walsender timeouts and large transactions

On 9 August 2017 at 21:23, Petr Jelinek <petr.jelinek@2ndquadrant.com>
wrote:

On 02/08/17 19:35, Yura Sokolov wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

There is no check for (last_reply_timestamp <= 0 || wal_sender_timeout

<= 0) as in other places

(in WalSndKeepaliveIfNecessary for example).

I don't think moving update of 'now' down to end of loop body is correct:
there are calls to ProcessConfigFile with SyncRepInitConfig,

ProcessRepliesIfAny that can

last non-negligible time. It could lead to over sleeping due to larger

computed sleeptime.

Though I could be mistaken.

I'm not sure about moving `if (!pg_is_send_pending())` in a body loop

after WalSndKeepaliveIfNecessary.

Is it necessary? But it looks harmless at least.

We also need to do actual timeout handing so that the timeout is not
deferred to the end of the transaction (Which is why I moved `if
(!pg_is_send_pending())` under WalSndCheckTimeOut() and
WalSndKeepaliveIfNecessary() calls).

Could patch be reduced to check after first `if (!pg_is_sendpending())`

? like:
if (!pq_is_send_pending())
-             return;
+     {
+             if (last_reply_timestamp <= 0 || wal_sender_timeout <= 0)
+             {
+                     CHECK_FOR_INTERRUPTS();
+                     return;
+             }
+             if (now <= TimestampTzPlusMilliseconds(last_reply_timestamp,
wal_sender_timeout / 2))

+ return;
+ }

If not, what problem prevents?

We should do CHECK_FOR_INTERRUPTS() independently of pq_is_send_pending
so that it's possible to stop walsender while it's processing large
transaction.

Is there any chance of getting this bugfix into Pg 10?

We've just cut back branches, so it'd be a sensible time.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Sokolov Yura

funny.falcon@postgrespro.ru

over 8 years ago

In reply to: Petr Jelinek (#6)

Re: Walsender timeouts and large transactions

On 2017-08-09 16:23, Petr Jelinek wrote:

On 02/08/17 19:35, Yura Sokolov wrote:

The following review has been posted through the commitfest
application:
make installcheck-world: tested, passed
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

There is no check for (last_reply_timestamp <= 0 || wal_sender_timeout
<= 0) as in other places
(in WalSndKeepaliveIfNecessary for example).

I don't think moving update of 'now' down to end of loop body is
correct:
there are calls to ProcessConfigFile with SyncRepInitConfig,
ProcessRepliesIfAny that can
last non-negligible time. It could lead to over sleeping due to larger
computed sleeptime.
Though I could be mistaken.

I'm not sure about moving `if (!pg_is_send_pending())` in a body loop
after WalSndKeepaliveIfNecessary.
Is it necessary? But it looks harmless at least.

We also need to do actual timeout handing so that the timeout is not
deferred to the end of the transaction (Which is why I moved `if
(!pg_is_send_pending())` under WalSndCheckTimeOut() and
WalSndKeepaliveIfNecessary() calls).

If standby really stalled, then it will not read from socket, and then
`pg_is_sendpending` eventually will return false, and timeout will be
checked.
If standby doesn't stall, then `last_reply_timestamp` will be updated in
`ProcessRepliesIfAny`, and so timeout will not be triggered.
Do I understand correctly, or I missed something?

Could patch be reduced to check after first `if
(!pg_is_sendpending())` ? like:
if (!pq_is_send_pending())
-		return;
+	{
+		if (last_reply_timestamp <= 0 || wal_sender_timeout <= 0)
+		{
+			CHECK_FOR_INTERRUPTS();
+			return;
+		}
+		if (now <= TimestampTzPlusMilliseconds(last_reply_timestamp, 
wal_sender_timeout / 2))
+			return;
+	}
If not, what problem prevents?
We should do CHECK_FOR_INTERRUPTS() independently of pq_is_send_pending
so that it's possible to stop walsender while it's processing large
transaction.

In this case CHECK_FOR_INTERRUPTS will be called after
wal_sender_timeout/2.
(This diff is for first appearance of `pq_is_send_pending` in a
function).

If it should be called more often, then patch could be simplier:

  	if (!pq_is_send_pending())
-		return;
+	{
+		CHECK_FOR_INTERRUPTS();
+		if (last_reply_timestamp <= 0 || wal_sender_timeout <= 0 ||
+				now <= TimestampTzPlusMilliseconds(last_reply_timestamp, 
wal_sender_timeout / 2))
+			return;
+	}

(Still, I could be mistaken, it is just suggestion).

Is it hard to add test for case this patch fixes?

With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Sokolov Yura

funny.falcon@postgrespro.ru

over 8 years ago

In reply to: Sokolov Yura (#8)

Re: Walsender timeouts and large transactions

On 2017-05-25 17:52 Petr Jelinek wrote:

Hi,

We have had issue with walsender timeout when used with logical
decoding and the transaction is taking long time to be decoded
(because it contains many changes)

I was looking today at the walsender code and realized that it's
because if the network and downstream are fast enough, we'll always
take fast path in WalSndWriteData which does not do reply or
keepalive processing and is only reached once the transaction has
finished by other code. So paradoxically we die of timeout because
everything was fast enough to never fall back to slow code path.

I propose we only use fast path if the last processed reply is not
older than half of walsender timeout, if it is then we'll force the
slow code path to process the replies again. This is similar logic
that we use to determine if to send keepalive message. I also added
CHECK_INTERRUPRS call to fast code path because otherwise walsender
might ignore them for too long on large transactions.

Thoughts?

On 2017-08-10 14:20 Sokolov Yura wrote:

On 2017-08-09 16:23, Petr Jelinek wrote:

On 02/08/17 19:35, Yura Sokolov wrote:

The following review has been posted through the commitfest
application:
make installcheck-world: tested, passed
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

There is no check for (last_reply_timestamp <= 0 ||
wal_sender_timeout <= 0) as in other places
(in WalSndKeepaliveIfNecessary for example).

I don't think moving update of 'now' down to end of loop body is
correct:
there are calls to ProcessConfigFile with SyncRepInitConfig,
ProcessRepliesIfAny that can
last non-negligible time. It could lead to over sleeping due to
larger computed sleeptime.
Though I could be mistaken.

I'm not sure about moving `if (!pg_is_send_pending())` in a body
loop after WalSndKeepaliveIfNecessary.
Is it necessary? But it looks harmless at least.

We also need to do actual timeout handing so that the timeout is not
deferred to the end of the transaction (Which is why I moved `if
(!pg_is_send_pending())` under WalSndCheckTimeOut() and
WalSndKeepaliveIfNecessary() calls).

If standby really stalled, then it will not read from socket, and then
`pg_is_sendpending` eventually will return false, and timeout will be
checked.
If standby doesn't stall, then `last_reply_timestamp` will be updated
in `ProcessRepliesIfAny`, and so timeout will not be triggered.
Do I understand correctly, or I missed something?
Could patch be reduced to check after first `if
(!pg_is_sendpending())` ? like:
if (!pq_is_send_pending())
-		return;
+	{
+		if (last_reply_timestamp <= 0 ||
wal_sender_timeout <= 0)
+		{
+			CHECK_FOR_INTERRUPTS();
+			return;
+		}
+		if (now <=
TimestampTzPlusMilliseconds(last_reply_timestamp,
wal_sender_timeout / 2))
+			return;
+	}
If not, what problem prevents?
We should do CHECK_FOR_INTERRUPTS() independently of
pq_is_send_pending so that it's possible to stop walsender while
it's processing large transaction.
In this case CHECK_FOR_INTERRUPTS will be called after
wal_sender_timeout/2.
(This diff is for first appearance of `pq_is_send_pending` in a
function).

If it should be called more often, then patch could be simplier:
if (!pq_is_send_pending())
-		return;
+	{
+		CHECK_FOR_INTERRUPTS();
+		if (last_reply_timestamp <= 0 || wal_sender_timeout
<= 0 ||
+				now <=
TimestampTzPlusMilliseconds(last_reply_timestamp,
wal_sender_timeout / 2))
+			return;
+	}
(Still, I could be mistaken, it is just suggestion).

Is it hard to add test for case this patch fixes?

With regards,

Tom, Robert,

I believe this bug have to be fixed in Pg10, and I don't wonna
be that guy who prevents it from being fixed.
What should/could I do?
( https://commitfest.postgresql.org/14/1151/ )

--
With regards,
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Sokolov Yura

funny.falcon@postgrespro.ru

over 8 years ago

In reply to: Sokolov Yura (#9)

Re: Walsender timeouts and large transactions

I've changed to "need review" to gain more attention from other.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 8 years ago

In reply to: Sokolov Yura (#10)

Re: Walsender timeouts and large transactions

Hello,

At Wed, 06 Sep 2017 13:46:16 +0000, Yura Sokolov <funny.falcon@postgrespro.ru> wrote in <20170906134616.18925.88390.pgcf@coridan.postgresql.org>

I've changed to "need review" to gain more attention from other.

I understand that the problem here is too fast network prohibits
walsender from sending replies.

In physical replication, WAL records are sent as soon as written
and the timeout is handled in the topmost loop in WalSndLoop. In
logical replication, data is sent at once at commit time in most
cases. So it can take a long time in ReorderBufferCommit without
returning to WalSndLoop (or even XLogSendLogical).

One problem here is that WalSndWriteData waits for the arrival of
the next *WAL record* in the slow-ptah because it is called by
cues of ReorderBuffer* functions (mainly *Commit) irrelevantly to
WAL insertion. This is I think the root cause of this problem.

On the other hand, it ought to take a sleep when network is
stalled, in other words, data to send remains after a flush. We
don't have a means to signal when the socket queue gets a new
room for another bytes. However, I suppose that such slow network
allows us to sleep several or several tens of milliseconds. Or,
if we could know how many bytes ps_flush_if_writable() pushed,
it's enough to wait only when the function returns pushing
nothing.

As the result, I think that the functions should be modified as
the following.

- Forcing slow-path if time elapses a half of a ping period is
right. (GetCurrentTimestamp is anyway requried.)

- The slow-path should not sleep waiting Latch. It should just
pg_usleep() for maybe 1-10ms.

- We should go to the fast path just after keepalive or response
message has been sent. In other words, the "if (now <" block
should be in the "for (;;)" loop. This avoids needless runs on
the slow-path.

It would be refactorable as the following.

prepare for the send buffer;

for (;;)
{
now = GetCurrentTimeStamp();
if (now < )...
{
fast-path
}
else
{
slow-path
}
return if finished
sleep for 1ms?
}

What do you think about this?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12

Sokolov Yura

funny.falcon@postgrespro.ru

over 8 years ago

In reply to: Kyotaro Horiguchi (#11)

Re: Walsender timeouts and large transactions

On 2017-09-12 11:28, Kyotaro HORIGUCHI wrote:

Hello,

At Wed, 06 Sep 2017 13:46:16 +0000, Yura Sokolov
<funny.falcon@postgrespro.ru> wrote in
<20170906134616.18925.88390.pgcf@coridan.postgresql.org>

I've changed to "need review" to gain more attention from other.

I understand that the problem here is too fast network prohibits
walsender from sending replies.

In physical replication, WAL records are sent as soon as written
and the timeout is handled in the topmost loop in WalSndLoop. In
logical replication, data is sent at once at commit time in most
cases. So it can take a long time in ReorderBufferCommit without
returning to WalSndLoop (or even XLogSendLogical).

One problem here is that WalSndWriteData waits for the arrival of
the next *WAL record* in the slow-ptah because it is called by
cues of ReorderBuffer* functions (mainly *Commit) irrelevantly to
WAL insertion. This is I think the root cause of this problem.

On the other hand, it ought to take a sleep when network is
stalled, in other words, data to send remains after a flush. We
don't have a means to signal when the socket queue gets a new
room for another bytes. However, I suppose that such slow network
allows us to sleep several or several tens of milliseconds. Or,
if we could know how many bytes ps_flush_if_writable() pushed,
it's enough to wait only when the function returns pushing
nothing.

As the result, I think that the functions should be modified as
the following.

- Forcing slow-path if time elapses a half of a ping period is
right. (GetCurrentTimestamp is anyway requried.)

- The slow-path should not sleep waiting Latch. It should just
pg_usleep() for maybe 1-10ms.

- We should go to the fast path just after keepalive or response
message has been sent. In other words, the "if (now <" block
should be in the "for (;;)" loop. This avoids needless runs on
the slow-path.

It would be refactorable as the following.

prepare for the send buffer;

for (;;)
{
now = GetCurrentTimeStamp();
if (now < )...
{
fast-path
}
else
{
slow-path
}
return if finished
sleep for 1ms?
}

What do you think about this?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Good day, Petr, Kyotaro

I've created failing test for issue (0001-Add-failing-test...) .
It tests insertion of 20000 rows with 10ms wal_sender_timeout
(it fails in WalSndWriteData on master) and then deletion of
those rows with 1ms wal_sender_timeout (it fails in WalSndLoop).

Both Peter's patch and my simplified suggestion didn't pass the
test. I didn't checked Kyotaro's suggestion, though, cause I
didn't understand it well.

I've made patch that passes the test (0002-Fix-walsender...) .
(I've used Petr's commit message. Don't you mind, Petr?)

In WalSndWriteData it adds CHECK_FOR_INTERRUPTS to fastpath and
falls through to slow path after half of wal_sender_timeout as
were discussed.
In a slow path, it just skips fast exit on `!pq_is_send_pending()`
and check for timeout for the first loop iteration. And it sets
sleeptime to 1ms even if timeout were reached. It gives a chance
to receiver's response to arrive.

In WalSndLoop it also skips check for timeout first iteration after
send_data were called, and also sleeps at least 1 ms.

I'm not sure about correctness of my patch. Given test exists,
you may suggest better solutions, or improve this solution.

I'll set commitfest topic status to 'Need review' assuming
my patch could be reviewed.

--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

#13

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 8 years ago

In reply to: Sokolov Yura (#12)

Re: Walsender timeouts and large transactions

Hello,

At Wed, 27 Sep 2017 14:28:37 +0300, Sokolov Yura <funny.falcon@postgrespro.ru> wrote in <90bb67da7131e6186b50897c4b0f0ec3@postgrespro.ru>

On 2017-09-12 11:28, Kyotaro HORIGUCHI wrote:

Hello,
At Wed, 06 Sep 2017 13:46:16 +0000, Yura Sokolov
<funny.falcon@postgrespro.ru> wrote in
<20170906134616.18925.88390.pgcf@coridan.postgresql.org>
As the result, I think that the functions should be modified as
the following.
- Forcing slow-path if time elapses a half of a ping period is
right. (GetCurrentTimestamp is anyway requried.)
- The slow-path should not sleep waiting Latch. It should just
pg_usleep() for maybe 1-10ms.
- We should go to the fast path just after keepalive or response
message has been sent. In other words, the "if (now <" block
should be in the "for (;;)" loop. This avoids needless runs on
the slow-path.
It would be refactorable as the following.
prepare for the send buffer;
for (;;)
{
now = GetCurrentTimeStamp();
if (now < )...
{
fast-path
}
else
{
slow-path
}
return if finished
sleep for 1ms?
}
What do you think about this?
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center

Good day, Petr, Kyotaro

I've created failing test for issue (0001-Add-failing-test...) .
It tests insertion of 20000 rows with 10ms wal_sender_timeout
(it fails in WalSndWriteData on master) and then deletion of
those rows with 1ms wal_sender_timeout (it fails in WalSndLoop).

Both Peter's patch and my simplified suggestion didn't pass the
test. I didn't checked Kyotaro's suggestion, though, cause I
didn't understand it well.

Mmm. The test seems broken. wal_sender_timeout = 10ms with
wal_receiver_status_interval=10s immediately causes a
timeout. Avoiding the timeout is just breaking the sane code.

wal_sender_timeout = 3s and wal_receiver_status_interval=1s
effectively causes the problem with about 1000000 lines of (int)
insertion on UNIX socket connection, on my poor box.

The original complain here came from the fact that
WalSndWriteData skips processing of replies for a long time on a
fast network. However Petr's patch fixed the problem, I pointed
that just letting the function take the slow path leads to
another problem, that is, waiting for new WAL records can result
in a unwanted pause in the slow path.

Combining the solutions for the two problem is my proposal sited
above. The sentence seems in quite bad style but the attached
file is the concorete patch of that.

Any thoughts?

regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center

#14

Sokolov Yura

funny.falcon@postgrespro.ru

over 8 years ago

In reply to: Kyotaro Horiguchi (#13)

Re: Walsender timeouts and large transactions

Good day, Kyoutaro

On 2017-09-29 11:26, Kyotaro HORIGUCHI wrote:

Hello,

At Wed, 27 Sep 2017 14:28:37 +0300, Sokolov Yura
<funny.falcon@postgrespro.ru> wrote in
<90bb67da7131e6186b50897c4b0f0ec3@postgrespro.ru>

On 2017-09-12 11:28, Kyotaro HORIGUCHI wrote:

Hello,
At Wed, 06 Sep 2017 13:46:16 +0000, Yura Sokolov
<funny.falcon@postgrespro.ru> wrote in
<20170906134616.18925.88390.pgcf@coridan.postgresql.org>
As the result, I think that the functions should be modified as
the following.
- Forcing slow-path if time elapses a half of a ping period is
right. (GetCurrentTimestamp is anyway requried.)
- The slow-path should not sleep waiting Latch. It should just
pg_usleep() for maybe 1-10ms.
- We should go to the fast path just after keepalive or response
message has been sent. In other words, the "if (now <" block
should be in the "for (;;)" loop. This avoids needless runs on
the slow-path.
It would be refactorable as the following.
prepare for the send buffer;
for (;;)
{
now = GetCurrentTimeStamp();
if (now < )...
{
fast-path
}
else
{
slow-path
}
return if finished
sleep for 1ms?
}
What do you think about this?
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center

Good day, Petr, Kyotaro

I've created failing test for issue (0001-Add-failing-test...) .
It tests insertion of 20000 rows with 10ms wal_sender_timeout
(it fails in WalSndWriteData on master) and then deletion of
those rows with 1ms wal_sender_timeout (it fails in WalSndLoop).

Both Peter's patch and my simplified suggestion didn't pass the
test. I didn't checked Kyotaro's suggestion, though, cause I
didn't understand it well.

Mmm. The test seems broken. wal_sender_timeout = 10ms with
wal_receiver_status_interval=10s immediately causes a
timeout. Avoiding the timeout is just breaking the sane code.

I think you're not right. `wal_receiver_status_interval` is just
for status update, not for replies. Before I made my patch version,
I've added logging to every `now` and `last_reply_timestamp` during
test run. `last_reply_timestamp` definitely updates more often than
once in 10s. So, `wal_receiver_status_interval = 10s` has nothing
in common with receiver's replies, as I see.

(btw, logging slows down sender enough to "fix" test :-) )

And my patch doesn't avoid timeout check, so it doesn't break
anything. It just delays timeout on 1ms. Given, it is unpractical
to set wal_sender_timeout less than 50ms in production, this 1ms
increase in timeout check is negligible.

And I've checked just now that my patch passes test even with
wal_receiver_status_interval = 10s.

wal_sender_timeout = 3s and wal_receiver_status_interval=1s
effectively causes the problem with about 1000000 lines of (int)
insertion on UNIX socket connection, on my poor box.

I don't want to make test to lasts so long and generate so many data.
That is why I used such small timeouts for tests.

The original complain here came from the fact that
WalSndWriteData skips processing of replies for a long time on a
fast network. However Petr's patch fixed the problem, I pointed
that just letting the function take the slow path leads to
another problem, that is, waiting for new WAL records can result
in a unwanted pause in the slow path.

Combining the solutions for the two problem is my proposal sited
above. The sentence seems in quite bad style but the attached
file is the concorete patch of that.

Test is failing if there is "short quit" after `!pq_is_send_pending()`,
so I doubt your patch will pass the test.
And you've change calculated sleep time with sane waiting on all
insteresting events (using WaitLatchOrSocket) to semi-busy loop.
It at least could affect throughput.

And why did you remove `SetLatch(MyLatch)` in the end of function?
Probably this change is correct, but not obvious.

Any thoughts?

It certainly could be my test and my patch is wrong. But my point
is that test should be written first. Especially for such difficult
case. Without test it is impossible to say does our patches fix
something. And it is impossible to say if patch does something
wrong. And impossible to say if patch fixes this problem but
introduce new problem.

Please, write test for your remarks. If you think, my patch breaks
something, write test for the case my patch did broke. If you think
my test is wrong, write your test that is more correct.

Without tests it will be just bird's hubbub.

regards,

regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 8 years ago

In reply to: Sokolov Yura (#14)

Re: Walsender timeouts and large transactions

Hello Sokolov.

At Fri, 29 Sep 2017 15:19:23 +0300, Sokolov Yura <funny.falcon@postgrespro.ru> wrote in <d076dae18b437be89c787a854034f3f2@postgrespro.ru>

I don't want to make test to lasts so long and generate so many data.
That is why I used such small timeouts for tests.

I understand your point, but still *I* think such a short timeout
is out of expectation by design. (But it can be set.)

Does anyone have opinions on this?

Test is failing if there is "short quit" after
`!pq_is_send_pending()`,
so I doubt your patch will pass the test.

It is because I think that the test "should" fail since the
timeout is out of expected range. I (and perhaps also Petr) is
thinking that the problem is just that a large transaction causes
a timeout with an ordinary timeout. My test case is based on the
assumption.

Your test is for a timeout during replication-startup with
extremely short timeout. This may be a different problem to
discuss, but perhaps better to be solved together.

I'd like to have opnions from others on this point.

And you've change calculated sleep time with sane waiting on all
insteresting events (using WaitLatchOrSocket) to semi-busy loop.
It at least could affect throughput.

Uggh! I misunderstood there. It wais for writing socket so the
sleep is wrong and WaitLatchOrSocket is right.

After all, I put +1 for Petr's latest patch. Sorry for my
carelessness.

And why did you remove `SetLatch(MyLatch)` in the end of function?
Probably this change is correct, but not obvious.

Latch is needless there if it waited a fixed duration, but if it
waits writefd events there, also latch should be waited.

Any thoughts?

It certainly could be my test and my patch is wrong. But my point
is that test should be written first. Especially for such difficult
case. Without test it is impossible to say does our patches fix
something. And it is impossible to say if patch does something
wrong. And impossible to say if patch fixes this problem but
introduce new problem.

Please, write test for your remarks. If you think, my patch breaks
something, write test for the case my patch did broke. If you think
my test is wrong, write your test that is more correct.

Without tests it will be just bird's hubbub.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16

Petr Jelinek

petr@2ndquadrant.com

over 8 years ago

In reply to: Kyotaro Horiguchi (#15)

Re: Walsender timeouts and large transactions

Hi,

sorry for the delay but I didn't have much time in past weeks to follow
this thread.

On 02/10/17 05:44, Kyotaro HORIGUCHI wrote:

Hello Sokolov.

At Fri, 29 Sep 2017 15:19:23 +0300, Sokolov Yura <funny.falcon@postgrespro.ru> wrote in <d076dae18b437be89c787a854034f3f2@postgrespro.ru>

I don't want to make test to lasts so long and generate so many data.
That is why I used such small timeouts for tests.

I understand your point, but still *I* think such a short timeout
is out of expectation by design. (But it can be set.)

Does anyone have opinions on this?

Yes, it's not practically useful to have such small wal_sender_timeout
given that the main purpose of that is to detect network issues.

Test is failing if there is "short quit" after
`!pq_is_send_pending()`,
so I doubt your patch will pass the test.

It is because I think that the test "should" fail since the
timeout is out of expected range. I (and perhaps also Petr) is
thinking that the problem is just that a large transaction causes
a timeout with an ordinary timeout. My test case is based on the
assumption.

Your test is for a timeout during replication-startup with
extremely short timeout. This may be a different problem to
discuss, but perhaps better to be solved together.

I'd like to have opnions from others on this point.

I think it's different problem and because of what I wrote above it does
not seem to me like something we should spend out time on trying to fix.

Uggh! I misunderstood there. It wais for writing socket so the
sleep is wrong and WaitLatchOrSocket is right.

After all, I put +1 for Petr's latest patch. Sorry for my
carelessness.

Great, I attached version 3 which is just rebased on current master and
also does not move the GetCurrentTimestamp() call because the concern
about skewing the timestamp during config reload (and also network flush
as I realized later) is valid.

It's rather hard to test all the scenarios that this patch fixes in
automated fashion without generating a lot of wal or adding sleeps to
the processing. That's why I didn't produce usable TAP test.

Since it seems like some of my reasoning is unclear I will try to
describe it once more.

The main problem we have is that unless we call the
ProcessRepliesIfAny() before the wal_sender_timeout expires we'll die
because of timeout eventually. The current coding will skip that call if
there is a long transaction being processed (if network is fast enough).
This is what the first part (first 2 hunks) of the patch solves. There
is also issue that while this is happening the walsender ignores signals
so it's impossible to stop it which is why I added the
CHECK_FOR_INTERRUPTS to the fast path.

The second problem is that if we solved just the first one, then if
downstream (and again network) is fast enough, the ProcessRepliesIfAny()
will not do anything useful because downstream is not going to send any
response while the network buffer contains any data. This is caused by
the fact that we normally code the receiver side to receive until there
is something and only send reply when there is a "pause" in the stream.
To get around this problem we also need to make sure that we send
WalSndKeepaliveIfNecessary() periodically and that will not happen on
fast network unless we do the second part of the patch (3rd hunk), ie,
move the pq_is_send_pending() after the keepalive handling.

This code is specific to logical decoding walsender interface so it only
happens for logical decoding/replication (which means it should be
backported all the way to 9.4). The physical one

These two issues happen quite normally in the wild as all we need is big
data load in single transaction, or update of large part of an table or
something similar for this to happen with default settings.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17

Robert Haas

robertmhaas@gmail.com

over 8 years ago

In reply to: Petr Jelinek (#16)

Re: Walsender timeouts and large transactions

On Wed, Nov 1, 2017 at 8:19 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

sorry for the delay but I didn't have much time in past weeks to follow
this thread.

+    TimestampTz now = GetCurrentTimestamp();
+
     /* output previously gathered data in a CopyData packet */
     pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);

     /*
      * Fill the send timestamp last, so that it is taken as late as possible.
      * This is somewhat ugly, but the protocol is set as it's already used for
      * several releases by streaming physical replication.
      */
     resetStringInfo(&tmpbuf);
-    pq_sendint64(&tmpbuf, GetCurrentTimestamp());
+    pq_sendint64(&tmpbuf, now);
     memcpy(&ctx->out->data[1 + sizeof(int64) + sizeof(int64)],
            tmpbuf.data, sizeof(int64));

This change falsifies the comments. Maybe initialize now just after
resetSetringInfo() is done.

-    /* fast path */
-    /* Try to flush pending output to the client */
-    if (pq_flush_if_writable() != 0)
-        WalSndShutdown();
+    /* Try taking fast path unless we get too close to walsender timeout. */
+    if (now < TimestampTzPlusMilliseconds(last_reply_timestamp,
+                                          wal_sender_timeout / 2))
+    {
+        CHECK_FOR_INTERRUPTS();

-    if (!pq_is_send_pending())
-        return;
+        /* Try to flush pending output to the client */
+        if (pq_flush_if_writable() != 0)
+            WalSndShutdown();
+
+        if (!pq_is_send_pending())
+            return;
+    }

I think it's only the if (!pq_is_send_pending()) return; that needs to
be conditional here, isn't it? The pq_flush_if_writable() can be done
unconditionally.

Other than that this looks like a reasonable change to me, but I'm not
an expert on this code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18

Petr Jelinek

petr@2ndquadrant.com

over 8 years ago

In reply to: Robert Haas (#17)

Re: Walsender timeouts and large transactions

Hi,

thanks for checking.

On 02/11/17 10:00, Robert Haas wrote:

On Wed, Nov 1, 2017 at 8:19 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

sorry for the delay but I didn't have much time in past weeks to follow
this thread.
+    TimestampTz now = GetCurrentTimestamp();
+
/* output previously gathered data in a CopyData packet */
pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);
/*
* Fill the send timestamp last, so that it is taken as late as possible.
* This is somewhat ugly, but the protocol is set as it's already used for
* several releases by streaming physical replication.
*/
resetStringInfo(&tmpbuf);
-    pq_sendint64(&tmpbuf, GetCurrentTimestamp());
+    pq_sendint64(&tmpbuf, now);
memcpy(&ctx->out->data[1 + sizeof(int64) + sizeof(int64)],
tmpbuf.data, sizeof(int64));
This change falsifies the comments. Maybe initialize now just after
resetSetringInfo() is done.

Eh, right, I can do that.

-    /* fast path */
-    /* Try to flush pending output to the client */
-    if (pq_flush_if_writable() != 0)
-        WalSndShutdown();
+    /* Try taking fast path unless we get too close to walsender timeout. */
+    if (now < TimestampTzPlusMilliseconds(last_reply_timestamp,
+                                          wal_sender_timeout / 2))
+    {
+        CHECK_FOR_INTERRUPTS();

-    if (!pq_is_send_pending())
-        return;
+        /* Try to flush pending output to the client */
+        if (pq_flush_if_writable() != 0)
+            WalSndShutdown();
+
+        if (!pq_is_send_pending())
+            return;
+    }

I think it's only the if (!pq_is_send_pending()) return; that needs to
be conditional here, isn't it? The pq_flush_if_writable() can be done
unconditionally.

Well, even the CHECK_FOR_INTERRUPTS() can be called unconditionally yes.
It just seems like it's needless call as we'll call both in for loop
anyway if we take the "slow" path. I admit it's not exactly big win
though. If you think it would improve readability I can move it.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 8 years ago

In reply to: Petr Jelinek (#18)

Re: [HACKERS] Walsender timeouts and large transactions

Hello,

At Fri, 3 Nov 2017 15:54:09 +0100, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote in <e2939d26-f5cb-6581-0ca3-a1b0556ed729@2ndquadrant.com>

Hi,

thanks for checking.

On 02/11/17 10:00, Robert Haas wrote:
On Wed, Nov 1, 2017 at 8:19 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

sorry for the delay but I didn't have much time in past weeks to follow
this thread.
+    TimestampTz now = GetCurrentTimestamp();
+
/* output previously gathered data in a CopyData packet */
pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);
/*
* Fill the send timestamp last, so that it is taken as late as possible.
* This is somewhat ugly, but the protocol is set as it's already used for
* several releases by streaming physical replication.
*/
resetStringInfo(&tmpbuf);
-    pq_sendint64(&tmpbuf, GetCurrentTimestamp());
+    pq_sendint64(&tmpbuf, now);
memcpy(&ctx->out->data[1 + sizeof(int64) + sizeof(int64)],
tmpbuf.data, sizeof(int64));
This change falsifies the comments. Maybe initialize now just after
resetSetringInfo() is done.
Eh, right, I can do that.
-    /* fast path */
-    /* Try to flush pending output to the client */
-    if (pq_flush_if_writable() != 0)
-        WalSndShutdown();
+    /* Try taking fast path unless we get too close to walsender timeout. */
+    if (now < TimestampTzPlusMilliseconds(last_reply_timestamp,
+                                          wal_sender_timeout / 2))
+    {
+        CHECK_FOR_INTERRUPTS();
-    if (!pq_is_send_pending())
-        return;
+        /* Try to flush pending output to the client */
+        if (pq_flush_if_writable() != 0)
+            WalSndShutdown();
+
+        if (!pq_is_send_pending())
+            return;
+    }
I think it's only the if (!pq_is_send_pending()) return; that needs to
be conditional here, isn't it? The pq_flush_if_writable() can be done
unconditionally.
Well, even the CHECK_FOR_INTERRUPTS() can be called unconditionally yes.
It just seems like it's needless call as we'll call both in for loop
anyway if we take the "slow" path. I admit it's not exactly big win
though. If you think it would improve readability I can move it.

I think this is the last message in this thread so I changed the
status of the CF entry to "Waiting for Author".

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#20

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 8 years ago

In reply to: Kyotaro Horiguchi (#19)

Re: [HACKERS] Walsender timeouts and large transactions

Hello,

At Fri, 17 Nov 2017 13:24:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20171117.132408.85564852.horiguchi.kyotaro@lab.ntt.co.jp>

Well, even the CHECK_FOR_INTERRUPTS() can be called unconditionally yes.
It just seems like it's needless call as we'll call both in for loop
anyway if we take the "slow" path. I admit it's not exactly big win
though. If you think it would improve readability I can move it.

I think this is the last message in this thread so I changed the
status of the CF entry to "Waiting for Author".

Hmm. Somehow the last patch and Robert's comment * which is the
base of the patch * has been reached me in reverse order.

I found that the patch is the latest one and will look this
soon. Sorry for the ignorance.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#21

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 8 years ago

In reply to: Petr Jelinek (#18)

#22