WIP: long transactions on hot standby feedback replica / proof of concept

Started by Kartyshov Ivan over 8 years ago · 45 messages · pgsql-hackers
#1 Kartyshov Ivan
i.kartyshov@postgrespro.ru

Our clients complain about this issue and therefore I want to raise the
discussion and suggest several solutions to this problem:

I. Why does PG use FATAL when ERROR would be enough to release the lock
that raises the lock conflict?
"if (RecoveryConflictPending && DoingCommandRead)"

II. Do we really need to truncate the table on the hot standby at
exactly the same time as the truncate occurs on the master?

In my case the conflict happens when autovacuum truncates table tbl1 on
the master while a backend on the replica is performing a long
transaction involving the same table tbl1. This happens because truncate
takes an AccessExclusiveLock. To tackle this issue we have several options:

1. We can postpone the truncate on the master until all the replicas
have finished their transactions (in this case, feedback requests to the
master should be sent frequently)
Patch 1
vacuum_lazy_truncate.patch

2. Maybe there is an option to somehow not send the AccessExclusiveLock
and not truncate the table on the replica right away. We could wait a
little and then try to truncate tbl1 on the replica again.

Here is a patch that makes the replica skip a truncate WAL record if
some transaction using the same table is already running on the replica.
It also forces the master to send a truncate WAL record with the actual
table length to the replica every time autovacuum starts.
Patch 2
standby_truncate_skip_v1.patch

In this case a transaction that has been running for several hours won't
be interrupted because of the truncate. Even if for some reason we
haven't truncated table tbl1 right away, nothing terrible will happen:
the next truncate WAL record will reduce the file to the actual length
(if some inserts or updates have been performed on the master).

If you have any ideas or concerns about the suggested solutions, I'll be
glad to hear them.
Thanks!

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachments:

vacuum_lazy_truncate.patch (text/x-diff, +2 −1)
standby_truncate_skip_v1.patch (text/x-diff, +49 −3)
#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: Kartyshov Ivan (#1)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

On Mon, Sep 4, 2017 at 4:34 PM, <i.kartyshov@postgrespro.ru> wrote:

Our clients complain about this issue and therefore I want to raise the
discussion and suggest several solutions to this problem:

I. Why does PG use Fatal when Error is enough to release lock that rises
lock conflict?
"If (RecoveryConflictPending && DoingCommandRead)"

II. Do we really need to truncate the table on hot standby exactly at the
same time when truncate on master occurs?

In my case conflict happens when the autovacuum truncates table tbl1 on
master while backend on replica is performing a long transaction involving
the same table tbl1. This happens because truncate takes an
AccessExclusiveLock. To tackle this issue we have several options:

1. We can postpone the truncate on the master until all the replicas have
finished their transactions (in this case, feedback requests to the master
should be sent frequently)
Patch 1
vacuum_lazy_truncate.patch

2. Maybe there is an option somehow not to send AccessExclusiveLock and not
to truncate table on the replica right away. We could try to wait a little
and truncate tbl1 on replica again.

Can max_standby_streaming_delay help in this situation (point number - 2)?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Alex Ignatov
a.ignatov@postgrespro.ru
In reply to: Amit Kapila (#2)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Amit Kapila
Sent: Monday, September 4, 2017 3:32 PM
To: i.kartyshov@postgrespro.ru
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept

On Mon, Sep 4, 2017 at 4:34 PM, <i.kartyshov@postgrespro.ru> wrote:

Our clients complain about this issue and therefore I want to raise
the discussion and suggest several solutions to this problem:

I. Why does PG use Fatal when Error is enough to release lock that
rises lock conflict?
"If (RecoveryConflictPending && DoingCommandRead)"

II. Do we really need to truncate the table on hot standby exactly at
the same time when truncate on master occurs?

In my case conflict happens when the autovacuum truncates table tbl1
on master while backend on replica is performing a long transaction
involving the same table tbl1. This happens because truncate takes an
AccessExclusiveLock. To tackle this issue we have several options:

1. We can postpone the truncate on the master until all the replicas
have finished their transactions (in this case, feedback requests to
the master should be sent frequently) Patch 1
vacuum_lazy_truncate.patch

2. Maybe there is an option somehow not to send AccessExclusiveLock
and not to truncate table on the replica right away. We could try to
wait a little and truncate tbl1 on replica again.

Can max_standby_streaming_delay help in this situation (point number - 2)?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Hello!
In this situation this parameter (max_standby_streaming_delay) won't
help if you have subsequent statements on the standby. As the
documentation says (and as our experience confirms): "Thus, if one query
has resulted in significant delay, subsequent conflicting queries will
have much less grace time until the standby server has caught up again."
And you never know how to set this parameter, except to -1, which means
the standby may be delayed indefinitely.

In our experience, it is only autovacuum on the master that takes the
AccessExclusiveLock that raises this FATAL message on the standby. Once
that AccessExclusiveLock reaches the standby and
max_standby_streaming_delay > -1, sooner or later you will definitely
get this FATAL during recovery.
With this patch we try to avoid applying the AccessExclusiveLock on the
standby while it has an active statement.
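For reference, the settings under discussion look like this in postgresql.conf on the standby (values are purely illustrative, not a recommendation):

```ini
# postgresql.conf on the standby -- illustrative values only
hot_standby_feedback = on            # report the standby's xmin to the master
max_standby_streaming_delay = 30s    # cancel conflicting queries after 30s of accumulated delay
# max_standby_streaming_delay = -1   # never cancel; the standby may lag indefinitely
```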

--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#4 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alex Ignatov (#3)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

On Mon, Sep 4, 2017 at 4:51 PM, Alex Ignatov <a.ignatov@postgrespro.ru>
wrote:

In this situation this parameter (max_standby_streaming_delay) wont help
because if you have subsequent statement on standby (following info is from
documentation and from our experience ): Thus, if one query has resulted in
significant delay, subsequent conflicting queries will have much less grace
time until the standby server has caught up again. And you never now how to
set this parameter exept to -1 which mean up to infinity delayed standby.

On our experience only autovacuum on master took AccesExclusiveLock that
raise this Fatal message on standby. After this AccessExclusive reached
standby and max_standby_streaming_delay > -1 you definitely sooner or
later get this Fatal on recovery .
With this patch we try to get rid of AccessEclusiveLock applied on standby
while we have active statement on it.

+1.
Aborting a read-only query on the standby because of vacuum on the
master seems rather unexpected behaviour for the user when hot standby
feedback is on. I think we should work on this problem for v11.
Backpatching the documentation for previous releases would be good too.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#5 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Kartyshov Ivan (#1)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

On Mon, Sep 4, 2017 at 2:04 PM, <i.kartyshov@postgrespro.ru> wrote:

Our clients complain about this issue and therefore I want to raise the
discussion and suggest several solutions to this problem:

I. Why does PG use Fatal when Error is enough to release lock that rises
lock conflict?
"If (RecoveryConflictPending && DoingCommandRead)"

II. Do we really need to truncate the table on hot standby exactly at the
same time when truncate on master occurs?

In my case conflict happens when the autovacuum truncates table tbl1 on
master while backend on replica is performing a long transaction involving
the same table tbl1. This happens because truncate takes an
AccessExclusiveLock. To tackle this issue we have several options:

1. We can postpone the truncate on the master until all the replicas have
finished their transactions (in this case, feedback requests to the master
should be sent frequently)
Patch 1
vacuum_lazy_truncate.patch

I have the following comments on this patch:
1) You shouldn't use ">=" to compare xids. You should use the
TransactionIdFollowsOrEquals() function, which handles transaction id
wraparound correctly.
2) As I understand it, this patch makes heap truncation happen only when
no concurrent transactions are running on either the master or replicas
with hot_standby_feedback enabled. For a busy system, that would
literally mean "never do heap truncation after vacuum".

2. Maybe there is an option somehow not to send AccessExclusiveLock and not
to truncate table on the replica right away. We could try to wait a little
and truncate tbl1 on replica again.

Here is a patch that makes replica skip truncate WAL record if some
transaction using the same table is already running on replica. And it also
forces master to send truncate_wal to replica with actual table length
every time autovacuum starts.
Patch 2
standby_truncate_skip_v1.patch

In this case the transaction which is running for several hours won’t be
interrupted because of truncate. Even if for some reason we haven’t
truncated this table tbl1 right away, nothing terrible will happen. The
next truncate wal record will reduce the file size by the actual length (if
some inserts or updates have been performed on master).

Since you wrote this patch at my request, let me clarify its idea a
little more.

Currently, lazy_truncate_heap() is very careful about getting the
AccessExclusiveLock needed to truncate the heap. It wants neither to
block other transactions nor to wait for this lock too long. If the lock
isn't acquired after some number of tries, lazy_truncate_heap() gives
up. However, once lazy_truncate_heap() acquires the AccessExclusiveLock
on the master, the lock is replayed on the replicas, where it conflicts
with read-only transactions, if any. That leads to unexpected behaviour
when hot_standby_feedback is on.

The idea of fixing that is to apply the same logic for getting the
AccessExclusiveLock on the standby as on the master: give up if somebody
has been holding a conflicting lock for long enough. That allows the
standby to have more free pages at the end of the heap than the master
has. That shouldn't lead to errors, since those pages are empty, but it
does let the standby waste some extra space. To mitigate this
deficiency, we generate XLOG_SMGR_TRUNCATE records more frequently: at
the end of every vacuum. Therefore, even if the standby ends up with
some extra empty pages, that will be corrected during further vacuum
cycles.

@@ -397,7 +425,6 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
                  read_rate, write_rate);
 appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));
-
 ereport(LOG,
         (errmsg_internal("%s", buf.data)));
 pfree(buf.data);
@@ -1820,7 +1847,7 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 vacrelstats->pages_removed += old_rel_pages - new_rel_pages;
 vacrelstats->rel_pages = new_rel_pages;

- ereport(elevel,
+ ereport(WARNING,
         (errmsg("\"%s\": truncated %u to %u pages",
                 RelationGetRelationName(onerel),
                 old_rel_pages, new_rel_pages),
Ivan, what are these changes supposed to do?

diff --git a/src/backend/storage/lmgr/lock.c
b/src/backend/storage/lmgr/lock.c
index 2b26173..f04888e 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -816,7 +816,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
XLogStandbyInfoActive())
{
LogAccessExclusiveLockPrepare();
- log_lock = true;
+// log_lock = true;
}

/*

In any case, it's certainly a bad idea to disable replaying
AccessExclusiveLock *every time*; we should skip taking the
AccessExclusiveLock only while writing the XLOG_SMGR_TRUNCATE record.
Besides that, this code violates our coding style.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Alexander Korotkov (#5)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

On Tue, Sep 5, 2017 at 3:03 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Mon, Sep 4, 2017 at 2:04 PM, <i.kartyshov@postgrespro.ru> wrote:

Our clients complain about this issue and therefore I want to raise the
discussion and suggest several solutions to this problem:

I. Why does PG use Fatal when Error is enough to release lock that rises
lock conflict?
"If (RecoveryConflictPending && DoingCommandRead)"

II. Do we really need to truncate the table on hot standby exactly at the
same time when truncate on master occurs?

In my case conflict happens when the autovacuum truncates table tbl1 on
master while backend on replica is performing a long transaction involving
the same table tbl1. This happens because truncate takes an
AccessExclusiveLock. To tackle this issue we have several options:

1. We can postpone the truncate on the master until all the replicas have
finished their transactions (in this case, feedback requests to the master
should be sent frequently)
Patch 1
vacuum_lazy_truncate.patch

I've following comments on this patch:
1) You shouldn't use ">=" to compare xids. You should use
TransactionIdFollowsOrEquals() function which handles transaction id
wraparound correctly.
2) As I understood, this patch makes heap truncate only when no concurrent
transactions are running on both master and replicas with
hot_standby_feedback enabled. For busy system, it would be literally "never
do heap truncate after vacuum".

2. Maybe there is an option somehow not to send AccessExclusiveLock and
not to truncate table on the replica right away. We could try to wait a
little and truncate tbl1 on replica again.

Here is a patch that makes replica skip truncate WAL record if some
transaction using the same table is already running on replica. And it also
forces master to send truncate_wal to replica with actual table length every
time autovacuum starts.
Patch 2
standby_truncate_skip_v1.patch

In this case the transaction which is running for several hours won’t be
interrupted because of truncate. Even if for some reason we haven’t
truncated this table tbl1 right away, nothing terrible will happen. The next
truncate wal record will reduce the file size by the actual length (if some
inserts or updates have been performed on master).

Since you wrote this patch under on my request, let me clarify its idea
little more.

Currently, lazy_truncate_heap() is very careful on getting
AccessExclusiveLock to truncate heap. It doesn't want either block other
transaction or wait for this lock too long. If lock isn't acquired after
some number of tries lazy_truncate_heap() gives up. However, once
lazy_truncate_heap() acquires AccessExclusiveLock is acquired on master, it
would be replayed on replicas where it will conflict with read-only
transactions if any. That leads to unexpected behaviour when
hot_standby_feedback is on.

Idea of fixing that is to apply same logic of getting AccessExclusiveLock on
standby as on master: give up if somebody is holding conflicting lock for
long enough. That allows standby to have more free pages at the end of heap
than master have. That shouldn't lead to errors since those pages are
empty, but allows standby to waste some extra space. In order to mitigate
this deficiency, we're generating XLOG_SMGR_TRUNCATE records more frequent:
on finish of every vacuum. Therefore, if even standby gets some extra space
of empty pages, it would be corrected during further vacuum cycles.

I think one deficiency of this solution is that it will keep on
generating extra WAL even if standby doesn't need it (as standby has
successfully truncated the relation). I don't know if we can just get
away by saying that an additional WAL record per vacuum cycle is
harmless especially when that doesn't serve any purpose (like for the
cases when standby is always able to truncate the heap on first WAL
record). Am I missing some part of solution which avoids it?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#7 Simon Riggs
simon@2ndQuadrant.com
In reply to: Alexander Korotkov (#4)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

On 4 September 2017 at 09:06, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

Aborting read-only query on standby because of vacuum on master seems to be
rather unexpected behaviour for user when hot standby feedback is on.
I think we should work on this problem for v11.

Happy to help. My suggestion would be to discuss a potential theory of
operation and then code a patch.

As Alexander says, simply skipping truncation if standby is busy isn't
a great plan.

If we defer an action on standby replay, when and by whom will it be
applied? What happens if the standby is shut down or crashes while an
action is pending?

Perhaps altering the way truncation requires an AccessExclusiveLock
would help workloads on both master and standby? If a Relation knew it
had a pending truncation, then scans wouldn't need to go past newsize.
Once older lockers have gone we could simply truncate without the lock.
It would need a few pushups but seems less scary than trying to make
pending standby actions work well enough to commit.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#8 Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Alexander Korotkov (#5)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

Alexander Korotkov wrote on 2017-09-05 00:33:

I've following comments on this patch:
1) You shouldn't use ">=" to compare xids. You should use
TransactionIdFollowsOrEquals() function which handles transaction id
wraparound correctly.

I fixed it, and in addition I added a GUC parameter that can turn off
this behaviour. This parameter can be set in the postgresql.conf
configuration file or with the ALTER SYSTEM command. For the changes to
take effect, call the pg_reload_conf() function or reload the database
server.
Changes are in vacuum_lazy_truncate_v3.patch

2) As I understood, this patch makes heap truncate only when no
concurrent transactions are running on both master and replicas with
hot_standby_feedback enabled. For busy system, it would be literally
"never do heap truncate after vacuum".

Yes. The difficulty is that if we have a lot of replicas and their
requests overlap each other, then on a loaded system the truncation will
never occur.

Perhaps, it's certainly bad idea to disable replaying
AccessExclusiveLock *every time*, we should skip taking
AccessExclusiveLock only while writing XLOG_SMGR_TRUNCATE record.
Besides that, this code violates out coding style.

The patch (standby_truncate_skip_v2.patch) fixes this behaviour: it now
skips writing XLOG_STANDBY_LOCK records (with the AccessExclusiveLock)
only while autovacuum is truncating the table.

Amit Kapila wrote on 2017-09-05 12:52:

I think one deficiency of this solution is that it will keep on
generating extra WAL even if standby doesn't need it (as standby has
successfully truncated the relation). I don't know if we can just get
away by saying that an additional WAL record per vacuum cycle is
harmless especially when that doesn't serve any purpose (like for the
cases when standby is always able to truncate the heap on first WAL
record). Am I missing some part of solution which avoids it?

You are right: this implementation generates extra WAL, and such cases
can occur. If you have any solutions to avoid it, I'll be glad to hear
them.

Simon Riggs wrote on 2017-09-05 14:44:

If we defer an action on standby replay, when and who will we apply
it? What happens if the standby is shutdown or crashes while an action
is pending.

Perhaps altering the way truncation requires an AccessExclusiveLock
would help workloads on both master and standby? If a Relation knew it
had a pending truncation then scans wouldn't need to go past newsize.
Once older lockers have gone we could simply truncate without the
lock. Would need a few pushups but seems less scary then trying to
make pending standby actions work well enough to commit.

Yes, it was the first idea that came to my mind, and the most difficult
one. I think that in a few days I'll finish a patch implementing these
ideas.

------
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

standby_truncate_skip_v2.patch (text/x-diff, +58 −3)
vacuum_lazy_truncate_v3.patch (text/x-diff, +23 −2)
#9 Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Kartyshov Ivan (#1)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

Hello. I made some bugfixes and rewrote the patch.

Simon Riggs wrote on 2017-09-05 14:44:

As Alexander says, simply skipping truncation if standby is busy isn't
a great plan.

If we defer an action on standby replay, when and who will we apply
it? What happens if the standby is shutdown or crashes while an action
is pending.

After a crash the standby server will go back to the nearest checkpoint,
replay all its WAL to reach a consistent state, and then open for
read-only load. (All collected WAL will be replayed in the right order;
I mean both the WAL records that truncate the table and the WAL records
that make the table grow.)

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

standby_truncate_skip_v4.patch (text/x-diff, +86 −45)
#10 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Kartyshov Ivan (#9)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

On Tue, Oct 24, 2017 at 10:56 AM, Ivan Kartyshov
<i.kartyshov@postgrespro.ru> wrote:

Hello. I made some bugfixes and rewrite the patch.

Simon Riggs wrote on 2017-09-05 14:44:

As Alexander says, simply skipping truncation if standby is busy isn't
a great plan.

If we defer an action on standby replay, when and who will we apply
it? What happens if the standby is shutdown or crashes while an action
is pending.

After crash standby server will go to the nearest checkout and will replay
all his wal`s to reach consistent state and then open for read-only load.
(all collected wal`s will be replayed in right way, I meant wal`s that
truncates table and wal`s that make table grow)

--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -49,6 +49,8 @@
#include "utils/ps_status.h"
#include "utils/resowner_private.h"

+#include <execinfo.h>
+#include <dlfcn.h>

/* This configuration variable is used to set the lock table size */
int max_locks_per_xact; /* set by guc.c */

Why do you need these includes?

+ FreeFakeRelcacheEntry(rel);

+ } else
+ {
+ XLogFlush(lsn);
+ }

This code violates our coding style. According to it, "else" shouldn't
share its line with braces.

Also, patch definitely needs some general comment explaining what's going
on.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#11 Robert Haas
robertmhaas@gmail.com
In reply to: Kartyshov Ivan (#9)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

On Tue, Oct 24, 2017 at 1:26 PM, Ivan Kartyshov
<i.kartyshov@postgrespro.ru> wrote:

Hello. I made some bugfixes and rewrite the patch.

I don't think it's a good idea to deliberately leave the state of the
standby different from the state of the master on the theory that it
won't matter. I feel like that's something that's likely to come back
to bite us.

Giving LockAcquireExtended() an argument that causes some
AccessExclusiveLocks not to be logged at all also seems pretty ugly,
especially because there are no comments whatsoever explaining the
rationale.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#12 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Robert Haas (#11)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

On Mon, Oct 30, 2017 at 10:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 24, 2017 at 1:26 PM, Ivan Kartyshov
<i.kartyshov@postgrespro.ru> wrote:

Hello. I made some bugfixes and rewrite the patch.

I don't think it's a good idea to deliberately leave the state of the
standby different from the state of the master on the theory that it
won't matter. I feel like that's something that's likely to come back
to bite us.

I agree with Robert. What happens if we intentionally don't apply the
truncation WAL and then switch over? If we insert a tuple on the new
master server into a block that had been truncated on the old master,
won't the WAL apply on the new standby fail? I guess there are such
corner cases causing failures of WAL replay after switch-over.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#13 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Masahiko Sawada (#12)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

On Tue, Oct 31, 2017 at 5:16 AM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

On Mon, Oct 30, 2017 at 10:16 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Tue, Oct 24, 2017 at 1:26 PM, Ivan Kartyshov
<i.kartyshov@postgrespro.ru> wrote:

Hello. I made some bugfixes and rewrite the patch.

I don't think it's a good idea to deliberately leave the state of the
standby different from the state of the master on the theory that it
won't matter. I feel like that's something that's likely to come back
to bite us.

I agree with Robert. What happen if we intentionally don't apply the
truncation WAL and switched over? If we insert a tuple on the new
master server to a block that has been truncated on the old master,
the WAL apply on the new standby will fail? I guess there are such
corner cases causing failures of WAL replay after switch-over.

Yes, that looks dangerous. One approach to cope with that could be
teaching the heap redo function to handle such situations. But I expect
this approach would be criticized for bad design, and I understand the
fairness of that criticism.

However, from the user's point of view, the current behaviour of
hot_standby_feedback is just broken, because it both increases bloat and
doesn't guarantee that a read-only query on the standby won't be
cancelled because of vacuum. Therefore, we should be looking for a
solution: if one approach isn't good enough, then we should look for
another approach.

I can propose the following alternative approach: teach read-only
queries on the hot standby to tolerate concurrent relation truncation.
Then, when a non-existent heap page is accessed on the hot standby, we
know that it was deleted by a concurrent truncation and can assume it to
be empty. Any thoughts?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#14 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Alexander Korotkov (#13)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

On Tue, Oct 31, 2017 at 6:17 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Tue, Oct 31, 2017 at 5:16 AM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

On Mon, Oct 30, 2017 at 10:16 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Tue, Oct 24, 2017 at 1:26 PM, Ivan Kartyshov
<i.kartyshov@postgrespro.ru> wrote:

Hello. I made some bugfixes and rewrite the patch.

I don't think it's a good idea to deliberately leave the state of the
standby different from the state of the master on the theory that it
won't matter. I feel like that's something that's likely to come back
to bite us.

I agree with Robert. What happen if we intentionally don't apply the
truncation WAL and switched over? If we insert a tuple on the new
master server to a block that has been truncated on the old master,
the WAL apply on the new standby will fail? I guess there are such
corner cases causing failures of WAL replay after switch-over.

Yes, that looks dangerous. One approach to cope that could be teaching heap
redo function to handle such these situations. But I expect this approach
to be criticized for bad design. And I understand fairness of this
criticism.

However, from user prospective of view, current behavior of
hot_standby_feedback is just broken, because it both increases bloat and
doesn't guarantee that read-only query on standby wouldn't be cancelled
because of vacuum. Therefore, we should be looking for solution: if one
approach isn't good enough, then we should look for another approach.

I can propose following alternative approach: teach read-only queries on hot
standby to tolerate concurrent relation truncation. Therefore, when
non-existent heap page is accessed on hot standby, we can know that it was
deleted by concurrent truncation and should be assumed to be empty. Any
thoughts?

Did you also mean that applying the WAL for AccessExclusiveLock would
always be skipped on standby servers, to allow scans to access the
relation?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#15 Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#13)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

On Tue, Oct 31, 2017 at 2:47 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

However, from a user's point of view, the current behavior of
hot_standby_feedback is just broken, because it both increases bloat and
doesn't guarantee that a read-only query on the standby won't be cancelled
because of vacuum. Therefore, we should be looking for a solution: if one
approach isn't good enough, then we should look for another approach.

I can propose the following alternative approach: teach read-only queries on
hot standby to tolerate concurrent relation truncation. Then, when a
non-existent heap page is accessed on hot standby, we know that it was
deleted by a concurrent truncation and should be assumed to be empty. Any
thoughts?

Sounds like it might break MVCC?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#16Alexander Korotkov
aekorotkov@gmail.com
In reply to: Robert Haas (#15)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

On Thu, Nov 2, 2017 at 5:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 31, 2017 at 2:47 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

However, from a user's point of view, the current behavior of
hot_standby_feedback is just broken, because it both increases bloat and
doesn't guarantee that a read-only query on the standby won't be cancelled
because of vacuum. Therefore, we should be looking for a solution: if one
approach isn't good enough, then we should look for another approach.

I can propose the following alternative approach: teach read-only queries on
hot standby to tolerate concurrent relation truncation. Then, when a
non-existent heap page is accessed on hot standby, we know that it was
deleted by a concurrent truncation and should be assumed to be empty. Any
thoughts?

Sounds like it might break MVCC?

I don't see why it would be broken. VACUUM truncates the heap only when the
tail to be truncated is already empty. By the time the truncate WAL record is
applied, the previous WAL records deleting all the tuples in that tail have
already been applied. Thus, even if MVCC were broken and we missed some tuples
after heap truncation, they would have been gone before the heap truncation
anyway.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#17Alexander Korotkov
aekorotkov@gmail.com
In reply to: Masahiko Sawada (#14)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

On Wed, Nov 1, 2017 at 5:55 AM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

On Tue, Oct 31, 2017 at 6:17 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Tue, Oct 31, 2017 at 5:16 AM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

On Mon, Oct 30, 2017 at 10:16 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Tue, Oct 24, 2017 at 1:26 PM, Ivan Kartyshov
<i.kartyshov@postgrespro.ru> wrote:

Hello. I made some bugfixes and rewrote the patch.

I don't think it's a good idea to deliberately leave the state of the
standby different from the state of the master on the theory that it
won't matter. I feel like that's something that's likely to come back
to bite us.

I agree with Robert. What happens if we intentionally don't apply the
truncation WAL and then switch over? If we insert a tuple on the new
master into a block that has been truncated on the old master, will the
WAL apply on the new standby fail? I guess there are such corner cases
causing failures of WAL replay after switch-over.

Yes, that looks dangerous. One approach to cope with that could be teaching
the heap redo function to handle such situations. But I expect this approach
to be criticized for bad design, and I understand the fairness of that
criticism.

However, from a user's point of view, the current behavior of
hot_standby_feedback is just broken, because it both increases bloat and
doesn't guarantee that a read-only query on the standby won't be cancelled
because of vacuum. Therefore, we should be looking for a solution: if one
approach isn't good enough, then we should look for another approach.

I can propose the following alternative approach: teach read-only queries on
hot standby to tolerate concurrent relation truncation. Then, when a
non-existent heap page is accessed on hot standby, we know that it was
deleted by a concurrent truncation and should be assumed to be empty. Any
thoughts?

Did you also mean that applying the WAL for AccessExclusiveLock is always
skipped on standby servers, to allow scans to access the relation?

Definitely not every AccessExclusiveLock WAL record should be skipped, only
those emitted during heap truncation. There are other cases where
AccessExclusiveLock WAL records are emitted, for instance during DDL
operations. But I'd like to focus on the AccessExclusiveLock WAL records
caused by VACUUM for now. It's understandable to users that DDL might cancel
a read-only query on the standby: if you're running a long report query, then
you should hold off on your DDL. But VACUUM is a different story; it runs
automatically when you do normal DML queries.

AccessExclusiveLock WAL records emitted by VACUUM could either not be emitted
at all, or be somehow distinguished and skipped on the standby. I haven't
thought through that level of detail yet.
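The "distinguish and skip" option could be sketched like this (the flag field
is the assumed addition; the stock xl_standby_lock record carries only
xid/dbOid/relOid, and the function name is hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical extended WAL record: the from_vacuum_truncate flag is the
 * proposed addition; the other fields mirror xl_standby_lock. */
typedef struct
{
    uint32_t xid;
    uint32_t dbOid;
    uint32_t relOid;
    bool     from_vacuum_truncate;
} StandbyLockRecord;

/* Skip taking the AccessExclusiveLock on the standby only for records that
 * VACUUM's truncation emitted, and only when feedback is on; DDL-emitted
 * lock records are still replayed and may still cancel queries. */
static bool
skip_lock_replay(const StandbyLockRecord *rec, bool hot_standby_feedback)
{
    return hot_standby_feedback && rec->from_vacuum_truncate;
}
```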

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#18Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#16)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

On Fri, Nov 3, 2017 at 5:57 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Thu, Nov 2, 2017 at 5:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I can propose the following alternative approach: teach read-only queries on
hot standby to tolerate concurrent relation truncation. Then, when a
non-existent heap page is accessed on hot standby, we know that it was
deleted by a concurrent truncation and should be assumed to be empty. Any
thoughts?

Sounds like it might break MVCC?

I don't see why it would be broken. VACUUM truncates the heap only when the
tail to be truncated is already empty. By the time the truncate WAL record is
applied, the previous WAL records deleting all the tuples in that tail have
already been applied. Thus, even if MVCC were broken and we missed some tuples
after heap truncation, they would have been gone before the heap truncation
anyway.

Ah - I was thinking of the TRUNCATE command, rather than truncation by
VACUUM. Your argument makes sense, although the case where the
relation is truncated and later re-extended might need some thought.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#19Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Alexander Korotkov (#17)
Re: WIP: long transactions on hot standby feedback replica / proof of concept

On Sat, Nov 4, 2017 at 7:04 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Wed, Nov 1, 2017 at 5:55 AM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

On Tue, Oct 31, 2017 at 6:17 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Tue, Oct 31, 2017 at 5:16 AM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

On Mon, Oct 30, 2017 at 10:16 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Tue, Oct 24, 2017 at 1:26 PM, Ivan Kartyshov
<i.kartyshov@postgrespro.ru> wrote:

Hello. I made some bugfixes and rewrote the patch.

I don't think it's a good idea to deliberately leave the state of the
standby different from the state of the master on the theory that it
won't matter. I feel like that's something that's likely to come back
to bite us.

I agree with Robert. What happens if we intentionally don't apply the
truncation WAL and then switch over? If we insert a tuple on the new
master into a block that has been truncated on the old master, will the
WAL apply on the new standby fail? I guess there are such corner cases
causing failures of WAL replay after switch-over.

Yes, that looks dangerous. One approach to cope with that could be teaching
the heap redo function to handle such situations. But I expect this approach
to be criticized for bad design, and I understand the fairness of that
criticism.

However, from a user's point of view, the current behavior of
hot_standby_feedback is just broken, because it both increases bloat and
doesn't guarantee that a read-only query on the standby won't be cancelled
because of vacuum. Therefore, we should be looking for a solution: if one
approach isn't good enough, then we should look for another approach.

I can propose the following alternative approach: teach read-only queries on
hot standby to tolerate concurrent relation truncation. Then, when a
non-existent heap page is accessed on hot standby, we know that it was
deleted by a concurrent truncation and should be assumed to be empty. Any
thoughts?

Did you also mean that applying the WAL for AccessExclusiveLock is always
skipped on standby servers, to allow scans to access the relation?

Definitely not every AccessExclusiveLock WAL record should be skipped, only
those emitted during heap truncation. There are other cases where
AccessExclusiveLock WAL records are emitted, for instance during DDL
operations. But I'd like to focus on the AccessExclusiveLock WAL records
caused by VACUUM for now. It's understandable to users that DDL might cancel
a read-only query on the standby: if you're running a long report query, then
you should hold off on your DDL. But VACUUM is a different story; it runs
automatically when you do normal DML queries.

AccessExclusiveLock WAL records emitted by VACUUM could either not be emitted
at all, or be somehow distinguished and skipped on the standby. I haven't
thought through that level of detail yet.

Understood. My concern is that, so far, we cannot distinguish whether an
AccessExclusiveLock WAL record came from vacuum truncation or from some other
operation requiring AccessExclusiveLock. So I agree that we need to invent a
way to do that.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#20Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Masahiko Sawada (#19)
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept

Thank you for your valuable comments. I've made a few adjustments.

The main goal of my changes is to let long read-only transactions run on the
replica when hot_standby_feedback is turned on.

Patch 1 (hsfeedback_av_truncate.patch) stops ResolveRecoveryConflictWithLock
from occurring on the replica after autovacuum lazily truncates the heap on
the master, cutting some pages at the end. When hot_standby_feedback is on,
we know that autovacuum does not remove anything that could still be needed
on the standby, so there is no need to raise any ResolveRecoveryConflict*.

1) Add an isautovacuum flag to xl_standby_locks and xl_smgr_truncate, which
tells us that autovacuum generated them.

2) When autovacuum decides to trim the table (using lazy_truncate_heap),
it takes AccessExclusiveLock and sends this lock to the replica, but the
replica should ignore the AccessExclusiveLock if hot_standby_feedback=on.

3) When the autovacuum truncate WAL message is replayed on the replica, it
takes an ExclusiveLock on the table instead, so as not to interfere with
read-only requests.
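Steps 1-3 could be sketched as the following replay-side decision (a toy
model, not the patch's actual code; the function name is hypothetical, and
the lock-mode numbering follows the ordering in PostgreSQL's lock.h):

```c
#include <assert.h>
#include <stdbool.h>

/* Lock modes, numbered as in PostgreSQL's lock.h ordering. */
typedef enum
{
    NoLockNeeded = 0,
    ExclusiveLock = 7,
    AccessExclusiveLock = 8
} LockMode;

/* Decide what the standby does with an incoming lock-related record,
 * given the isautovacuum flag from step 1. */
static LockMode
standby_lock_mode(bool isautovacuum, bool is_truncate_record,
                  bool hot_standby_feedback)
{
    if (isautovacuum && !is_truncate_record && hot_standby_feedback)
        return NoLockNeeded;        /* step 2: ignore autovacuum's AEL */
    if (isautovacuum && is_truncate_record)
        return ExclusiveLock;       /* step 3: truncate replay takes EL */
    return AccessExclusiveLock;     /* everything else keeps AEL */
}
```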

We have two cases of resolving ResolveRecoveryConflictWithLock when the
timers (max_standby_streaming_delay and max_standby_archive_delay) have run
out: if the backend is idle in transaction (waiting for input), the backend
is sent SIGTERM; if the backend is running a query, the running transaction
is aborted.

How to test:
Make an async replica, turn on feedback, and reduce
max_standby_streaming_delay.
Make autovacuum more aggressive:
autovacuum = on
autovacuum_max_workers = 1
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_cost_delay = 0

Test 1:
Here we put load on the master and simulate a long transaction with
repeated one-second seqscans on the replica (by calling pg_sleep with a
one-second duration every 6 seconds).

REPLICA settings:
hot_standby = on
max_standby_streaming_delay = 1s
hot_standby_feedback = on

MASTER:
start
CREATE TABLE test AS (SELECT id, 1 AS value FROM generate_series(1,1) id);
pgbench -T600 -P2 -n --file=master.sql postgres
(master.sql: update test set value = value;)

REPLICA:
start
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT pg_sleep(value) FROM test;
\watch 6

--- Autovacuum truncates pages at the end
Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

On the patched version, lazy vacuum truncation passes without fatal errors.

Only occasionally does an error occur, because this test is too synthetic:
ERROR: canceling statement due to conflict with recovery
DETAIL: User was holding shared buffer pin for too long.
It comes from ResolveRecoveryConflictWithSnapshot being raised while redoing
some visibility flags; to avoid this conflict we can run test 2 or
increase max_standby_streaming_delay.

Test 2:
Here we put load on the master and simulate a long transaction on the
replica (by taking a lock on the table).

REPLICA settings:
hot_standby = on
max_standby_streaming_delay = 1s
hot_standby_feedback = on

MASTER:
start
CREATE TABLE test AS (SELECT id, 1 AS value FROM generate_series(1,1) id);
pgbench -T600 -P2 -n --file=master.sql postgres
(master.sql: update test set value = value;)

REPLICA:
start
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
LOCK TABLE test IN ACCESS SHARE MODE;
select * from test;
\watch 6

--- Autovacuum truncates pages at the end
Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

On the patched version, lazy vacuum truncation passes without fatal errors.

Test 3:
Here we put load on the master and simulate a long transaction with
repeated one-second seqscans on the replica (by calling pg_sleep with a
one-second duration every 6 seconds).

REPLICA settings:
hot_standby = on
max_standby_streaming_delay = 4s
hot_standby_feedback = on

MASTER:
start
CREATE TABLE test AS (SELECT id, 200 AS value FROM generate_series(1,1) id);
pgbench -T600 -P2 -n --file=master.sql postgres
(master.sql: update test set value = value;)

REPLICA:
start
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT pg_sleep(value) FROM test;

--- Autovacuum truncates pages at the end
Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

On the patched version, lazy vacuum truncation passes without fatal errors.

This way we can run transactions with seqscan, index scan, or bitmap scan.

Patch 2 (hsfeedback_noninvalide_xmin.patch):
When the walsender is initialized, its xmin in the PROCARRAY is set to
GetOldestXmin() in order to prevent autovacuum running on the master from
truncating the relation and removing pages that are required by the
replica. This might happen if the master's autovacuum and the replica's
query start simultaneously, before the replica has reported its xmin value.
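The idea behind patch 2 can be sketched like this (a toy model with
hypothetical names: the real code would set the walsender's PGPROC xmin from
GetOldestXmin() at startup, to be overwritten by the first feedback message):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Minimal stand-in for the walsender's PGPROC entry. */
typedef struct { TransactionId xmin; } WalSenderProc;

/* At walsender initialization, pin xmin to the oldest running xmin so that
 * autovacuum cannot remove tuples before the standby's first
 * hot_standby_feedback message arrives and replaces this value. */
static void
walsender_init_xmin(WalSenderProc *proc, TransactionId oldest_xmin)
{
    if (proc->xmin == InvalidTransactionId)
        proc->xmin = oldest_xmin;
}
```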

How to test:
Make an async replica, turn on feedback, reduce max_standby_streaming_delay,
and make autovacuum aggressive:
autovacuum = on
autovacuum_max_workers = 1
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_cost_delay = 0

Test:
Here we start the replica and begin a REPEATABLE READ transaction on the
table, then stop the replica's postmaster to prevent the walreceiver
worker from starting (on master startup) and sending the master its
transaction xmin via the hot_standby_feedback message.

MASTER:
start
CREATE TABLE test AS (SELECT id, 1 AS value FROM
generate_series(1,10000000) id);
stop

REPLICA:
start
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM test;
stop postmaster with gdb

MASTER:
start
DELETE FROM test WHERE id > 0;
wait till autovacuum deletes tuples and xmin changes

REPLICA:
release postmaster with gdb
--- Result on replica
FATAL: terminating connection due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be 
removed.

There is one feature of standby behavior that lets us allow autovacuum to
cut off pages at the end of the relation that no one else needs (because
they contain only dead and removed tuples). So if a standby seqscan or
another *scan mdread()s a page that is damaged or has been deleted, it
receives a zero page instead of aborting the query with an ERROR.

Could you give me your thoughts on these patches?

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

hsfeedback_av_truncate.patch (text/x-diff, +48 -4)
hsfeedback_noninvalide_xmin.patch (text/x-diff, +8 -0)
#21Alexander Korotkov
aekorotkov@gmail.com
In reply to: Kartyshov Ivan (#20)
#22Thomas Munro
thomas.munro@gmail.com
In reply to: Kartyshov Ivan (#20)
#23Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Alexander Korotkov (#21)
#24Alexander Korotkov
aekorotkov@gmail.com
In reply to: Kartyshov Ivan (#23)
#25Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#24)
#26Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Kartyshov Ivan (#20)
#27Alexander Korotkov
aekorotkov@gmail.com
In reply to: Masahiko Sawada (#26)
#28Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#27)
#29Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#28)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#29)
#31Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#30)
#32Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#31)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#32)
#34Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#33)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#34)
#36Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#35)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#36)
#38Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#35)
#39Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#38)
#40Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#38)
#41Alexander Korotkov
aekorotkov@gmail.com
In reply to: Robert Haas (#40)
#42Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#41)
#43Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Robert Haas (#42)
#44Andres Freund
andres@anarazel.de
In reply to: Dmitry Dolgov (#43)
#45Alexander Korotkov
aekorotkov@gmail.com
In reply to: Dmitry Dolgov (#43)