Synchronous replay take III

Started by Thomas Munroabout 8 years ago19 messageshackers

thomas.munro@gmail.com

about 8 years ago

Hi hackers,

I was pinged off-list by a fellow -hackers denizen interested in the
synchronous replay feature and wanting a rebased patch to test. Here
it goes, just in time for a Commitfest. Please skip to the bottom of
this message for testing notes.

In previous threads[1]/messages/by-id/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com[2]/messages/by-id/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com /messages/by-id/CAEepm=1iiEzCVLD=RoBgtZSyEY1CR-Et7fRc9prCZ9MuTz3pWg@mail.gmail.com[3]/messages/by-id/CA+CSw_tz0q+FQsqh7Zx7xxF99Jm98VaAWGdEP592e7a+zkD_Mw@mail.gmail.com I called this feature proposal "causal
reads". That was a terrible name, borrowed from MySQL. While it is
probably a useful term of art, for one thing people kept reading it as
"casual", which it ain't, and more importantly this patch is only one
way to achieve read-follows-write causal consistency. Several others
are proposed or exist in forks (user managed wait-for-LSN, global
transaction manager, ...).

OVERVIEW

For writers, it works a bit like RAID mirroring: when you commit a
write transaction, it waits until the data has become visible on all
elements of the array, and if an array element is not responding fast
enough it is kicked out of the array. For readers, it's a little
different because you're connected directly to the array elements
(rather than going through a central controller), so it uses a system
of leases allowing read transactions to know instantly and whether
they are running on an element that is currently in the array and are
therefore able to service synchronous_replay transactions, or should
raise an error telling you to go and ask some other element.

This is a design choice favouring read-mostly workloads at the expense
of write transactions. Hot standbys' whole raison for existing is to
move *some* read-only workloads off the primary server. This proposal
is for users who are prepared to trade increased primary commit
latency for a guarantee about visibility on the standbys, so that
*all* read-only work could be moved to hot standbys.

The guarantee is: When two transactions tx1, tx2 are run with
synchronous_replay set to on and tx1 reports successful commit before
tx2 begins, then tx1 is guaranteed either to see tx1 or to raise a new
error 40P02 if it is run on a hot standby. I have joked that that
error means "snapshot too young". You could handle it the same way
you handle deadlocks and serialization failures: by retrying, except
in this case you might want to avoid that node for a while.

Note that this feature is concerned with transaction visibility. It
is not concerned with transaction durability. It will happily kick
all of your misbehaving or slow standbys out of the array so that you
fall back to single-node commit durability. You can express your
durability requirement (ie I must have have N copies of the data on
disk before I tell any external party about a transaction) separately,
by configuring regular synchronous replication alongside this feature.
I suspect that this feature would be most popular with people who are
already using regular synchronous replication though, because they
already tolerate higher commit latency.

STATUS

Here's a quick summary of the status of this proposal as I see it:

* Simon Riggs, as the committer most concerned with the areas this
proposal touches -- namely streaming replication and specifically
syncrep -- has not so far appeared to be convinced by the value of
this approach, and has expressed a preference for pursuing client-side
or middleware tracked LSN tokens exclusively. I am perceptive enough
to see that failing to sell the idea to Simon is probably fatal to the
proposal. The main task therefore is to show convincingly that there
is a real use case for this high-level design and its set of
trade-offs, and that it justifies its maintenance burden.

* I have tried to show that there are already many users who route
their read-only queries to hot standby databases (not just "reporting
queries"), and libraries and tools to help people do that using
heuristics like "logged in users need fresh data, so primary only" or
"this session has written in the past N minutes, so primary only".
This proposal would provide a way for those users to do something
based on a guarantee instead of such flimsy heuristics. I have tried
to show that the libraries used by Python, Ruby, Java etc to achieve
that sort of load balancing should easily be able to handle finding
read-only nodes, routing read-only queries and dealing with the new
error. I do also acknowledge that such libraries could also be used
to provide transparent read-my-writes support by tracking LSNs and
injecting wait-for-LSN directives with alternative proposals, but that
is weaker than a global reads-follow-writes guarantee and the
difference can matter.

* I have argued that token-based systems are in fact rather
complicated[4]/messages/by-id/CAEepm=0W9GmX5uSJMRXkpNEdNpc09a_OMt18XFhf8527EuGGUQ@mail.gmail.com and by no means a panacea. As usual, there are a whole
bunch of trade-offs. I suspect that this proposal AND fully
user-managed causality tokens (no middleware) are both valuable sweet
spots for a non-GTM system.

* Ants Aasma pointed out that this proposal doesn't provide a
read-follows-read guarantee. He is right, and I'm not sure to what
extent that is a problem, but I also think token-based systems can
probably only solve it with fairly high costs.

* Dmitry Dolgov reported a bug causing the replication protocol to get
corrupted on some OSs but not others[5]/messages/by-id/CAEepm=352uctNiFoN84UN4gtunbeTK-PBLouVe8i_b8ZPcJQFQ@mail.gmail.com; could be uninitialised data
or size/padding/layout thinko or other stupid problem. (Gee, it would
be nice if the wire protocol writing and reading code were in reusable
functions instead of open-coded in multiple places... the bug could
be due to that). Unfortunately I haven't managed to track it down yet
and haven't had time to get back to this in time for the Commitfest
due to other work. Given the interest expressed by a reviewer to test
this, which might result in that problem being figured out, I figured
I might as well post the rebased patch anyway, and I will also have
another look soon.

* As Andres Freund pointed out, this currently lacks tests. It should
be fairly easy to add TAP tests to exercise this code, in the style of
the existing tests for replication.

TESTING NOTES

Set up some hot standbys, put synchronous_replay_max_lag = 2s in the
primary's postgresql.conf, then set synchronous_replay = on in every
postgresql.conf or at least in every session that you want to test
with. Then generate various write workloads and observe the primary
server's log as the leases are grant and revoke, or check the status
in pg_stat_replication's replay_lag and sync_replay columns. Verify
that you can't successfully run synchronous_replay = on transactions
on standbys that don't currently have a lease, and that you can't
trick it by cutting your network cables with scissors or killing
random processes etc. You might want to verify my claims about clock
drift and the synchronous_replay_lease_time, either mathematically or
experimentally.

Thanks for reading!

[1]: /messages/by-id/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
[2]: /messages/by-id/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com /messages/by-id/CAEepm=1iiEzCVLD=RoBgtZSyEY1CR-Et7fRc9prCZ9MuTz3pWg@mail.gmail.com
/messages/by-id/CAEepm=1iiEzCVLD=RoBgtZSyEY1CR-Et7fRc9prCZ9MuTz3pWg@mail.gmail.com
[3]: /messages/by-id/CA+CSw_tz0q+FQsqh7Zx7xxF99Jm98VaAWGdEP592e7a+zkD_Mw@mail.gmail.com
[4]: /messages/by-id/CAEepm=0W9GmX5uSJMRXkpNEdNpc09a_OMt18XFhf8527EuGGUQ@mail.gmail.com
[5]: /messages/by-id/CAEepm=352uctNiFoN84UN4gtunbeTK-PBLouVe8i_b8ZPcJQFQ@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

Thomas Munro

thomas.munro@gmail.com

about 8 years ago

In reply to: Thomas Munro (#1)

Re: Synchronous replay take III

On Thu, Mar 1, 2018 at 2:39 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I was pinged off-list by a fellow -hackers denizen interested in the
synchronous replay feature and wanting a rebased patch to test. Here
it goes, just in time for a Commitfest. Please skip to the bottom of
this message for testing notes.

Moved to next CF based on
/messages/by-id/24193.1519882945@sss.pgh.pa.us
.

--
Thomas Munro
http://www.enterprisedb.com

Michael Paquier

michael@paquier.xyz

about 8 years ago

In reply to: Thomas Munro (#2)

Re: Synchronous replay take III

On Thu, Mar 01, 2018 at 06:55:18PM +1300, Thomas Munro wrote:

On Thu, Mar 1, 2018 at 2:39 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I was pinged off-list by a fellow -hackers denizen interested in the
synchronous replay feature and wanting a rebased patch to test. Here
it goes, just in time for a Commitfest. Please skip to the bottom of
this message for testing notes.

Moved to next CF based on
/messages/by-id/24193.1519882945@sss.pgh.pa.us

Thanks, Thomas. This looks like the right move to me.
--
Michael

Adam Brusselback

adambrusselback@gmail.com

about 8 years ago

In reply to: Michael Paquier (#3)

Re: Synchronous replay take III

Thanks Thomas, appreciate the rebase and the work you've done on this.
I should have some time to test this out over the weekend.

-Adam

Thomas Munro

thomas.munro@gmail.com

almost 8 years ago

In reply to: Adam Brusselback (#4)

Re: Synchronous replay take III

On Sat, Mar 3, 2018 at 2:11 AM, Adam Brusselback
<adambrusselback@gmail.com> wrote:

Thanks Thomas, appreciate the rebase and the work you've done on this.
I should have some time to test this out over the weekend.

Rebased. Moved to September. I still need to provide a TAP test and
explain that weirdness reported by Dmitry Dolgov, but I didn't get to
that in time for the bonus early commitfest that we're now in.

--
Thomas Munro
http://www.enterprisedb.com

Thomas Munro

thomas.munro@gmail.com

over 7 years ago

In reply to: Thomas Munro (#5)

Re: Synchronous replay take III

On Mon, Jul 2, 2018 at 12:39 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Sat, Mar 3, 2018 at 2:11 AM, Adam Brusselback
<adambrusselback@gmail.com> wrote:

Thanks Thomas, appreciate the rebase and the work you've done on this.
I should have some time to test this out over the weekend.

Rebased.

Rebased.

--
Thomas Munro
http://www.enterprisedb.com

Thomas Munro

thomas.munro@gmail.com

over 7 years ago

In reply to: Thomas Munro (#6)

Re: Synchronous replay take III

On Mon, Sep 24, 2018 at 10:39 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Mon, Jul 2, 2018 at 12:39 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Sat, Mar 3, 2018 at 2:11 AM, Adam Brusselback
<adambrusselback@gmail.com> wrote:

Thanks Thomas, appreciate the rebase and the work you've done on this.
I should have some time to test this out over the weekend.

Rebased.

Rebased.

Rebased.

--
Thomas Munro
http://www.enterprisedb.com

Masahiko Sawada

sawada.mshk@gmail.com

over 7 years ago

In reply to: Thomas Munro (#1)

Re: Synchronous replay take III

On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Hi hackers,

I was pinged off-list by a fellow -hackers denizen interested in the
synchronous replay feature and wanting a rebased patch to test. Here
it goes, just in time for a Commitfest. Please skip to the bottom of
this message for testing notes.

Thank you for working on this. The overview and your summary was
helpful for me to understand this feature, thank you. I've started to
review this patch for PostgreSQL 12. I've tested this patch and found
some issue but let me ask you questions about the high-level design
first. Sorry if these have been already discussed.

In previous threads[1][2][3] I called this feature proposal "causal
reads". That was a terrible name, borrowed from MySQL. While it is
probably a useful term of art, for one thing people kept reading it as
"casual", which it ain't, and more importantly this patch is only one
way to achieve read-follows-write causal consistency. Several others
are proposed or exist in forks (user managed wait-for-LSN, global
transaction manager, ...).

OVERVIEW

For writers, it works a bit like RAID mirroring: when you commit a
write transaction, it waits until the data has become visible on all
elements of the array, and if an array element is not responding fast
enough it is kicked out of the array. For readers, it's a little
different because you're connected directly to the array elements
(rather than going through a central controller), so it uses a system
of leases allowing read transactions to know instantly and whether
they are running on an element that is currently in the array and are
therefore able to service synchronous_replay transactions, or should
raise an error telling you to go and ask some other element.

This is a design choice favouring read-mostly workloads at the expense
of write transactions. Hot standbys' whole raison for existing is to
move *some* read-only workloads off the primary server. This proposal
is for users who are prepared to trade increased primary commit
latency for a guarantee about visibility on the standbys, so that
*all* read-only work could be moved to hot standbys.

To be clear what did you mean read-mostly workloads?

I think there are two kind of reads on standbys: a read happend after
writes and a directly read (e.g. reporting). The former usually
requires the causal reads as you mentioned in order to read its own
writes but the latter might be different: it often wants to read the
latest data on the master at the time. IIUC even if we send a
read-only query directly to a synchronous replay server we could get a
stale result if the standby delayed for less than
synchronous_replay_max_lag. So this synchronous replay feature would
be helpful for the former case(i.e. a few writes and many reads wants
to see them) whereas for the latter case perhaps the keeping the reads
waiting on standby seems a reasonable solution.

Also I think it's worth to consider the cost both causal reads *and*
non-causal reads.

I've considered a mixed workload (transactions requiring causal reads
and transactions not requiring it) on the current design. IIUC the
current design seems like that we create something like
consistent-reads group by specifying servers. For example, if a
transaction doesn't want to causality read it can send query any
server with synchronous_replay = off but if it wants, it should select
a synchronous replay server. It also means that client applications or
routing middlewares such as pgpool is required to be aware of
available synchronous replay standbys. That is, this design would cost
the read-only transactions requiring causal reads. On the other hand,
in token-based causal reads we can send read-only query any standbys
if we can wait for the change to be replayed. Of course if we don't
wait forever we can timeout and switch to either another standby or
the master to execute query but we don't need to choose a server of
standby servers.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Dmitry Dolgov

9erthalion6@gmail.com

over 7 years ago

In reply to: Masahiko Sawada (#8)

Re: Synchronous replay take III

On Thu, Nov 15, 2018 at 6:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:

In previous threads[1][2][3] I called this feature proposal "causal
reads". That was a terrible name, borrowed from MySQL. While it is
probably a useful term of art, for one thing people kept reading it as
"casual"

Yeah, that was rather annoying that I couldn't get rid of this while playing
with the "take II" version :)

To be clear what did you mean read-mostly workloads?

I think there are two kind of reads on standbys: a read happend after
writes and a directly read (e.g. reporting). The former usually
requires the causal reads as you mentioned in order to read its own
writes but the latter might be different: it often wants to read the
latest data on the master at the time. IIUC even if we send a
read-only query directly to a synchronous replay server we could get a
stale result if the standby delayed for less than
synchronous_replay_max_lag. So this synchronous replay feature would
be helpful for the former case(i.e. a few writes and many reads wants
to see them) whereas for the latter case perhaps the keeping the reads
waiting on standby seems a reasonable solution.

Also I think it's worth to consider the cost both causal reads *and*
non-causal reads.

I've considered a mixed workload (transactions requiring causal reads
and transactions not requiring it) on the current design. IIUC the
current design seems like that we create something like
consistent-reads group by specifying servers. For example, if a
transaction doesn't want to causality read it can send query any
server with synchronous_replay = off but if it wants, it should select
a synchronous replay server. It also means that client applications or
routing middlewares such as pgpool is required to be aware of
available synchronous replay standbys. That is, this design would cost
the read-only transactions requiring causal reads. On the other hand,
in token-based causal reads we can send read-only query any standbys
if we can wait for the change to be replayed. Of course if we don't
wait forever we can timeout and switch to either another standby or
the master to execute query but we don't need to choose a server of
standby servers.

Unfortunately, cfbot says that patch can't be applied without conflicts, could
you please post a rebased version and address commentaries from Masahiko?

#10

Thomas Munro

thomas.munro@gmail.com

over 7 years ago

In reply to: Dmitry Dolgov (#9)

Re: Synchronous replay take III

On Sat, Dec 1, 2018 at 9:06 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Unfortunately, cfbot says that patch can't be applied without conflicts, could
you please post a rebased version and address commentaries from Masahiko?

Right, it conflicted with 4c703369 and cfdf4dc4. While rebasing on
top of those, I found myself wondering why syncrep.c thinks it needs
special treatment for postmaster death. I don't see any reason why we
shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in
this new version. If you kill -9 the postmaster, I don't see any
reason to think that the existing coding is more correct than simply
exiting immediately.

On Thu, Nov 15, 2018 at 6:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I was pinged off-list by a fellow -hackers denizen interested in the
synchronous replay feature and wanting a rebased patch to test. Here
it goes, just in time for a Commitfest. Please skip to the bottom of
this message for testing notes.

Thank you for working on this. The overview and your summary was
helpful for me to understand this feature, thank you. I've started to
review this patch for PostgreSQL 12. I've tested this patch and found
some issue but let me ask you questions about the high-level design
first. Sorry if these have been already discussed.

Thanks for your interest in this work!

This is a design choice favouring read-mostly workloads at the expense
of write transactions. Hot standbys' whole raison for existing is to
move *some* read-only workloads off the primary server. This proposal
is for users who are prepared to trade increased primary commit
latency for a guarantee about visibility on the standbys, so that
*all* read-only work could be moved to hot standbys.

To be clear what did you mean read-mostly workloads?

I mean workloads where only a small percentage of transactions perform
a write. If you need write-scalability, then hot_standby is not the
solution for you (with or without this patch).

The kind of user who would be interested in this feature is someone
who already uses some kind of heuristics to move some queries to
read-only standbys. For example, some people send transaction for
logged-in users to the primary database (because only logged-in users
generate write queries), and all the rest to standby servers (for
example "public" users who can only read content). Another technique
I have seen is to keep user sessions "pinned" on the primary server
for N minutes after they perform a write transaction. These types of
load balancing policies are primitive ways of achieving
read-your-writes consistency, but they are conservative and
pessimistic: they probably send too many queries to the primary node.

This proposal is much more precise, allowing you to run the minimum
number of transactions on the primary node (ie transactions that
actually need to perform a write), and the maximum number of
transactions on the hot standbys.

As discussed, making reads wait for a token would be a useful
alternative (and I am willing to help make that work too), but:

1. For users that do more many more reads than writes, would you
rather make (say) 80% of transactions slower or 20%? (Or 99% vs 1% as
the case may be, depending on your application.)

2. If you are also using synchronous_commit = on for increased
durability, then you are already making writers wait, and you might be
able to tolerate a small increase.

Peter Eisentraut expressed an interesting point of view against this
general line of thinking:

/messages/by-id/5643933F.4010701@gmx.net

My questions are: Why do we have hot_standby mode? Is load balancing
a style of usage we want to support? Do we want a technology that
lets you do more of it?

I think there are two kind of reads on standbys: a read happend after
writes and a directly read (e.g. reporting). The former usually
requires the causal reads as you mentioned in order to read its own
writes but the latter might be different: it often wants to read the
latest data on the master at the time. IIUC even if we send a
read-only query directly to a synchronous replay server we could get a
stale result if the standby delayed for less than
synchronous_replay_max_lag. So this synchronous replay feature would
be helpful for the former case(i.e. a few writes and many reads wants
to see them) whereas for the latter case perhaps the keeping the reads
waiting on standby seems a reasonable solution.

I agree 100% that this is not a solution for all users. But I also
suspect a token system would be quite complicated, and can't be done
in a way that is transparent to applications without giving up
performance advantages. I wrote about my understanding of the
trade-offs here:

/messages/by-id/CAEepm=0W9GmX5uSJMRXkpNEdNpc09a_OMt18XFhf8527EuGGUQ@mail.gmail.com

Also I think it's worth to consider the cost both causal reads *and*
non-causal reads.

I've considered a mixed workload (transactions requiring causal reads
and transactions not requiring it) on the current design. IIUC the
current design seems like that we create something like
consistent-reads group by specifying servers. For example, if a
transaction doesn't want to causality read it can send query any
server with synchronous_replay = off but if it wants, it should select
a synchronous replay server. It also means that client applications or
routing middlewares such as pgpool is required to be aware of
available synchronous replay standbys. That is, this design would cost
the read-only transactions requiring causal reads. On the other hand,
in token-based causal reads we can send read-only query any standbys
if we can wait for the change to be replayed. Of course if we don't
wait forever we can timeout and switch to either another standby or
the master to execute query but we don't need to choose a server of
standby servers.

Yeah. I think tools like pgpool that already know how to connect to
the primary and look at pg_stat_replication could use the new column
to learn which servers support synchronous replay, for routing
purposes. I also think that existing read/write load balancing tools
for Python (eg "django-balancer"), Ruby (eg "makara"), Java could be
adjusted to work with this quite easily.

In response to a general question from Simon Riggs at a conference
about how anyone is supposed to use this thing in real life, I wrote a
proof-of-concept Java Spring application that shows the techniques
that I think are required to make good use of it:

https://github.com/macdice/syncreplay-spring-demo

1. Use a transaction management library (this includes Python Django
transaction management, Ruby ActiveRecord IIUC, Java Spring
declarative transactions, ...), so that whole transactions can be
retried automatically. This is generally a good idea anyway because
it lets you retry automatically on serialisation failures and deadlock
errors. The new error 40P02
ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE is just another reason to
retry, in SQL error code class "40" (or perhaps is should be "72"... I
have joked that the new error could be called "snapshot too young"!)

2. Classify transactions (= blocks of code that run a transaction) as
read-write or read-only. This can be done adaptively by remembering
ERRCODE_READ_ONLY_SQL_TRANSACTION errors from previous attempts, or
explicitly using something like Java's @Transactional(readOnly=true)
annotations, so that the transaction management library can
automatically route transactions through the right connection.

3. Automatically avoid standby servers that have recently failed with
40P02 errors.

4. Somehow know which server is the primary (my Java POC doesn't
tackle that problem, but there are various techniques, such as trying
all of them if you start seeing ERRCODE_READ_ONLY_SQL_TRANSACTION from
the server that you expected to be a primary).

The basic idea is that with a little bit of help from your
language-specific transaction management infrastructure, your
application can be 100% unaware, and benefit from load balancing. The
point is that KeyValueController.java knows nothing about any of that
stuff, and all the rest is Spring configuration that allows
transactions to be routed to N database servers. It never shows you
stale data.

--
Thomas Munro
http://www.enterprisedb.com

#11

Michael Paquier

michael@paquier.xyz

over 7 years ago

In reply to: Thomas Munro (#10)

Re: Synchronous replay take III

On Sat, Dec 01, 2018 at 02:48:29PM +1300, Thomas Munro wrote:

On Sat, Dec 1, 2018 at 9:06 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Unfortunately, cfbot says that patch can't be applied without conflicts, could
you please post a rebased version and address commentaries from Masahiko?

Right, it conflicted with 4c703369 and cfdf4dc4. While rebasing on
top of those, I found myself wondering why syncrep.c thinks it needs
special treatment for postmaster death. I don't see any reason why we
shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in
this new version. If you kill -9 the postmaster, I don't see any
reason to think that the existing coding is more correct than simply
exiting immediately.

Hm. This stuff runs under many assumptions, so I think that we should
be careful here with any changes as the very recent history has proved
(4c70336). If we were to switch WAL senders on postmaster death, I
think that this could be a change independent of what is proposed here.
--
Michael

#12

Michail Nikolaev

michail.nikolaev@gmail.com

over 7 years ago

In reply to: Michael Paquier (#11)

Re: Synchronous replay take III

Hello.

It is really nice feature. I am working on the project which heavily reads
from replicas (6 of them).

In our case we have implemented some kind of "replication barrier"
functionality based on table with counters (one counter per application
backend in simple case).
Each application backend have dedicated connection to each replica. And it
selects its counter value few times (2-100) per second from each replica in
background process (depending on how often replication barrier is used).

Once application have committed transaction it may want join replication
barrier before return new data to a user. So, it increments counter in the
table and waits until all replicas have replayed that value according to
background monitoring process. Of course timeout, replicas health checks
and few optimizations and circuit breakers are used.

Nice thing here - constant number of connection involved. Even if lot of
threads joining replication barrier in the moment. Even if some replicas
are lagging.

Because 2-5 seconds lag of some replica will lead to out of connections
issue in few milliseconds in case of implementation described in this
thread.
It may be the weak part of the patch I think. At least for our case. But it
possible could be used to eliminate odd table with counters in my case (if
it possible to change setting per transaction).

Thanks a lot,
Michail.

#13

Thomas Munro

thomas.munro@gmail.com

over 7 years ago

In reply to: Michail Nikolaev (#12)

Re: Synchronous replay take III

Hello,

Here is a rebased patch, and separate replies to Michael and Michail.

On Sat, Dec 1, 2018 at 4:57 PM Michael Paquier <michael@paquier.xyz> wrote:

On Sat, Dec 01, 2018 at 02:48:29PM +1300, Thomas Munro wrote:

Right, it conflicted with 4c703369 and cfdf4dc4. While rebasing on
top of those, I found myself wondering why syncrep.c thinks it needs
special treatment for postmaster death. I don't see any reason why we
shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in
this new version. If you kill -9 the postmaster, I don't see any
reason to think that the existing coding is more correct than simply
exiting immediately.

Hm. This stuff runs under many assumptions, so I think that we should
be careful here with any changes as the very recent history has proved
(4c70336). If we were to switch WAL senders on postmaster death, I
think that this could be a change independent of what is proposed here.

Fair point. I think the effect should be the same with less code:
either way you see the server hang up without sending a COMMIT tag,
but maybe I'm missing something. Change reverted; let's discuss that
another time.

On Mon, Dec 3, 2018 at 9:01 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:

It is really nice feature. I am working on the project which heavily reads from replicas (6 of them).

Thanks for your feedback.

In our case we have implemented some kind of "replication barrier" functionality based on table with counters (one counter per application backend in simple case).
Each application backend have dedicated connection to each replica. And it selects its counter value few times (2-100) per second from each replica in background process (depending on how often replication barrier is used).

Interesting approach. Why don't you sample pg_last_wal_replay_lsn()
on all the standbys instead, so you don't have to generate extra write
traffic?

Once application have committed transaction it may want join replication barrier before return new data to a user. So, it increments counter in the table and waits until all replicas have replayed that value according to background monitoring process. Of course timeout, replicas health checks and few optimizations and circuit breakers are used.

I'm interested in how you handle failure (taking too long to respond
or to see the new counter value, connectivity failure etc).
Specifically, if the writer decides to give up on a certain standby
(timeout, circuit breaker etc), how should a client that is connected
directly to that standby now or soon afterwards know that this standby
has been 'dropped' from the replication barrier and it's now at risk
of seeing stale data? My patch handles this by cancelling standbys'
leases explicitly and waiting for a response (if possible), but
otherwise waiting for them to expire (say if connectivity is lost or
standby has gone crazy or stopped responding), so that there is no
scenario where someone can successfully execute queries on a standby
that hasn't applied a transaction that you know to be committed on the
primary.

Nice thing here - constant number of connection involved. Even if lot of threads joining replication barrier in the moment. Even if some replicas are lagging.

Because 2-5 seconds lag of some replica will lead to out of connections issue in few milliseconds in case of implementation described in this thread.

Right, if a standby is lagging more than the allowed amount, in my
patch the lease is cancelled and it will refuse to handle requests if
the GUC is on, with a special new error code, and then it's up to the
client to decide what to do. Probably find another node.

It may be the weak part of the patch I think. At least for our case.

Could you please elaborate? What could you do that would be better?
If the answer is that you just want to know that you might be seeing
stale data but for some reason you don't want to have to find a new
node, the reader is welcome to turn synchronous_standby off and try
again (giving up data freshness guarantees). Not sure when that would
be useful though.

But it possible could be used to eliminate odd table with counters in my case (if it possible to change setting per transaction).

Yes, the behaviour can be activated per transaction, using the usual
GUC scoping rules. The setting synchronous_replay must be on in both
the write transaction and the following read transaction for the logic
to work (ie for the writer to wait, and for the reader to make sure
that it has a valid lease or raise an error).

It sounds like my synchronous_replay GUC is quite similar to your
replication barrier system, except that it has a way to handle node
failure and excessive lag without abandoning the guarantee.

I've attached a small shell script that starts up a primary and N
replicas with synchronous_replay configured, in the hope of
encouraging you to try it out.

--
Thomas Munro
http://www.enterprisedb.com

#14

Masahiko Sawada

sawada.mshk@gmail.com

over 7 years ago

In reply to: Thomas Munro (#10)

Re: Synchronous replay take III

On Sat, Dec 1, 2018 at 10:49 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Sat, Dec 1, 2018 at 9:06 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Unfortunately, cfbot says that patch can't be applied without conflicts, could
you please post a rebased version and address commentaries from Masahiko?

Right, it conflicted with 4c703369 and cfdf4dc4. While rebasing on
top of those, I found myself wondering why syncrep.c thinks it needs
special treatment for postmaster death. I don't see any reason why we
shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in
this new version. If you kill -9 the postmaster, I don't see any
reason to think that the existing coding is more correct than simply
exiting immediately.

On Thu, Nov 15, 2018 at 6:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I was pinged off-list by a fellow -hackers denizen interested in the
synchronous replay feature and wanting a rebased patch to test. Here
it goes, just in time for a Commitfest. Please skip to the bottom of
this message for testing notes.

Thank you for working on this. The overview and your summary was
helpful for me to understand this feature, thank you. I've started to
review this patch for PostgreSQL 12. I've tested this patch and found
some issue but let me ask you questions about the high-level design
first. Sorry if these have been already discussed.

Thanks for your interest in this work!

This is a design choice favouring read-mostly workloads at the expense
of write transactions. Hot standbys' whole raison for existing is to
move *some* read-only workloads off the primary server. This proposal
is for users who are prepared to trade increased primary commit
latency for a guarantee about visibility on the standbys, so that
*all* read-only work could be moved to hot standbys.

To be clear what did you mean read-mostly workloads?

I mean workloads where only a small percentage of transactions perform
a write. If you need write-scalability, then hot_standby is not the
solution for you (with or without this patch).

The kind of user who would be interested in this feature is someone
who already uses some kind of heuristics to move some queries to
read-only standbys. For example, some people send transaction for
logged-in users to the primary database (because only logged-in users
generate write queries), and all the rest to standby servers (for
example "public" users who can only read content). Another technique
I have seen is to keep user sessions "pinned" on the primary server
for N minutes after they perform a write transaction. These types of
load balancing policies are primitive ways of achieving
read-your-writes consistency, but they are conservative and
pessimistic: they probably send too many queries to the primary node.

This proposal is much more precise, allowing you to run the minimum
number of transactions on the primary node (ie transactions that
actually need to perform a write), and the maximum number of
transactions on the hot standbys.

As discussed, making reads wait for a token would be a useful
alternative (and I am willing to help make that work too), but:

1. For users that do more many more reads than writes, would you
rather make (say) 80% of transactions slower or 20%? (Or 99% vs 1% as
the case may be, depending on your application.)

2. If you are also using synchronous_commit = on for increased
durability, then you are already making writers wait, and you might be
able to tolerate a small increase.

Peter Eisentraut expressed an interesting point of view against this
general line of thinking:

/messages/by-id/5643933F.4010701@gmx.net

My questions are: Why do we have hot_standby mode? Is load balancing
a style of usage we want to support? Do we want a technology that
lets you do more of it?

I think there are two kind of reads on standbys: a read happend after
writes and a directly read (e.g. reporting). The former usually
requires the causal reads as you mentioned in order to read its own
writes but the latter might be different: it often wants to read the
latest data on the master at the time. IIUC even if we send a
read-only query directly to a synchronous replay server we could get a
stale result if the standby delayed for less than
synchronous_replay_max_lag. So this synchronous replay feature would
be helpful for the former case(i.e. a few writes and many reads wants
to see them) whereas for the latter case perhaps the keeping the reads
waiting on standby seems a reasonable solution.

I agree 100% that this is not a solution for all users. But I also
suspect a token system would be quite complicated, and can't be done
in a way that is transparent to applications without giving up
performance advantages. I wrote about my understanding of the
trade-offs here:

/messages/by-id/CAEepm=0W9GmX5uSJMRXkpNEdNpc09a_OMt18XFhf8527EuGGUQ@mail.gmail.com

Thank you for explaning. I understood the use-cases of this feature
and token-based causal reads.

Also I think it's worth to consider the cost both causal reads *and*
non-causal reads.

I've considered a mixed workload (transactions requiring causal reads
and transactions not requiring it) on the current design. IIUC the
current design seems like that we create something like
consistent-reads group by specifying servers. For example, if a
transaction doesn't want to causality read it can send query any
server with synchronous_replay = off but if it wants, it should select
a synchronous replay server. It also means that client applications or
routing middlewares such as pgpool is required to be aware of
available synchronous replay standbys. That is, this design would cost
the read-only transactions requiring causal reads. On the other hand,
in token-based causal reads we can send read-only query any standbys
if we can wait for the change to be replayed. Of course if we don't
wait forever we can timeout and switch to either another standby or
the master to execute query but we don't need to choose a server of
standby servers.

Yeah. I think tools like pgpool that already know how to connect to
the primary and look at pg_stat_replication could use the new column
to learn which servers support synchronous replay, for routing
purposes. I also think that existing read/write load balancing tools
for Python (eg "django-balancer"), Ruby (eg "makara"), Java could be
adjusted to work with this quite easily.

Agreed.

In response to a general question from Simon Riggs at a conference
about how anyone is supposed to use this thing in real life, I wrote a
proof-of-concept Java Spring application that shows the techniques
that I think are required to make good use of it:

https://github.com/macdice/syncreplay-spring-demo

1. Use a transaction management library (this includes Python Django
transaction management, Ruby ActiveRecord IIUC, Java Spring
declarative transactions, ...), so that whole transactions can be
retried automatically. This is generally a good idea anyway because
it lets you retry automatically on serialisation failures and deadlock
errors. The new error 40P02
ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE is just another reason to
retry, in SQL error code class "40" (or perhaps is should be "72"... I
have joked that the new error could be called "snapshot too young"!)

2. Classify transactions (= blocks of code that run a transaction) as
read-write or read-only. This can be done adaptively by remembering
ERRCODE_READ_ONLY_SQL_TRANSACTION errors from previous attempts, or
explicitly using something like Java's @Transactional(readOnly=true)
annotations, so that the transaction management library can
automatically route transactions through the right connection.

3. Automatically avoid standby servers that have recently failed with
40P02 errors.

4. Somehow know which server is the primary (my Java POC doesn't
tackle that problem, but there are various techniques, such as trying
all of them if you start seeing ERRCODE_READ_ONLY_SQL_TRANSACTION from
the server that you expected to be a primary).

The basic idea is that with a little bit of help from your
language-specific transaction management infrastructure, your
application can be 100% unaware, and benefit from load balancing. The
point is that KeyValueController.java knows nothing about any of that
stuff, and all the rest is Spring configuration that allows
transactions to be routed to N database servers. It never shows you
stale data.

Thank you! I'll try it.

Regarding the current (v10 patch) design I have some questions and
comments.

The patch introduces new GUC parameter synchronous_replay. We can set
synchronous_commit = off while setting synchronous_replay = on. With
this setting, the backend will synchrnously wait for standbys to
replay. I'm concerned that having two separate GUC parameters
controling the transaction commit behaviour would confuse users. It's
a just idea but maybe we can use 'remote_apply' for synchronous replay
purpose and introduce new parameter for standby server something like
allow_stale_read.

If while a transaction is waiting for all standbys to replay they
became to unavailable state, should the waiter be released? the patch
seems not to release the waiter. Similarly, wal senders are not aware
of postgresql.conf change while waiting synchronous replay. I think we
should call SyncReplayPotentialStandby() in SyncRepInitConfig().

With the setting synchronous_standby_names = '' and
synchronous_replay_standby_names = '*' we would get the standby's
status in pg_stat_replication, sync_state = 'async' and sync_replay =
'available'. It looks odd to me. Yes, this status is correct in
principle. But considering the architecture of PostgreSQL replication
this status is impossible.

The synchronous_replay_standby_name = '*' setting means that the
backend wait for all standbys connected to the master server to
replay, is that right? In my test, even when some of synchronous
replay standby servers got stuck and then therefore are revoked their
lease, the backend could proceed transactions.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#15

Thomas Munro

thomas.munro@gmail.com

over 7 years ago

In reply to: Masahiko Sawada (#14)

Re: Synchronous replay take III

On Tue, Jan 15, 2019 at 11:17 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Regarding the current (v10 patch) design I have some questions and
comments.

Hi Sawada-san,

Thanks for your testing and feedback.

The patch introduces new GUC parameter synchronous_replay. We can set
synchronous_commit = off while setting synchronous_replay = on. With
this setting, the backend will synchrnously wait for standbys to
replay. I'm concerned that having two separate GUC parameters
controling the transaction commit behaviour would confuse users. It's
a just idea but maybe we can use 'remote_apply' for synchronous replay
purpose and introduce new parameter for standby server something like
allow_stale_read.

That is an interesting idea. That choice means that the new mode
always implies synchronous_commit = on (since remote_apply is a
"higher" level). I wanted them to be independent, so you could
express your durability requirement separately from your visibility
requirement.

Concretely, if none of your potential sync replay standbys are keeping
up and they are all dropped to "unavailable", then you'd be able to
see a difference: with your proposal we'd still have a synchronous
commit wait, but with mine that could independently be on or off.

Generally, I think we are too restrictive in our durability levels,
and there was some discussion about whether it's OK to have a strict
linear knob (which your idea extends):

/messages/by-id/CAEepm=3FFaanSS4sugG+Apzq2tCVjEYCO2wOQBod2d7GWb=DvA@mail.gmail.com

Hmm, perhaps your way would be better for now anyway, just because
it's simpler to understand and explain. Perhaps you wouldn't need a
separate "allow_stale_read" GUC, you could just set synchronous_commit
to a lower level when talking to the standby. (That is, give
synchronous_commit a meaning on standbys, whereas currently it has no
effect there.)

If while a transaction is waiting for all standbys to replay they
became to unavailable state, should the waiter be released? the patch
seems not to release the waiter. Similarly, wal senders are not aware
of postgresql.conf change while waiting synchronous replay. I think we
should call SyncReplayPotentialStandby() in SyncRepInitConfig().

Good point about the postgresql.conf change.

If all the standbys go to unavailable state, then a waiter should be
released once they have all either acknowledged that they are
unavailable (ie acknowledged that their lease has been revoked, via a
reply message with a serial number matching the revocation message),
or if that doesn't happen (due to lost network connection, crashed
process etc), once the any leases that have been issued have expired
(ie a few seconds). Is that not what you see?

With the setting synchronous_standby_names = '' and
synchronous_replay_standby_names = '*' we would get the standby's
status in pg_stat_replication, sync_state = 'async' and sync_replay =
'available'. It looks odd to me. Yes, this status is correct in
principle. But considering the architecture of PostgreSQL replication
this status is impossible.

Yes, this is essentially the same thing that you were arguing against
above. Perhaps you are right, and there are no people who would want
synchronous replay, but not synchronous commit.

The synchronous_replay_standby_name = '*' setting means that the
backend wait for all standbys connected to the master server to
replay, is that right? In my test, even when some of synchronous
replay standby servers got stuck and then therefore are revoked their
lease, the backend could proceed transactions.

It means that it waits for all standbys that are "available" to
replay. It doesn't wait for the "unavailable" ones. Most of the
patch deals with the transitions between those states. During an
available->revoking->unavailable transition, we also wait for the
standby to know that it is unavailable (so that it begins to raise
errors), and during an unavailable->joining->available transition we
also wait for the standby to replay the transition LSN (so that it
stops raising errors). That way clients on the standby can rely on
the error (or lack of error) to tell them whether their snapshot
definitely contains every commit that has returned control on the
primary.

--
Thomas Munro
http://www.enterprisedb.com

#16

Michail Nikolaev

michail.nikolaev@gmail.com

over 7 years ago

In reply to: Thomas Munro (#15)

Re: Synchronous replay take III

Hello,

Sorry, missed email.

In our case we have implemented some kind of "replication barrier"

functionality based on table with counters (one counter per application
backend in simple case).

Each application backend have dedicated connection to each replica. And

it selects its counter value few times (2-100) per second from each replica
in background process (depending on how often replication barrier is used).

Interesting approach. Why don't you sample pg_last_wal_replay_lsn()
on all the standbys instead, so you don't have to generate extra write
traffic?

Replay lsn was the first approach I tried. I was sampling 'select
replay_lsn from pg_stat_replication' on master to get info about replay
position on replicas.
However, for some unknown reason I was not able to get it to work. Because
after replay_lsn was reached - standby was unable to see the data.
I know it should not happen. I spend few days on debugging... And… Since I
was required to ping replicas anyway (to check if it is a master already,
monitor ping, locks, connections, etc.) - I have decided to introduce table
for now.

Once application have committed transaction it may want join replication

barrier before return new data to a user. So, it increments counter in the
table and waits until all replicas have replayed that value according to
background monitoring process. Of course timeout, replicas health checks
and few optimizations and circuit breakers are used.

I'm interested in how you handle failure (taking too long to respond
or to see the new counter value, connectivity failure etc).
Specifically, if the writer decides to give up on a certain standby
(timeout, circuit breaker etc), how should a client that is connected
directly to that standby now or soon afterwards know that this standby
has been 'dropped' from the replication barrier and it's now at risk
of seeing stale data?

Each standby has some health flags attached to it. Health is "red" when:
* can't connect to replica, or all connections are in use
* replica lag according to pg_last_xact_replay_timestamp is more than 3000ms
* replica lag according to pg_last_xact_replay_timestamp was more than
3000ms some time ago (10000ms)
* replica is new master now
* etc.

In case of replication barrier, we are waiting only for "green" replicas
and max for 5000ms. If we still no able to see new counter value on some
replicas - it is up to client to decide how to process it. In our case, it
means replica is lagging more than 3000ms - so it is "red" now and next
client request will dispatched to another "green" replica. It is done by
special connection pool with balancer inside.
Not sure it is all 100% correct, but we could just proceed in our case.

someone can successfully execute queries on a standby
that hasn't applied a transaction that you know to be committed on the
primary.

Nice thing here - constant number of connection involved. Even if lot of

threads joining replication barrier in the moment. Even if some replicas
are lagging.

Because 2-5 seconds lag of some replica will lead to out of connections

issue in few milliseconds in case of implementation described in this
thread.

Right, if a standby is lagging more than the allowed amount, in my
patch the lease is cancelled and it will refuse to handle requests if
the GUC is on, with a special new error code, and then it's up to the
client to decide what to do. Probably find another node.
In case of high loded

Could you please elaborate? What could you do that would be better?
If the answer is that you just want to know that you might be seeing
stale data but for some reason you don't want to have to find a new
node, the reader is welcome to turn synchronous_standby off and try
again (giving up data freshness guarantees). Not sure when that would
be useful though.

Main problem I see here - is master connection usage. We have about 10.000
RPS on master. So, small lag on some replica (we have six of them) will
lead all master connections to be waiting for replay on stale replica until
timeout. It is out of service for whole system. Even if it lagged for
200-300ms (in real world it could lag for seconds on regular basis).

If we set synchronous_replay_max_lag to 10-20ms - standbys will be
cancelled all the time.
In our case, we are using constant amount of connections involved. In
addition, client requests are waiting for standby replay
inside application backend thread without blocking master connection. This
is the main difference as I think.

I've attached a small shell script that starts up a primary and N
replicas with synchronous_replay configured, in the hope of
encouraging you to try it out.

Thanks - will try and report.

#17

Robert Haas

robertmhaas@gmail.com

over 7 years ago

In reply to: Thomas Munro (#15)

Re: Synchronous replay take III

On Thu, Jan 24, 2019 at 6:42 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Yes, this is essentially the same thing that you were arguing against
above. Perhaps you are right, and there are no people who would want
synchronous replay, but not synchronous commit.

Maybe I'm misunderstanding the terminology here, but if not, I find
this theory wildly implausible. *Most* people want read-your-writes
behavior. *Few* people want to wait for a dead standby. The only
application of the later is when even a tiny risk of transaction loss
is unacceptable, but the former has all kinds of clustering-related
uses.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18

Michael Paquier

michael@paquier.xyz

over 7 years ago

In reply to: Robert Haas (#17)

Re: Synchronous replay take III

On Fri, Feb 01, 2019 at 09:34:49AM -0500, Robert Haas wrote:

Maybe I'm misunderstanding the terminology here, but if not, I find
this theory wildly implausible. *Most* people want read-your-writes
behavior. *Few* people want to wait for a dead standby. The only
application of the later is when even a tiny risk of transaction loss
is unacceptable, but the former has all kinds of clustering-related
uses.

Last patch set fails to apply properly, so moved to next CF waiting on
author for a rebase.
--
Michael

#19

Thomas Munro

thomas.munro@gmail.com

over 7 years ago

In reply to: Michael Paquier (#18)

Re: Synchronous replay take III

On Mon, Feb 4, 2019 at 4:47 PM Michael Paquier <michael@paquier.xyz> wrote:

Last patch set fails to apply properly, so moved to next CF waiting on
author for a rebase.

Thanks. Rebased.

--
Thomas Munro
http://www.enterprisedb.com

Synchronous replay take III

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments: