Proposal: "Causal reads" mode for load balancing reads without stale data
Hi hackers,
Many sites use hot standby servers to spread read-heavy workloads over more
hardware, or at least would like to. This works well today if your
application can tolerate some time lag on standbys. The problem is that
there is no guarantee of when a particular commit will become visible for
clients connected to standbys. The existing synchronous commit feature is
no help here because it guarantees only that the WAL has been flushed on
another server before commit returns. It says nothing about whether it has
been applied or whether it has been applied on the standby that you happen
to be talking to.
A while ago I posted a small patch[1] to allow synchronous_commit to wait
for remote apply on the current synchronous standby, but (as Simon Riggs
rightly pointed out in that thread) that part isn't the main problem. It
seems to me that the main problem for a practical 'writer waits' system is
how to support a dynamic set of servers, gracefully tolerating failures and
timing out laggards, while also providing a strong guarantee during any
such transitions. Now I would like to propose something to do that, and
share a proof-of-concept patch.
=== PROPOSAL ===
The working name for the proposed feature is "causal reads", because it
provides a form of "causal consistency"[2] (and "read-your-writes"
consistency) no matter which server the client is connected to. There is a
similar feature by the same name in another product (albeit implemented
differently -- 'reader waits'; more about that later). I'm not wedded to
the name.
The feature allows arbitrary read-only transactions to be run on any hot
standby, with a specific guarantee about the visibility of preceding
transactions. The guarantee is that if you set a new GUC "causal_reads =
on" in any pair of consecutive transactions (tx1, tx2) where tx2 begins
after tx1 successfully returns, then tx2 will either see tx1 or fail with a
new error "standby is not available for causal reads", no matter which
server it runs on. A discovery mechanism is also provided, giving an
instantaneous snapshot of the set of standbys that are currently available
for causal reads (ie won't raise the error), in the form of a new column in
pg_stat_replication.
For example, a web server might run tx1 to insert a new row representing a
message in a discussion forum on the primary server, and then send the user
to another web page that runs tx2 to load all messages in the forum on an
arbitrary hot standby server. If causal_reads = on in both tx1 and tx2
(for example, because it's on globally), then tx2 is guaranteed to see the
new post, or get a (hopefully rare) error telling the client to retry on
another server.
Very briefly, the approach is:
1. The primary tracks apply lag on each standby (including between
commits).
2. The primary deems standbys 'available' for causal reads if they are
applying WAL and replying to keepalives fast enough, and periodically sends
the standby an authorization to consider itself available for causal reads
until a time in the near future.
3. Commit on the primary with "causal_reads = on" waits for all
'available' standbys either to apply the commit record, or to cease to be
'available' and begin raising the error if they are still alive (because
their authorizations have expired).
4. Standbys can start causal reads transactions only while they have an
authorization with an expiry time in the future; otherwise they raise an
error when an initial snapshot is taken.
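The four steps above can be sketched as a toy model (a minimal sketch in Python; the class and function names and the single-process structure are my own simplification, not the patch's actual data structures):

```python
LEASE_SECONDS = 4.0  # plays the role of the causal_reads_timeout GUC

class Standby:
    """Toy model of one standby's causal-reads authorization (lease)."""
    def __init__(self):
        self.lease_expiry = 0.0  # no authorization yet
        self.applied_lsn = 0     # WAL position applied so far

    def grant_lease(self, now):
        # Step 2: the primary periodically authorizes a fast-enough standby
        # to consider itself 'available' until a time in the near future.
        self.lease_expiry = now + LEASE_SECONDS

    def can_serve_causal_reads(self, now):
        # Step 4: a causal-reads transaction may start only while the
        # authorization has not expired; otherwise the standby raises
        # "standby is not available for causal reads".
        return now < self.lease_expiry

def commit_can_return(standbys, commit_lsn, now):
    # Step 3: a causal-reads commit waits until every standby has either
    # applied the commit record, or lost its lease (in which case it will
    # raise the error rather than show stale data).
    return all(s.applied_lsn >= commit_lsn or not s.can_serve_causal_reads(now)
               for s in standbys)
```

The point of the lease is that the guarantee survives transitions: a standby that stops replying is not dropped from the commit wait set until its authorization has provably expired, so it can never serve a causal-reads snapshot that misses a commit the primary has already acknowledged.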
In a follow-up email I can write about the design trade-offs considered
(mainly 'writer waits' vs 'reader waits'), comparison with some other
products, method of estimating replay lag, wait and timeout logic and how
it maintains the guarantee in various failure scenarios, logic for standbys
joining and leaving, implications of system clock skew between servers, or
any other questions you may have, depending on feedback/interest (but see
comments in the attached patch for some of those subjects). For now I
didn't want to clog up the intertubes with too large a wall of text.
=== PROOF-OF-CONCEPT ===
Please see the POC patch attached. It adds two new GUCs. After setting up
one or more hot standbys as per usual, simply add "causal_reads_timeout =
4s" to the primary's postgresql.conf and restart. Now, you can set
"causal_reads = on" in some/all sessions to get guaranteed causal
consistency. Expected behaviour: the causal reads guarantee is maintained
at all times, even when you overwhelm, kill, crash, disconnect, restart,
pause, add and remove standbys, and the primary drops them from the set it
waits for in a timely fashion. You can monitor the system with the
replay_lag and causal_reads_status columns in pg_stat_replication and some
state transition LOG messages on the primary. (The patch also supports
"synchronous_commit = apply", but it's not clear how useful that is in
practice, as already discussed.)
Lastly, a few notes about how this feature relates to some other work:
The current version of this patch has causal_reads as a feature separate
from synchronous_commit, from a user's point of view. The thinking behind
this is that load balancing and data loss avoidance are separate concerns:
synchronous_commit deals with the latter, and causal_reads with the
former. That said, existing SyncRep machinery is obviously used
(specifically SyncRep queues, with a small modification, as a way to wait
for apply messages to arrive from standbys). (An earlier prototype had
causal reads as a new level for synchronous_commit and associated states as
new walsender states above 'streaming'. When contemplating how to combine
this proposal with the multiple-synchronous-standby patch, some colleagues
and I came around to the view that the concerns are separate. The reason
for wanting to configure complicated quorum definitions is to control data
loss risks and has nothing to do with load balancing requirements, so we
thought the features should probably be separate.)
The multiple-synchronous-servers patch[3] could be applied or not
independently of this feature as a result of that separation, as it doesn't
use synchronous_standby_names or indeed any kind of statically defined
quorum.
The standby WAL writer patch[4] would significantly improve walreceiver
performance and smoothness, which would work very well with this proposal.
Please let me know what you think!
Thanks,
[1]: /messages/by-id/CAEepm=1fqkivL4V-OTPHwSgw4aF9HcoGiMrCW-yBtjipX9gsag@mail.gmail.com
[2]: From http://queue.acm.org/detail.cfm?id=1466448
"Causal consistency. If process A has communicated to process B that it has
updated a data item, a subsequent access by process B will return the
updated value, and a write is guaranteed to supersede the earlier write.
Access by process C that has no causal relationship to process A is subject
to the normal eventual consistency rules.
Read-your-writes consistency. This is an important model where process A,
after it has updated a data item, always accesses the updated value and
will never see an older value. This is a special case of the causal
consistency model."
[3]: /messages/by-id/CAOG9ApHYCPmTypAAwfD3_V7sVOkbnECFivmRc1AxhB40ZBSwNQ@mail.gmail.com
[4]: /messages/by-id/CA+U5nMJifauXvVbx=v3UbYbHO3Jw2rdT4haL6CCooEDM5=4ASQ@mail.gmail.com
--
Thomas Munro
http://www.enterprisedb.com
Attachment: causal-reads-poc.patch
On 11 November 2015 at 05:37, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
> Many sites use hot standby servers to spread read-heavy workloads over more
> hardware, or at least would like to. This works well today if your
> application can tolerate some time lag on standbys. The problem is that
> there is no guarantee of when a particular commit will become visible for
> clients connected to standbys. [...]
Thanks for working on this issue.
> 3. Commit on the primary with "causal_reads = on" waits for all
> 'available' standbys either to apply the commit record, or to cease to be
> 'available' and begin raising the error if they are still alive (because
> their authorizations have expired).
This causes every writer to wait.
What we want is to isolate the wait only to people performing a write-read
sequence, so I think it should be readers that wait. Let's have that debate
up front before we start reviewing the patch.
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/11/2015 10:23 AM, Simon Riggs wrote:
> On 11 November 2015 at 05:37, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> Many sites use hot standby servers to spread read-heavy workloads over
>> more hardware, or at least would like to. [...]
> Thanks for working on this issue.
+1.
>> 3. Commit on the primary with "causal_reads = on" waits for all
>> 'available' standbys either to apply the commit record, or to cease to be
>> 'available' and begin raising the error if they are still alive (because
>> their authorizations have expired).
> This causes every writer to wait.
> What we want is to isolate the wait only to people performing a write-read
> sequence, so I think it should be readers that wait. Let's have that debate
> up front before we start reviewing the patch.
Agreed. And in the write-read sequence, you don't need to wait at the
write either, it's enough that you wait just before you start doing the
read. An application might do a lot of other things between the two, so
that in most cases, there would in fact be no waiting as the record is
already applied when you perform the read.
I'm thinking the client should get some kind of a token back from the
commit, and it could use the token on the standby, to wait for that
commit to be applied. The token could be just the XID, or the LSN of the
commit record. Or the application could generate the token and pass it
to the server in the commit, similar to how 2PC works. So the
interaction would be something like:
In master:
BEGIN;
INSERT INTO FOO ...;
COMMIT;
Server returns: COMMITted with token 1234
Later, in standby:
BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
SELECT * FROM foo;
...
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
> I'm thinking the client should get some kind of a token back from the
> commit, and it could use the token on the standby, to wait for that commit
> to be applied. The token could be just the XID, or the LSN of the commit
> record. [...]
+1.
The LSN should be good enough IMO.
On Wed, Nov 11, 2015 at 9:42 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm thinking the client should get some kind of a token back from the
> commit, and it could use the token on the standby, to wait for that commit
> to be applied. The token could be just the XID, or the LSN of the commit
> record. [...] So the interaction would be something like:
> In master:
> BEGIN;
> INSERT INTO FOO ...;
> COMMIT;
> Server returns: COMMITted with token 1234
> Later, in standby:
> BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
> SELECT * FROM foo;
> ...
I thought about this question, and considered three different approaches:
1. Reader waits with exposed LSNs, as Heikki suggests. This is what
BerkeleyDB does in "read-your-writes" mode. It means that application
developers have the responsibility for correctly identifying transactions
with causal dependencies and dealing with LSNs (or whatever equivalent
tokens), potentially even passing them to other processes where the
transactions are causally dependent but run by multiple communicating
clients (for example, communicating microservices). This makes it
difficult to retrofit load balancing to pre-existing applications and (like
anything involving concurrency) difficult to reason about as applications
grow in size and complexity. It is efficient if done correctly, but it is
a tax on application complexity.
2. Reader waits for a conservatively chosen LSN. This is roughly what
MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait = 1"
modes. Read transactions would start off by finding the current end of WAL
on the primary, since that must be later than any commit that already
completed, and then waiting for that to apply locally. That means every
read transaction waits for a complete replication lag period, potentially
unnecessarily. This is tax on readers with unnecessary waiting.
3. Writer waits, as proposed. In this model, there is no tax on readers
(they have zero overhead, aside from the added complexity of dealing with
the possibility of transactions being rejected when a standby falls behind
and is dropped from 'available' status; but database clients must already
deal with certain types of rare rejected queries/failures such as
deadlocks, serialization failures, server restarts etc). This is a tax on
writers.
My thinking was that the reason for wanting to load balance over a set of
hot standbys is because you have a very read-heavy workload, so it makes
sense to tax the writers and leave the many dominant readers unburdened, so
(3) should be better than (2) for the majority of users who want such a
configuration. (Note also that it's not a requirement to tax every write;
with this proposal you can set causal_reads to off for those transactions
where you know there is no possibility of a causally dependent read).
As for (1), my thinking was that most application developers would probably
prefer not to have to deal with that type of interface. For users who do
want to do that, it would be comparatively simple to make that possible,
and would not conflict with this proposal. This proposal could be used by
people retrofitting load balancing to existing applications with
relative ease, or by people simply not wanting to deal with LSNs and
complexity. (I have considered proposing
pg_wait_for_xlog_replay_location(lsn, timeout) separately, which could be
called on a standby with the lsn obtained from pg_current_xlog_location()
on the primary any time after a COMMIT completes, but I was thinking of
that as a different feature addressing a different user base: people
prepared to do more work to squeeze out some extra performance.)
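For what it's worth, the client-side comparison such a function implies is simple even without server support. Here is a hedged sketch (parse_lsn and standby_caught_up are hypothetical helper names of my own) of comparing the 'X/Y' text form returned by pg_current_xlog_location() on the primary against pg_last_xlog_replay_location() polled on the standby:

```python
def parse_lsn(text):
    """Convert an LSN in 'X/Y' text form (two hexadecimal fields, as
    returned by pg_current_xlog_location()) into one 64-bit WAL position."""
    hi, lo = text.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

def standby_caught_up(replay_lsn_text, token_lsn_text):
    # True once the standby's pg_last_xlog_replay_location() has reached
    # the token LSN captured on the primary after COMMIT; a client could
    # poll this instead of calling the proposed
    # pg_wait_for_xlog_replay_location(lsn, timeout).
    return parse_lsn(replay_lsn_text) >= parse_lsn(token_lsn_text)
```

Note that comparing the raw text doesn't work, because the fields are variable-width hex ('17/0' sorts before '16/FFFFFFFF' as a string but is a later WAL position), hence the conversion to a single integer first.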
--
Thomas Munro
http://www.enterprisedb.com
On Wed, Nov 11, 2015 at 11:22 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Nov 11, 2015 at 9:42 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> On 11/11/2015 10:23 AM, Simon Riggs wrote:
>>> Thanks for working on this issue.
>> +1.
+1. I have seen a lot of interest for something along these lines.
> I'm thinking the client should get some kind of a token back from the
> commit, and it could use the token on the standby, to wait for that commit
> to be applied. The token could be just the XID, or the LSN of the commit
> record. [...]
To avoid read anomalies (backwards timetravel) it should also be
possible to receive a token from read-only transactions based on the
latest snapshot used.
> My thinking was that the reason for wanting to load balance over a set of
> hot standbys is because you have a very read-heavy workload, so it makes
> sense to tax the writers and leave the many dominant readers unburdened, so
> (3) should be better than (2) for the majority of users who want such a
> configuration. [...]
Although I still think that (1) is the correct long-term solution, I
must say that I agree with the reasoning presented. I think we should
review the API in the light that in the future we might have a mix of
clients, some clients that are able to keep track of causality tokens
and either want to wait when a read request arrives, or pick a host to
use based on the token, and then there are "dumb" clients that want to
use write side waits.
Also, it should be possible to configure which standbys are considered
for waiting on. Otherwise a reporting slave will occasionally catch up
enough to be considered "available" and then cause a latency peak when
a long query blocks apply again.
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On Wed, Nov 11, 2015 at 3:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
This causes every writer to wait.
What we want is to isolate the wait only to people performing a write-read
sequence, so I think it should be readers that wait. Let's have that debate
up front before we start reviewing the patch.
One advantage of having writers wait is that the master and its read
slaves can't ever get too far apart. Suppose the master is generating
WAL much faster than the read slaves (or one of them) can replay it.
You might say it sucks to slow down the master to the speed the slaves
can keep up with, and that's true. On the other hand, if the master
is allowed to run ahead, then a process that sends a read query to a
standby which has gotten far behind might have to wait minutes or
hours for it to catch up. I think a lot of people are enabling
synchronous replication today just for the purpose of avoiding this
problem - keeping the two machines "together in time" makes the
overall system behavior a lot more predictable.
Also, if we made readers wait, wouldn't that require a network
roundtrip to the master every time a query on a reader wanted a new
snapshot? That seems like it would be unbearably expensive.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 11/11/15 4:22 AM, Thomas Munro wrote:
My thinking was that the reason for wanting to load balance over a set
of hot standbys is because you have a very read-heavy workload, so it
makes sense to tax the writers and leave the many dominant readers
unburdened, so (3) should be better than (2) for the majority of users
who want such a configuration.
One problem I can see is that even if you have a read-heavy workload,
the writes can still be a bottleneck, since they are necessarily bound
to one node. And so if the feature proposal is, we can make your reads
more consistent but the writes will become slower, then that's not a
good deal.
More generally, no matter whether you pick the writers or the readers to
wait, if you assume that read-only slaves are an application performance
feature, then it's questionable how much better such applications will
perform overall when network-bound waits are introduced in the system.
I think in practice applications that are busy enough to worry about
this don't really work like that anyway. For example, the writes should
go to a message queue and are written out whenever, with a copy kept in
a cache for display in the meantime. Maybe there could be additional
features to make managing this easier.
I think there are a lot of different variations of this in practice, not
only depending on the workload and other measurables, but also
business-dependent decisions on application behavior and degradability.
On Thu, Nov 12, 2015 at 12:10 AM, Ants Aasma <ants.aasma@eesti.ee> wrote:
> Although I still think that 1) is the correct long term solution I
> must say that I agree with the reasoning presented. I think we should
> review the API in the light that in the future we might have a mix of
> clients, some clients that are able to keep track of causality tokens
> and either want to wait when a read request arrives, or pick a host to
> use based on the token, and then there are "dumb" clients that want to
> use write side waits.
Exactly!
I see the causality tokens approach (thank you for that terminology) not so
much as a "long term" solution, but rather as an expert feature likely to
interest a small number of sophisticated users willing to take on more
responsibility in exchange for greater control. We should definitely add
support for that, and I expect the patch would be fairly simple and short.
But I believe the vast majority of users would like to be able to run new
and existing plain SQL on any node and see the data they just wrote, with
graceful failure modes, and without extra conceptual load or invasive code
changes. So I think we should cater for that mode of usage too.
> Also, it should be possible to configure which standbys are considered
> for waiting on. Otherwise a reporting slave will occasionally catch up
> enough to be considered "available" and then cause a latency peak when
> a long query blocks apply again.
Good point. Here's a new version which adds the GUC
causal_reads_standby_names, defaulting to '*' (but as before, the feature
is not activated until you set causal_reads_timeout). Now you can list
standby names explicitly if you want a way to exclude certain standbys.
Also, I noticed that cascaded standbys shouldn't be available for causal
reads, so I added a check for that.
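Putting the two GUCs together, enabling the feature in v2 would look something like this in the primary's postgresql.conf (a sketch based on the GUC names and defaults described above; the standby names in the comment are placeholders):

```
# primary's postgresql.conf
causal_reads_timeout = '4s'        # activates the feature
causal_reads_standby_names = '*'   # default: all standbys considered;
                                   # list names, e.g. 'standby1,standby2',
                                   # to leave a reporting standby out of
                                   # the causal reads set
```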
--
Thomas Munro
http://www.enterprisedb.com
Attachment: causal-reads-v2.patch
On 11 November 2015 at 09:22, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
> 1. Reader waits with exposed LSNs, as Heikki suggests. This is what
> BerkeleyDB does in "read-your-writes" mode. It means that application
> developers have the responsibility for correctly identifying transactions
> with causal dependencies and dealing with LSNs (or whatever equivalent
> tokens), potentially even passing them to other processes where the
> transactions are causally dependent but run by multiple communicating
> clients (for example, communicating microservices). This makes it
> difficult to retrofit load balancing to pre-existing applications and (like
> anything involving concurrency) difficult to reason about as applications
> grow in size and complexity. It is efficient if done correctly, but it is
> a tax on application complexity.
Agreed. This works if you have a single transaction connected thru a pool
that does statement-level load balancing, so it works in both session and
transaction mode.
I was in favour of a scheme like this myself, earlier, but have more
thoughts now.
We must also consider the need for serialization across sessions or
transactions.
In transaction pooling mode, an application could get assigned a different
session, so a token would be much harder to pass around.
> 2. Reader waits for a conservatively chosen LSN. This is roughly what
> MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait =
> 1" modes. Read transactions would start off by finding the current end
> of WAL on the primary, since that must be later than any commit that
> already completed, and then waiting for that to apply locally. That means
> every read transaction waits for a complete replication lag period,
> potentially unnecessarily. This is a tax on readers, with unnecessary waiting.
This tries to make it easier for users by forcing all users to experience a
causality delay. Given the whole purpose of multi-node load balancing is
performance, referencing the master again simply defeats any performance
gain, so you couldn't ever use it for all sessions. It could be a USERSET
parameter, so could be turned off in most cases that didn't need it. But
it's easier to use than (1).
Though this should be implemented in the pooler.
> 3. Writer waits, as proposed. In this model, there is no tax on readers
> (they have zero overhead, aside from the added complexity of dealing with
> the possibility of transactions being rejected when a standby falls behind
> and is dropped from 'available' status; but database clients must already
> deal with certain types of rare rejected queries/failures such as
> deadlocks, serialization failures, server restarts etc). This is a tax on
> writers.
This would seem to require that all readers must first check with the
master as to which standbys are now considered available, so it looks like
(2).
The alternative is that we simply send readers to any standby and allow the
pool to work out separately whether the standby is still available, which
mostly works, but it doesn't handle sporadic slow downs on particular
standbys very well (if at all).
I think we need to look at whether this does actually give us anything, or
whether we are missing the underlying Heisenberg reality.
More later.
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Nov 13, 2015 at 1:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 11 November 2015 at 09:22, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:1. Reader waits with exposed LSNs, as Heikki suggests. This is what
BerkeleyDB does in "read-your-writes" mode. It means that application
developers have the responsibility for correctly identifying transactions
with causal dependencies and dealing with LSNs (or whatever equivalent
tokens), potentially even passing them to other processes where the
transactions are causally dependent but run by multiple communicating
clients (for example, communicating microservices). This makes it
difficult to retrofit load balancing to pre-existing applications and (like
anything involving concurrency) difficult to reason about as applications
grow in size and complexity. It is efficient if done correctly, but it is
a tax on application complexity.
Agreed. This works if you have a single transaction connected through a pool
that does statement-level load balancing, so it works in both session and
transaction mode.
I was in favour of a scheme like this myself, earlier, but have more
thoughts now.
We must also consider the need for serialization across sessions or
transactions.
In transaction pooling mode, an application could get assigned a different
session, so a token would be much harder to pass around.
2. Reader waits for a conservatively chosen LSN. This is roughly what
MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait =
1" modes. Read transactions would start off by finding the current end
of WAL on the primary, since that must be later than any commit that
already completed, and then waiting for that to apply locally. That means
every read transaction waits for a complete replication lag period,
potentially unnecessarily. This is a tax on readers with unnecessary waiting.
This tries to make it easier for users by forcing all users to experience
a causality delay. Given the whole purpose of multi-node load balancing is
performance, referencing the master again simply defeats any performance
gain, so you couldn't ever use it for all sessions. It could be a USERSET
parameter, so could be turned off in most cases that didn't need it. But
it's easier to use than (1).
Though this should be implemented in the pooler.
3. Writer waits, as proposed. In this model, there is no tax on readers
(they have zero overhead, aside from the added complexity of dealing with
the possibility of transactions being rejected when a standby falls behind
and is dropped from 'available' status; but database clients must already
deal with certain types of rare rejected queries/failures such as
deadlocks, serialization failures, server restarts etc). This is a tax on
writers.
This would seem to require that all readers must first check with the
master as to which standbys are now considered available, so it looks like
(2).
No -- in (3), that is this proposal, standbys don't check with the primary
when you run a transaction. Instead, the primary sends a constant stream
of authorizations (in the form of keepalives sent every
causal_reads_timeout / 2 in the current patch) to the standby, allowing it
to consider itself available for a short time into the future (currently
now + causal_reads_timeout - max_tolerable_clock_skew to be specific -- I
can elaborate on that logic in a separate email). At the start of a
transaction in causal reads mode (the first call to GetTransactionSnapshot, to be
specific), the standby knows immediately without communicating with the
primary whether it can proceed or must raise the error. In the happy case,
the reader simply compares the most recently received authorization's
expiry time with the system clock and proceeds. In the worst case, when
contact is lost between primary and standby, the primary must stall
causal_reads commits for causal_reads_timeout (see CausalReadsBeginStall).
Doing that makes sure that no causal reads commit can return (see
CausalReadsCommitCanReturn) before the lost standby has definitely started
raising the error for causal_reads queries (because its most recent
authorization has expired), in case it is still alive and handling requests
from clients.
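To make the lease arithmetic above concrete, here is a minimal sketch of the standby-side check (illustrative only: the constant names, the seconds-based clock, and the function names are assumptions, not code from the patch):

```python
import time

# Hypothetical values standing in for the GUCs discussed above.
CAUSAL_READS_TIMEOUT = 4.0       # primary sends authorizations every timeout / 2
MAX_TOLERABLE_CLOCK_SKEW = 1.0

def lease_expiry(authorization_time):
    """An authorization sent at authorization_time lets the standby consider
    itself available until that time + timeout - max tolerable clock skew."""
    return authorization_time + CAUSAL_READS_TIMEOUT - MAX_TOLERABLE_CLOCK_SKEW

def can_serve_causal_read(last_authorization_time, now=None):
    """Standby-side check at transaction start: no round trip to the
    primary, just compare the cached lease expiry with the local clock."""
    if now is None:
        now = time.time()
    return now < lease_expiry(last_authorization_time)
```

If the check fails, the standby raises the error; meanwhile the primary's stall of causal reads commits for causal_reads_timeout guarantees the lease has expired everywhere before any commit the lost standby might have missed can return.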
It is not at all like (2), which introduces a conservative wait at the
start of every read transaction, slowing all readers down. In (3), readers
don't wait, they run (or are rejected) as fast as possible, but instead the
primary has to do extra things. Hence my categorization of (2) as a 'tax
on readers', and of (3) as a 'tax on writers'. The idea is that a site
with a high ratio of reads to writes would prefer zero-overhead reads.
The alternative is that we simply send readers to any standby and allow
the pool to work out separately whether the standby is still available,
which mostly works, but it doesn't handle sporadic slowdowns on particular
standbys very well (if at all).
This proposal does handle sporadic slowdowns on standbys: it drops them
from the set of available standbys if they don't apply fast enough, all the
while maintaining the guarantee. Though it occurs to me that it probably
needs some kind of defence against too much flapping between available and
unavailable (maybe some kind of back off on the 'joining' phase that
standbys go through when they transition from unavailable to available in
the current patch, which I realize I haven't described yet -- but I don't
want to get bogged down in details, while we're talking about the 30,000
foot view).
I think we need to look at whether this does actually give us anything, or
whether we are missing the underlying Heisenberg reality.
--
Thomas Munro
http://www.enterprisedb.com
On Fri, Nov 13, 2015 at 1:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 11 November 2015 at 09:22, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
1. Reader waits with exposed LSNs, as Heikki suggests. This is what
BerkeleyDB does in "read-your-writes" mode. It means that application
developers have the responsibility for correctly identifying transactions
with causal dependencies and dealing with LSNs (or whatever equivalent
tokens), potentially even passing them to other processes where the
transactions are causally dependent but run by multiple communicating
clients (for example, communicating microservices). This makes it
difficult to retrofit load balancing to pre-existing applications and (like
anything involving concurrency) difficult to reason about as applications
grow in size and complexity. It is efficient if done correctly, but it is
a tax on application complexity.
Agreed. This works if you have a single transaction connected through a pool
that does statement-level load balancing, so it works in both session and
transaction mode.
I was in favour of a scheme like this myself, earlier, but have more
thoughts now.
We must also consider the need for serialization across sessions or
transactions.
In transaction pooling mode, an application could get assigned a different
session, so a token would be much harder to pass around.
Sorry for the double reply, I just wanted to add a couple more thoughts.
As discussed elsewhere in the thread, I think it makes absolute sense to
offer some kind of support for causality tokens, but I don't see that on its
own as enough for most users. (At the least, it would be good to have
pg_wait_for_xlog_replay_location(lsn, timeout), but perhaps explicit BEGIN
syntax as suggested by Heikki, or a new field in the libpq protocol which
can be attached to any statement, and likewise for the commit LSN of
results).
It's true that a pooling system/middleware could spy on your sessions and
insert causality token handling imposing a global ordering of visibility
for you, so that naive users don't have to deal with them. Whenever it
sees a COMMIT result (assuming they are taught to return LSNs), it could
update a highest-LSN-seen variable, and transparently insert a wait for
that LSN into every transaction that it sees beginning. But then you would
have to push all your queries through a single point that can see
everything across all Postgres servers, and maintain this global high LSN.
In contrast, my writer-waits proposal makes different trade-offs to provide
causal reads as a built-in feature without an external single point
observer of all transactions.
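For illustration, the single-point bookkeeping such a middleware would need might look like the following sketch (hypothetical code; only the textual `X/X` LSN form is taken from PostgreSQL itself):

```python
# Track the highest commit LSN seen across all sessions, and gate reads
# on a standby having replayed up to it. This is the external "global
# high LSN" observer described above, not part of the proposed patch.

def parse_lsn(text):
    """Convert PostgreSQL's textual LSN form, e.g. '0/16B3748', to an int."""
    hi, lo = text.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

class GlobalLsnTracker:
    def __init__(self):
        self.highest_commit_lsn = 0

    def observe_commit(self, lsn_text):
        """Called with the commit LSN of every COMMIT result seen."""
        self.highest_commit_lsn = max(self.highest_commit_lsn,
                                      parse_lsn(lsn_text))

    def standby_is_caught_up(self, replay_lsn_text):
        """Would a read on this standby see every observed commit?"""
        return parse_lsn(replay_lsn_text) >= self.highest_commit_lsn
```

Every transaction the middleware begins on a standby would first have to wait until `standby_is_caught_up` holds there, which is exactly the single point of observation the writer-waits design avoids.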
--
Thomas Munro
http://www.enterprisedb.com
On 12 November 2015 at 18:25, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
I don't want to get bogged down in details, while we're talking about the
30,000 foot view).
Hmm, if that's where we're at, I'll summarize my thoughts.
All of this discussion presupposes we are distributing/load balancing
queries so that reads and writes might occur on different nodes.
We need a good balancer. Any discussion of this that ignores the balancer
component is only talking about half the solution. What we need to do is
decide whether functionality should live in the balancer or the core.
Your option (1) is viable, but only in certain cases. We could add support
for some token/wait mechanism but as you say, this would require
application changes not pooler changes.
Your option (2) is wider but also worse in some ways. It can be implemented
in a pooler.
Your option (3) doesn't excite me much. You've got a load of stuff that
really should happen in a pooler. And at its core we have
synchronous_commit = apply but with a timeout rather than a wait. So
anyway, consider me nudged to finish my patch to provide capability for
that by 1 Jan.
On a related note, any further things like "GUC causal_reads_standby_names"
should be implemented by Node Registry as a named group of nodes. We can
have as many arbitrary groups of nodes as we want. If that sounds strange
look back at exactly why GUCs are called GUCs.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Nov 15, 2015 at 5:41 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Hmm, if that's where we're at, I'll summarize my thoughts.
All of this discussion presupposes we are distributing/load balancing
queries so that reads and writes might occur on different nodes.
Agreed. I think that's a pretty common pattern, though certainly not
the only one.
We need a good balancer. Any discussion of this that ignores the balancer
component is only talking about half the solution. What we need to do is
decide whether functionality should live in the balancer or the core.
I'm all in favor of having a load-balancer in core, but that seems
completely unrelated to the patch at hand.
Your option (1) is viable, but only in certain cases. We could add support
for some token/wait mechanism but as you say, this would require application
changes not pooler changes.
Agreed.
Your option (2) is wider but also worse in some ways. It can be implemented
in a pooler.
Your option (3) doesn't excite me much. You've got a load of stuff that
really should happen in a pooler. And at its core we have synchronous_commit
= apply but with a timeout rather than a wait. So anyway, consider me nudged
to finish my patch to provide capability for that by 1 Jan.
I don't see how either option (2) or option (3) could be implemented
in a pooler. How would that work?
To be frank, it's starting to seem to me like you are just trying to
block this patch so you can have time to develop your own version
instead. I hope that's not the case, because it would be quite
unfair. When Thomas originally posted the patch, you complained that
"This causes every writer to wait. What we want is to isolate the wait
only to people performing a write-read sequence, so I think it should
be readers that wait. Let's have that debate up front before we start
reviewing the patch." Now, you seem to be saying that's OK, because
you want to post a patch to do exactly the same thing under the name
synchronous_commit=apply, but you want it to be your own patch,
leaving out the other stuff that Thomas has put into this one.
That could be the right thing to do, but how about we discuss it a
bit? The timeout stuff that Thomas has added here is really useful.
Without that, if a machine goes down, we wait forever. That's the
right thing to do if we're replicating to make sure transactions can
never be lost, but it's a bad idea if we're replicating for load
balancing. In the load balancing case, you want to drop sync slaves
quickly to ensure the cluster remains available, and you need them to
know they are out of sync so that the load balancer doesn't get
confused. That's exactly what is implemented here.
If you have an idea for a simpler implementation, great, but I think
we need something. I don't see how it's going to work to make it
entirely the pooler's job to figure out whether the cluster is in sync
- it needs a push from the core server. Here, that push is easy to
find: if a particular replica starts returning the "i'm out of sync"
error when you query it, then stop routing queries to that replica
until the error clears (which the pooler can find out by polling it
with some trivial query). That's a great deal more useful than
synchronous_commit=apply without such a feature.
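The routing loop described here could be sketched roughly as follows (hypothetical pooler-side code; the error class and probe query are stand-ins, not an existing API):

```python
class StandbyUnavailable(Exception):
    """Stand-in for the causal reads 'standby not available' error."""

class ReadRouter:
    def __init__(self, replicas):
        self.available = list(replicas)
        self.unavailable = []

    def run_read(self, query, execute):
        """Try available replicas; demote any that report being out of sync."""
        for replica in list(self.available):
            try:
                return execute(replica, query)
            except StandbyUnavailable:
                # Stop routing reads here until a poll succeeds again.
                self.available.remove(replica)
                self.unavailable.append(replica)
        raise RuntimeError("no causally consistent standby available")

    def poll(self, execute):
        """Periodically re-check demoted replicas with a trivial query."""
        for replica in list(self.unavailable):
            try:
                execute(replica, "SELECT 1")
            except StandbyUnavailable:
                continue
            self.unavailable.remove(replica)
            self.available.append(replica)
```

The point is that the pooler needs no understanding of replication state at all: the error itself is the push from the core server.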
On a related note, any further things like "GUC causal_reads_standby_names"
should be implemented by Node Registry as a named group of nodes. We can
have as many arbitrary groups of nodes as we want. If that sounds strange
look back at exactly why GUCs are called GUCs.
I think a node registry is a good idea, and my impression from the
session in Vienna is quite a few other hackers do, too. But I also
don't think it's remotely reasonable to make that a precondition for
accepting this patch.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Nov 15, 2015 at 11:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 12 November 2015 at 18:25, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
I don't want to get bogged down in details, while we're talking about
the 30,000 foot view).
Hmm, if that's where we're at, I'll summarize my thoughts.
All of this discussion presupposes we are distributing/load balancing
queries so that reads and writes might occur on different nodes.
We need a good balancer. Any discussion of this that ignores the balancer
component is only talking about half the solution. What we need to do is
decide whether functionality should live in the balancer or the core.
Your option (1) is viable, but only in certain cases. We could add support
for some token/wait mechanism but as you say, this would require
application changes not pooler changes.
Your option (2) is wider but also worse in some ways. It can be
implemented in a pooler.
Your option (3) doesn't excite me much. You've got a load of stuff that
really should happen in a pooler. And at its core we have
synchronous_commit = apply but with a timeout rather than a wait. So
anyway, consider me nudged to finish my patch to provide capability for
that by 1 Jan.
Just to be clear, this patch doesn't use a "timeout rather than a wait".
It always waits for the current set of available causal reads standbys to
apply the commit. It's just that nodes get kicked out of that set pretty
soon if they don't keep up, a bit like a RAID controller dropping a failing
disk. And it does so using a protocol that ensures that the dropped
standby starts raising the error, even if contact has been lost with it, so
the causal reads guarantee is maintained at all times for all clients.
On a related note, any further things like "GUC causal_reads_standby_names"
should be implemented by Node Registry as a named group of nodes. We can
have as many arbitrary groups of nodes as we want. If that sounds strange
look back at exactly why GUCs are called GUCs.
Agreed, the application_name whitelist stuff is clunky. I left it out of
the first version I posted, not wanting the focus of this proposal to be
side-tracked. But as Ants Aasma pointed out, some users might need
something like that, so I posted a 2nd version that follows the established
example, again not wanting to distract with anything new in that area. Of
course that would eventually be replaced/improved as part of a future node
topology management project.
--
Thomas Munro
http://www.enterprisedb.com
On 15 November 2015 at 14:50, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Nov 15, 2015 at 5:41 AM, Simon Riggs <simon@2ndquadrant.com>
wrote:
Hmm, if that's where we're at, I'll summarize my thoughts.
All of this discussion presupposes we are distributing/load balancing
queries so that reads and writes might occur on different nodes.
Agreed. I think that's a pretty common pattern, though certainly not
the only one.
It looks to me like this functionality is only of use in a pooler. Please
explain how else this would be used.
Your option (2) is wider but also worse in some ways. It can be
implemented
in a pooler.
Your option (3) doesn't excite me much. You've got a load of stuff that
really should happen in a pooler. And at its core we have synchronous_commit
= apply but with a timeout rather than a wait.
I don't see how either option (2) or option (3) could be implemented
in a pooler. How would that work?
My starting thought was that (1) was the only way forwards. Through
discussion, I now see that it's not the best solution for the general case.
The pooler knows which statements are reads and writes, it also knows about
transaction boundaries, so it is possible for it to perform the waits for
either (2) or (3). The pooler *needs* to know which nodes it can route
queries to, so it looks to me that the pooler is the best place to put
waits and track status of nodes, no matter when we wait. I don't see any
benefit in having other nodes keep track of node status since that will
just replicate work that *must* be performed in the pooler.
I would like to see a load balancing pooler in Postgres.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 16 November 2015 at 18:44, Simon Riggs <simon@2ndquadrant.com> wrote:
The pooler knows which statements are reads and writes
I think that's an iffy assumption. It's one we tend to make because
otherwise read/write pooling won't work, but in PostgreSQL there's really
no way to know when calling a function.
What does
SELECT get_user_stats()
do? The pooler has _no_ _idea_ unless manually configured with knowledge
about particular user defined functions.
In the absence of such knowledge it can:
- send the work to a replica and report the ERROR to the user if it fails
due to an attempted write;
- send the work to a replica, capture an ERROR due to attempted write, and
retry on the master;
- send everything it's not sure about to the master
If a pooler had insight into the catalogs and if we had readonly /
readwrite attributes on functions, it could be smarter.
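The second of those strategies, retrying on the master after a write error, might look like this sketch (hypothetical names; only SQLSTATE 25006, read_only_sql_transaction, is real PostgreSQL behavior on a hot standby):

```python
READ_ONLY_SQL_TRANSACTION = '25006'  # SQLSTATE raised for writes on a standby

class PgError(Exception):
    """Stand-in for a database error carrying a SQLSTATE."""
    def __init__(self, sqlstate):
        self.sqlstate = sqlstate

def run_with_fallback(statement, execute, replica, master):
    """Optimistically run a statement of unknown read/write status on a
    replica; if it turns out to write, rerun it on the master."""
    try:
        return execute(replica, statement)
    except PgError as e:
        if e.sqlstate == READ_ONLY_SQL_TRANSACTION:
            return execute(master, statement)
        raise
```

Retrying like this is only safe when the failed attempt had no side effects outside the database, which is one reason "send everything you're not sure about to the master" is the usual conservative default.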
I would like to see a load balancing pooler in Postgres.
Given the number of times I say "no, no, don't raise max_connections to
2000 to solve your performance problems, lower it to around 100 and put
pgbouncer in front if your application doesn't support connection pooling
internally" .... yes!
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 16 November 2015 at 11:01, Craig Ringer <craig@2ndquadrant.com> wrote:
On 16 November 2015 at 18:44, Simon Riggs <simon@2ndquadrant.com> wrote:
The pooler knows which statements are reads and writes
I think that's an iffy assumption.
It's not an assumption, it's a requirement. If it can't do this in some
manner then you can't use a load balancing pooler.
Randomly submitting things works as well, since it leads to a write error
when you try to write data on a read-only server, so you do then learn
whether it is a read or a write. Once you know it's a write, you submit it to
the master. But you still need to be careful of other effects, so that isn't
recommended.
It's one we tend to make because otherwise read/write pooling won't work,
but in PostgreSQL there's really no way to know when calling a function.
What does
SELECT get_user_stats()
do? The pooler has _no_ _idea_ unless manually configured with knowledge
about particular user defined functions.
In the absence of such knowledge it can:
- send the work to a replica and report the ERROR to the user if it fails
due to an attempted write;
- send the work to a replica, capture an ERROR due to attempted write, and
retry on the master;
- send everything it's not sure about to the master
If a pooler had insight into the catalogs and if we had readonly /
readwrite attributes on functions, it could be smarter.
pgpool supports white/black function listing for exactly this reason.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 15 November 2015 at 10:41, Simon Riggs <simon@2ndquadrant.com> wrote:
So anyway, consider me nudged to finish my patch to provide capability
for that by 1 Jan.
My earlier patch aimed to allow WALReceiver to wait on both a latch and a
socket as well as allow WALWriter to be active, so that WALReceiver could
reply more quickly and handle greater workload. As I explained previously
when we discussed that in recent posts, it is necessary infrastructure to
have anybody wait on anything higher than remote-fsync. I aim to complete
that by 1 Jan.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 17, 2015 at 12:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 15 November 2015 at 10:41, Simon Riggs <simon@2ndquadrant.com> wrote:
So anyway, consider me nudged to finish my patch to provide capability
for that by 1 Jan.
My earlier patch aimed to allow WALReceiver to wait on both a latch and a
socket as well as allow WALWriter to be active, so that WALReceiver could
reply more quickly and handle greater workload. As I explained previously
when we discussed that in recent posts, it is necessary infrastructure to
have anybody wait on anything higher than remote-fsync. I aim to complete
that by 1 Jan.
Right, handing write/fsync work off to WALWriter in standbys makes a lot of
sense for any kind of writer-waits system, so that WALReceiver doesn't
spend time in long syscalls which wouldn't play nicely with signals
(whether from 'kill' or SetLatch) and can deal with network IO with the
lowest possible latency. I would like to help test/review that, if that
could be useful.
The SIGUSR1 code in WalReceiverMain and WalRecvWakeup in this patch
works well enough for now for proof-of-concept purposes until then.
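For readers unfamiliar with the latch-plus-socket pattern, here is a user-space sketch of that kind of wait, using a pipe as a stand-in for a PostgreSQL latch (illustrative only; none of this is the WALReceiver code under discussion):

```python
import os
import selectors
import socket

def make_latch():
    """A self-pipe standing in for a latch: SetLatch is a one-byte write
    to the write end, which wakes any select() on the read end."""
    r, w = os.pipe()
    os.set_blocking(r, False)
    return r, w

def wait_for_socket_or_latch(sock, latch_read_fd, timeout):
    """Block until the network socket is readable, the latch is set, or
    the timeout elapses; return the set of reasons we woke up."""
    sel = selectors.DefaultSelector()
    sel.register(sock, selectors.EVENT_READ, 'socket')
    sel.register(latch_read_fd, selectors.EVENT_READ, 'latch')
    events = sel.select(timeout)
    sel.close()
    return {key.data for key, _ in events}
```

The point of the pattern is that a wakeup never depends on a signal interrupting a long write or fsync syscall: the waiter is always parked in a multiplexed wait that the latch can break out of immediately.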
--
Thomas Munro
http://www.enterprisedb.com