Transaction Snapshots and Hot Standby
Transaction snapshots are probably the most difficult problem for Hot
Standby to resolve. We *must* have a transaction snapshot to access
table data in a consistent way, or we must accept some inconsistency,
though that is not simple either.
We can derive transaction snapshots
* remotely from primary node
* locally on the standby node
If we derive a snapshot locally, then we will end up with a situation
where the xmin of the local snapshot precedes the xmin of the primary
node. When this occurs it will then be possible for WAL records to
arrive on the standby that request removal of rows that a transaction
might wish to see. Preventing that situation can be done by either
deferring WAL apply or by cancelling queries.
We can defer WAL apply for particular tables only, but this could
significantly complicate the apply process and is not a suggested option
for the first release of this feature.
We might control query cancellation by tracking which tables have had
rows removed that would have been visible to particular queries. Again,
possible but suffers from the problem that tables on which HOT is
frequently active would be almost unusable. So not a general solution.
Simply ignoring WAL removal has been discussed and rejected.
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00753.php
Explicitly defining the tables a transaction wishes to see has also been
discussed and rejected.
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00268.php
So the most generally applicable and simplest initial solution for
generating snapshots is to take them from the remote primary node. The
proposal for this follows:
Transaction snapshots will be managed by a new process, Snapproc.
Snapproc will start only in recovery mode and will exit when normal
processing mode begins. Snapshot process will connect to the remote
primary node and export snapshot data and copy this to shared memory on
the standby node. When all standby backends have run
UnregisterSnapshot(), the snapshot will be unregistered on the remote
primary node as well.
The standby must not think a transaction is visible until all changes
made by it have been applied. So snapshots from the primary cannot
be used on the standby until the LSN at which they were taken has been
reached by WAL apply on the standby. Snapshots don't normally have an
LSN, so we must derive that information as well when we take a Snapshot.
With asynchronous commits a transaction can be committed but not yet
written to disk, so we cannot rely on the current WAL write pointer.
Taking the WAL insert pointer is the safest thing to do, though most
heavily contended. We don't want to hold ProcArrayLock while requesting
WALInsertLock, so we will get the LSN of the WAL insert pointer *after*
the Snapshot has been derived (it doesn't matter that much, as long as
it's not before the snapshot).
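The maturity rule described above can be sketched as follows. This is a conceptual model, not PostgreSQL source; the type and function names are invented for illustration:

```python
# Illustrative sketch: a primary snapshot carries the WAL insert LSN
# read *after* the snapshot itself was derived, and the standby may
# only hand it to a backend once local WAL replay has reached that LSN.
from dataclasses import dataclass

@dataclass(frozen=True)
class PrimarySnapshot:
    xmin: int
    xmax: int
    taken_at_lsn: int   # WAL insert pointer, read after deriving the snapshot

def snapshot_is_usable(snap: PrimarySnapshot, replay_lsn: int) -> bool:
    """Usable only once all WAL up to the snapshot's LSN has been applied."""
    return replay_lsn >= snap.taken_at_lsn

snap = PrimarySnapshot(xmin=100, xmax=110, taken_at_lsn=5000)
print(snapshot_is_usable(snap, 4800))  # still replaying -> False
print(snapshot_is_usable(snap, 5000))  # caught up -> True
```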
So snapshots from the primary take time before they can be used. The
delay is exactly the current processing delay from primary to standby,
so another reason why we do not wish to fall behind.
Taking snapshots from primary has a few disadvantages
* snapshots take time before they are usable
* requesting new snapshots is via remote request
* snapshots on primary prevent row removal (but this was also an
advantage of this technique!)
If primary and standby node are connected by private ethernet then the
snapshot request time will be ~10ms, though that is probably 100 times
slower than current snapshot access. If primary and standby are opposite
sides of an ocean then times could be as high as 500ms. Taking snapshots
directly could be annoyingly slow for small read-only statements, so we
need to buffer this process in various ways. For larger queries, this
may not be a problem at all, but we do not wish to limit or assume the
uses of the standby node.
First of all, each backend will have the opportunity to reuse previous
snapshots both within transactions and across them. A userset parameter
snapshot_reuse_window = 0..60000ms will define the time window in which
any request for a new snapshot will simply result in being fed the last
snapshot again. When the window on a snapshot has expired a newer
snapshot will be presented instead. This idea is similar to serializable
transactions, which continually reuse the same snapshot. This is a
useful parameter for normal processing as well, since it will reduce
contention on the ProcArrayLock for many applications.
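The reuse-window behaviour could look something like this. A hypothetical sketch only; the class and parameter names are made up to mirror the proposed snapshot_reuse_window GUC:

```python
# Illustrative sketch of snapshot_reuse_window: within the window, a
# request for a new snapshot simply gets the previous one back, saving
# a round trip to the primary (and ProcArrayLock contention).
import time

class SnapshotCache:
    def __init__(self, reuse_window_ms: int):
        self.reuse_window_ms = reuse_window_ms
        self.last_snapshot = None
        self.last_taken_at = 0.0

    def get_snapshot(self, take_fresh):
        now = time.monotonic()
        age_ms = (now - self.last_taken_at) * 1000
        if self.last_snapshot is not None and age_ms < self.reuse_window_ms:
            return self.last_snapshot       # still in the window: reuse
        self.last_snapshot = take_fresh()   # window expired: fetch a new one
        self.last_taken_at = now
        return self.last_snapshot

cache = SnapshotCache(reuse_window_ms=500)
s1 = cache.get_snapshot(lambda: object())
s2 = cache.get_snapshot(lambda: object())
print(s1 is s2)  # second request falls inside the window -> True
```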
Snapshots can be reused across transactions in recovery mode, since they
are held in shared memory. The amount of shared memory dedicated to
storing snapshots will be max_connections * the maximum snapshot size.
Since there is a delay between obtaining a new snapshot and it becoming
usable the Snapshot process will buffer them until they become "mature",
like a good Whiskey. Snapshot process will take regular snapshots and
pre-age them so that when a backend requests a snapshot it will be given
the most recently matured snapshot. Time between snapshots is set by
snapshot_preread_timeout = 0..60000ms. If a request for a snapshot
arrives and there are no snapshots waiting to mature then this will
trigger the snapshot process to request a new primary snapshot. (This
parameter could be set automatically based upon the arrival rate of
snapshot requests, but that is something to consider later.)
If snapshot_reuse_window = 0 then a backend will be presented with a
freshly obtained snapshot and will then wait until the exact first
moment it can be used before returning.
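The buffering of pre-aged snapshots might work roughly like this. Again a hedged sketch under the assumptions above; none of these names exist in PostgreSQL:

```python
# Illustrative sketch: the snapshot process keeps recently fetched
# primary snapshots in a buffer; a backend is handed the most recently
# "matured" one, i.e. the newest whose LSN replay has already reached.
def most_recently_matured(buffered, replay_lsn):
    """buffered: list of (snapshot_id, taken_at_lsn) in arrival order."""
    mature = [s for s in buffered if s[1] <= replay_lsn]
    return mature[-1] if mature else None  # None -> request fresh and wait

buffered = [("snap-1", 1000), ("snap-2", 2000), ("snap-3", 3000)]
print(most_recently_matured(buffered, 2500))  # ('snap-2', 2000)
print(most_recently_matured(buffered, 500))   # nothing mature yet: None
```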
We can continue to reuse snapshots from the primary even if the primary
crashes, becomes disconnected or is shutdown/restarted. New snapshots
are obviously not possible until it appears again. It's not that common
for us to lose contact with the primary *and* for it to *not* be a
failover, so this seems like an acceptable restriction.
Other alternatives? Maybe.
In earlier discussions, I suggested that we might use "Read Uncommitted"
mode for use with Hot Standby. Tom pointed out that what I had suggested
was not "Read Uncommitted" as described by SQL Standard.
For Hot Standby, I've had a look at the various transaction isolation
modes possible to see which, if any, are desirable:
1. The first is to just accept that transaction *isolation* is not
possible, but carry on anyway. We might call this READ INCONSISTENT -
which carries the problems noted by Tom earlier. In this mode we can
"see" committed deletes because their row versions might be missing from
our query; updated records can disappear from queries, and newly
inserted data can remain invisible. However,
if the table is insert only or read only this will give a consistent
result, so this mode *is* appealing to many potential users. (It's been
suggested to me twice, independently.) We would need to derive a
snapshot based upon which transactions are "running" according to the
replay of transactions during recovery.
2. The second is to throw an error if transaction isolation is violated.
Currently we can only do this approximately, by recording the global LSN
at start of query and then aborting the query if we touch a data block
that has changes made after the LSN. Slightly more extreme version of
(1), but always consistent if it completes. (Somebody has joked we could
make the error message "snapshot too old").
We can implement both of those with an additional parameter,
allow_xmin_advance = off (default) | on
Mode (1) is allow_xmin_advance = on in READ COMMITTED mode
Mode (2) is allow_xmin_advance = on in SERIALIZABLE mode
3. We can also implement Read Uncommitted mode itself in the database.
We would get inconsistent results for updates and deletes *and* inserts
but at least we would always see the latest version of an updated row,
rather than skipping them completely. On a read only table this might be
very useful. On an insert only table this might result in some long
running queries as a scan struggles to keep up with inserts!
For (3) we would need to retrieve tuples from executor scan nodes in
SnapshotDirty mode. If we see a tuple we would follow its tuple chain to
another row on the same block using EvalPlanQual() logic. If we are
using an IndexScan and the update chain goes off-block then there will
be another index pointer that we (may) see to access the tuple, so we
would stop at that point. If we are using a SeqScan we should follow the
tuple chain until we see the top tuple, even if it goes off-block.
Bitmap scans would cause problems.
Deriving snapshots from the master seems the most obvious, but let's see
where the dice roll...
Other topics on Hot Standby will be covered later; this is just the
first part.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
Taking snapshots from primary has a few disadvantages
...
* snapshots on primary prevent row removal (but this was also an
advantage of this technique!)
That makes it an awful solution for high availability. A backend hung in
transaction-in-progress state in the slave will prevent row removal on
the master. Isolating the master from queries performed in the
slave is exactly the reason why people use hot standby. And running long
reporting queries in the standby is again a very typical use case.
And still we can't escape the scenario that the slave receives a WAL
record that vacuums away a tuple that's still visible according to a
snapshot used in the slave. Even with the proposed scheme, this can happen:
1. Slave receives a snapshot from master
2. A long-running transaction begins on the slave, using that snapshot
3. Network connection is lost
4. Master hits a timeout, and decides to discard the snapshot it sent to
the slave
5. A tuple visible to the snapshot is vacuumed
6. Network connection is re-established
7. Slave receives the vacuum WAL record, even though the long-running
transaction still needs the tuple.
I like the idea of acquiring snapshots locally in the slave much more.
As you mentioned, the options there are to defer applying WAL, or cancel
queries. I think both options need the same ability to detect when
you're about to remove a tuple that's still visible to some snapshot,
just the action is different. We should probably provide a GUC to
control which you want.
However, if we still want to provide the behavior that "as long as the
network connection works, the master will not remove tuples still needed
in the slave" as an option, a lot simpler implementation is to
periodically send the slave's oldest xmin to master. Master can take
that into account when calculating its own oldest xmin. That requires a
lot less communication than the proposed scheme to send snapshots back
and forth. A softer version of that is also possible, where the master
obeys the slave's oldest xmin, but only up to a point.
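The "softer version" could be sketched like this. A hypothetical model only; the function and the XID-lag cap are invented for illustration:

```python
# Illustrative sketch: the master honours the slave's reported oldest
# xmin when computing its cleanup horizon, but only up to a bounded
# amount of extra bloat, expressed here as a maximum XID lag.
def effective_oldest_xmin(master_xmin, slave_xmin, max_xid_lag):
    """Hold back cleanup for the slave, but never by more than
    max_xid_lag transactions behind the master's own horizon."""
    floor = master_xmin - max_xid_lag
    return max(min(master_xmin, slave_xmin), floor)

print(effective_oldest_xmin(1000, 990, 100))   # slave honoured: 990
print(effective_oldest_xmin(1000, 300, 100))   # capped at 1000 - 100 = 900
```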
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-09-11 at 09:24 +0300, Heikki Linnakangas wrote:
I like the idea of acquiring snapshots locally in the slave much more.
As you mentioned, the options there are to defer applying WAL, or cancel
queries.
More exotic ways to defer applying WAL include using some smart
filesystems to get per-backend data snapshots, using either
copy-on-write overlay filesystems or filesystem- or disk-level
snapshots.
At least disk-level snapshots exist in SANs, with the aim of easing
backups, though I'm not sure they are effective for the intended hot
standby use.
Using any of those needs detecting and bypassing shared buffers if they
hold "too new" data pages and reading these pages directly from disk
snapshot.
I think both options need the same ability to detect when
you're about to remove a tuple that's still visible to some snapshot,
just the action is different. We should probably provide a GUC to
control which you want.
We probably need to have two LSNs per page to make maximal use of our
MVCC in a Hot Standby situation, so we can distinguish additions to a
page, which imply no data loss, from row removal, which does. Currently
only Vacuum and HOT pruning can cause row removal.
However, if we still want to provide the behavior that "as long as the
network connection works, the master will not remove tuples still needed
in the slave" as an option, a lot simpler implementation is to
periodically send the slave's oldest xmin to master. Master can take
that into account when calculating its own oldest xmin. That requires a
lot less communication than the proposed scheme to send snapshots back
and forth. A softer version of that is also possible, where the master
obeys the slave's oldest xmin, but only up to a point.
That point could be statement_timeout or the (currently missing)
transaction_timeout.
Also, the decision to advance xmin should probably be sent to the slave as well,
even though it is not something that is needed in local WAL logs.
--------------
Hannu
On Thu, 2008-09-11 at 09:24 +0300, Heikki Linnakangas wrote:
Simon Riggs wrote:
Taking snapshots from primary has a few disadvantages
...
* snapshots on primary prevent row removal (but this was also an
advantage of this technique!)
That makes it an awful solution for high availability.
Please be careful about making such statements. People might think you
were opposing the whole idea of Hot Standby, rather than giving an
opinion about one suggestion out of many implementation proposals.
Looks like you've got some good additional suggestions later in the
post. I'll reply later to those, so thanks for that.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Heikki Linnakangas wrote:
Simon Riggs wrote:
Taking snapshots from primary has a few disadvantages
...
* snapshots on primary prevent row removal (but this was also an
advantage of this technique!)
That makes it an awful solution for high availability. A backend hung in
transaction-in-progress state in the slave will prevent row removal on
the master. Isolating the master from queries performed in the
slave is exactly the reason why people use hot standby. And running long
reporting queries in the standby is again a very typical use case.
I have to say I agree with Heikki here. Blocking the master based on
what the slave is doing seems to make hot standby less useful than warm.
I like the idea of acquiring snapshots locally in the slave much more.
It's the option that I can see people (well, me) understanding the
easiest. All the others sound like ways to get things wrong. As for
inconsistent query-results - that way madness lies. How on earth will
anyone be able to diagnose or report bugs when they occur?
As you mentioned, the options there are to defer applying WAL, or cancel
queries. I think both options need the same ability to detect when
you're about to remove a tuple that's still visible to some snapshot,
just the action is different. We should probably provide a GUC to
control which you want.
I think there's only one value here: "hot standby wal delay time before
cancelling query". Might be a shorter name.
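That single knob could drive the decision roughly as follows. A hedged sketch; the parameter name here is invented for illustration, not an actual GUC:

```python
# Illustrative sketch: when replay wants to remove a tuple still visible
# to a standby query, apply is deferred up to a configured limit, after
# which the conflicting query is cancelled instead.
def resolve_conflict(delay_so_far_ms, max_standby_delay_ms):
    if delay_so_far_ms < max_standby_delay_ms:
        return "defer WAL apply"
    return "cancel conflicting query"

print(resolve_conflict(2000, 30000))   # within budget: defer WAL apply
print(resolve_conflict(45000, 30000))  # budget exceeded: cancel query
```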
--
Richard Huxton
Archonet Ltd
On Thu, 2008-09-11 at 11:11 +0100, Richard Huxton wrote:
I have to say I agree with Heikki here. Blocking the master based on
what the slave is doing seems to make hot standby less useful than warm.
I agree also, that's why I flagged it up for discussion.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
On Thu, 2008-09-11 at 11:11 +0100, Richard Huxton wrote:
I have to say I agree with Heikki here. Blocking the master based on
what the slave is doing seems to make hot standby less useful than warm.
I agree also, that's why I flagged it up for discussion.
So as far as I can see there are basically two options here. Either
a) transactions with live snapshots on the slave prevent the master from being
able to vacuum tuples (which defeats the purpose of having a live standby
server for some users).
or
b) vacuum on the server which cleans up a tuple the slave has in scope has to
block WAL replay on the slave (which I suppose defeats the purpose of having
a live standby for users concerned more with fail-over latency).
Is there any middle ground or brilliant ways to get the best of both worlds?
If not it seems to me the latter is preferable since at least the consequence
of having a long-running query on the slave occurs on the same machine running
the query. And the work-around -- killing the long-running query -- requires
taking action on the same machine as the consequences. Also, when you take
action it fixes the problem immediately as WAL replay can recommence, which
seems like a better deal than a bloated database.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
On Thu, Sep 11, 2008 at 7:18 AM, Gregory Stark <stark@enterprisedb.com> wrote:
a) transactions with live snapshots on the slave prevent the master from being
able to vacuum tuples (which defeats the purpose of having a live standby
server for some users).
or
b) vacuum on the server which cleans up a tuple the slave has in scope has to
block WAL replay on the slave (which I suppose defeats the purpose of having
a live standby for users concerned more with fail-over latency).
There is nothing stopping you from setting up two (or more) slave
servers, with one designated as failover that doesn't serve queries,
right? Option b seems pretty reasonable to me, although I'd prefer to
block wal replay vs canceling queries...although it might be nice to
manually be able to force wal replay 'with query cancel' via a
checkpoint.
merlin
Merlin Moncure wrote:
There is nothing stopping you from setting up two (or more) slave
servers, with one designated as failover that doesn't serve queries,
right?
I'd imagine that even if applying the WAL on the slave is blocked, it's
still streamed from the master to the slave, and in case of failover the
slave will fast-forward before starting up as the new master. Of course,
if it has fallen 3 days behind because of a giant reporting query, it
can take a while to replay all the WAL that has accumulated.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-09-11 at 15:23 +0300, Heikki Linnakangas wrote:
I'd imagine that even if applying the WAL on the slave is blocked, it's
still streamed from the master to the slave, and in case of failover the
slave will fast-forward before starting up as the new master.
Which begs the question: what happens to a query which is running on
the slave at the moment when the slave switches out of recovery mode and
starts up? Should the running queries be canceled if they are blocking
the applying of WAL, to allow start-up, or should we let them finish?
Cheers,
Csaba.
Gregory Stark wrote:
b) vacuum on the server which cleans up a tuple the slave has in scope has to
block WAL reply on the slave (which I suppose defeats the purpose of having
a live standby for users concerned more with fail-over latency).
One problem with this, BTW, is that if there's a continuous stream of
medium-length transactions in the slave, each new snapshot taken will
prevent progress in the WAL replay, so the WAL replay will advance in
"baby steps", and can fall behind indefinitely. As soon as there's a
moment that there's no active snapshot, it can catch up, but if the
slave is seriously busy, that might never happen.
Nevertheless, I think it's a much nicer approach.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Csaba Nagy wrote:
On Thu, 2008-09-11 at 15:23 +0300, Heikki Linnakangas wrote:
I'd imagine that even if applying the WAL on the slave is blocked, it's
still streamed from the master to the slave, and in case of failover the
slave will fast-forward before starting up as the new master.
Which begs the question: what happens with a query which is running on
the slave in the moment when the slave switches from recovery mode and
starts up ? Should the running queries be canceled if they are blocking
applying of WAL, to allow start-up, or let them finish ?
Depends on application, I'd say. I guess we'll need both, like the smart
and fast shutdown modes.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote:
One problem with this, BTW, is that if there's a continuous stream of
medium-length transactions in the slave, each new snapshot taken will
prevent progress in the WAL replay, so the WAL replay will advance in
"baby steps", and can fall behind indefinitely.
Why would it fall behind indefinitely ? It only should fall behind to
the "blocking horizon", which should be the start of the longest
currently running transaction... which should be continually advancing
and not too far in the past if there are only medium length transactions
involved. Isn't normal WAL recovery also doing baby-steps, one WAL
record at a time? ;-)
Cheers,
Csaba.
On Thu, Sep 11, 2008 at 2:07 AM, Simon Riggs wrote:
Transaction snapshots are probably the most difficult problem for Hot
Standby to resolve.
* remotely from primary node
* locally on the standby node
If we derive a snapshot locally, then we will end up with a situation
where the xmin of the local snapshot precedes the xmin of the primary
node. When this occurs it will then be possible for WAL records to
arrive on the standby that request removal of rows that a transaction
might wish to see. Preventing that situation can be done by either
deferring WAL apply or by cancelling queries.
Which operations can request row removal? Isn't that just specific
operations that have their own 'is this safe to remove' calculations
anyway (i.e. vacuum and HOT prune)?
What I am thinking about is a design where the primary node were to
regularly push an OldestXMin into the WAL, the WAL apply process on
the standby nodes pushes it into shared memory and the backends keep
an OldestMasterXMin in shared memory. The standby nodes then regularly
pushes back the oldest OldestMasterXMin from all backends to the
master. Vacuum and HOT prune could then base their calculations on an
OldestXMin that is not the OldestXMin of the master itself, but of the
master and the standby nodes. That way removal of records that are
still visible on one of the standby nodes is prevented on the master
instead of worked around on the standby nodes.
The obvious downside would be bloat on the master (which could get out
of hand if a slave is a few days behind due to a long report), but I
think in terms of visibility and consistency this would work. Or am I
completely misunderstanding the problem?
Jochem
Csaba Nagy wrote:
On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote:
One problem with this, BTW, is that if there's a continuous stream of
medium-length transactions in the slave, each new snapshot taken will
prevent progress in the WAL replay, so the WAL replay will advance in
"baby steps", and can fall behind indefinitely.Why would it fall behind indefinitely ? It only should fall behind to
the "blocking horizon", which should be the start of the longest
currently running transaction... which should be continually advancing
and not too far in the past if there are only medium length transactions
involved.
Well, yes, but you can fall behind indefinitely that way. Imagine that
each transaction on the slave lasts, say 10 minutes, with a new
transaction starting every 5 minutes. On the master, there's a table
that's being vacuumed (or HOT-updated) frequently, say after each
transaction for simplicity. What can happen is that every transaction
that finishes on the slave will only let the WAL replay advance by one
XID before blocking on the snapshot of the next slave transaction. The
WAL replay will advance at a rate of 0.2 TPM, while the master is
generating 1.0 TPM.
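The arithmetic in this example can be checked with a tiny model (illustrative only; the variable names are made up):

```python
# Illustrative check: a slave transaction finishes every 5 minutes, each
# finish lets replay advance by one XID, while the master commits one
# transaction per minute -- so the slave falls behind at a steady rate.
slave_txn_interval_min = 5        # a transaction finishes every 5 minutes
xids_released_per_finish = 1      # each finish unblocks one XID of replay
replay_rate_tpm = xids_released_per_finish / slave_txn_interval_min
master_rate_tpm = 1.0
print(replay_rate_tpm)                    # 0.2 XID/min replayed
print(master_rate_tpm - replay_rate_tpm)  # lag grows at 0.8 XID/min
```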
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
I'd imagine that even if applying the WAL on the slave is blocked, it's
still streamed from the master to the slave, and in case of failover the
slave will fast-forward before starting up as the new master. Of course, if
it has fallen 3 days behind because of a giant reporting query, it can take
a while to replay all the WAL that has accumulated.
Yes, and meanwhile, any other queries that are running on that box are
seeing three-day-old data as well.
In an ideal world, it would be nice if the slave could keep tuples
around even after they are dead and vacuumed on the master. Pushing
Xmin from the slave to the master creates the possibility of bloating
the master due to activity on the slave, which is not going to improve
reliability. OTOH, not pushing Xmin leads to several pathological
query behaviors on the slave: cancelled queries, inconsistent reads,
and falling way behind on WAL application. Either way, it seems to me
a massive and uncomfortable violation of the POLA.
If it were possible for tuples that had been vacuumed on the master to
stick around on the slave for as long as the slave still needed them,
then you'd have the best of both worlds, but I have a feeling
someone's going to say that that's just about impossible to implement.
Against that, all I can say is that neither of the behaviors
described thus far sounds very appealing as a feature, though I'm
certain there are some people who, with sufficient jiggering, could
make effective use of them.
...Robert
On Thu, 2008-09-11 at 16:19 +0300, Heikki Linnakangas wrote:
Well, yes, but you can fall behind indefinitely that way. Imagine that
each transaction on the slave lasts, say 10 minutes, with a new
transaction starting every 5 minutes. On the master, there's a table
that's being vacuumed (or HOT-updated) frequently, say after each
transaction for simplicity. What can happen is that every transaction
that finishes on the slave will only let the WAL replay advance by one
XID before blocking on the snapshot of the next slave transaction. The
WAL replay will advance at a rate of 0.2 TPM, while the master is
generating 1.0 TPM.
Aha, now I see where I was mistaken... I thought in terms of time and
not transaction IDs. So the time distance between the slave transactions
does not matter at all, only the distance in recovered XIDs matter for
the "blocking horizon"... and if the WAL recovery is blocked, the
"blocking horizon" is stalled as well, so the next transaction on the
slave will in fact require the same "blocking horizon" as all currently
running ones. Now I got it... and that means in fact that if you have
continuously overlapping small transactions, the "blocking horizon"
could even be blocked forever, as there'll always be a query running,
and the new queries will always have the snapshot of the currently
running ones because WAL recovery is stalled... or at least that's what
I understand from the whole thing...
Cheers,
Csaba.
On Thursday 11 September 2008, Heikki Linnakangas wrote:
Well, yes, but you can fall behind indefinitely that way. Imagine that
each transaction on the slave lasts, say 10 minutes, with a new
transaction starting every 5 minutes. On the master, there's a table
that's being vacuumed (or HOT-updated) frequently, say after each
transaction for simplicity. What can happen is that every transaction
that finishes on the slave will only let the WAL replay advance by one
XID before blocking on the snapshot of the next slave transaction. The
WAL replay will advance at a rate of 0.2 TPM, while the master is
generating 1.0 TPM.
What would forbid the slave from choosing to replay all currently lagging WAL
each time it's given the chance to advance a little?
--
dim
On Thu, 2008-09-11 at 15:33 +0200, Dimitri Fontaine wrote:
What would forbid the slave from choosing to replay all currently lagging WAL
each time it's given the chance to advance a little?
Well now that I think I understand what Heikki meant, I also think the
problem is that there's no choice at all to advance, because the new
queries will simply have the same snapshot as currently running ones as
long as WAL replay is blocked... further blocking the WAL replay. When
saying this I suppose that the snapshot is in fact based on the last
recovered XID, and not on any slave-local XID. In that case once WAL
recovery is blocked, the snapshot is stalled too, further blocking WAL
recovery, and so on...
Cheers,
Csaba.
On Thursday 11 September 2008, Csaba Nagy wrote:
Well now that I think I understand what Heikki meant, I also think the
problem is that there's no choice at all to advance, because the new
queries will simply have the same snapshot as currently running ones as
long as WAL replay is blocked... further blocking the WAL replay. When
saying this I suppose that the snapshot is in fact based on the last
recovered XID, and not on any slave-local XID. In that case once WAL
recovery is blocked, the snapshot is stalled too, further blocking WAL
recovery, and so on...
Well, it may be possible to instruct the WAL replay daemon to stop being
polite sometimes: when a given max_lag_delay is reached, it could take
locks and, as soon as it obtains them, replay all remaining WAL.
The max_lag_delay would be a GUC allowing one to set the threshold
between continuing the replay and running queries.
There could be some other nice ideas without inventing yet another GUC,
but this one was easier for me to think about ;)
--
dim