Deriving Recovery Snapshots

Started by Simon Riggs over 17 years ago. 30 messages. pgsql-hackers list.
#1 Simon Riggs
simon@2ndQuadrant.com

I've worked out what I think is a workable, efficient process for
deriving snapshots during recovery. I will be posting a patch to show
how this works tomorrow [Wed 15 Oct], just doing cleanup now.

Recovery Snapshots are snapshots taken during recovery. They are valid
snapshots in all ways for testing visibility. Assembling the information
to allow snapshots to be taken operates differently in recovery than it
does in normal processing.

The first thing to realise is that in recovery the only knowledge of
what occurs is through WAL records. If it isn't in the WAL, we don't
know about it. Most of the time that also means we can ignore events
that we know occurred, for example data reads.

In order to build the recovery snapshot data we need to re-create events
from WAL data. In some cases we have to add new WAL records to ensure
that all possible information is present.

Each backend that existed on the master is represented by a PROC
structure in the ProcArray. These are known as "recovery procs" and are
similar to the dummy procs used for prepared transactions. All recovery
procs are "owned by" the Startup process. So there is no process for
which MyProc points to one of the recovery procs. This structure allows
us to record the top-level transactions and then put subtransactions in
the proc's subtransaction cache. A fixed one-to-one correspondence
allows efficient maintenance of the structures. We emulate all
transactional backends, including autovacuum.
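
As a rough sketch of the structure (names like RecoveryProcs and
InitRecoveryProcs are illustrative, not taken from the patch):

/*
 * Sketch only: one PGPROC per backend slot on the master, all owned
 * by the Startup process. Replay fills in xid/subxids as WAL arrives.
 */
static PGPROC *RecoveryProcs;   /* shmem array, one entry per slot */

static void
InitRecoveryProcs(int maxBackendsOnMaster)
{
    int     i;

    for (i = 0; i < maxBackendsOnMaster; i++)
    {
        PGPROC     *proc = &RecoveryProcs[i];

        MemSet(proc, 0, sizeof(PGPROC));
        proc->xid = InvalidTransactionId;   /* no xact seen yet */
        proc->subxids.nxids = 0;            /* empty subxid cache */
        /* master procarray slot i maps one-to-one to this proc */
    }
}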

So in Hot Standby mode we have one set of Recovery Procs emulating what
happened on the master, and another set running read only work.

We maintain information according to certain events on the master.
1. xid assignment (top-level xids)
2. xid assignment (subtransactions)
3. xid commit
4. xid subcommit
5. xid abort/subabort
6. backends which have FATAL errors but write no abort record.

(3) and (5) are easy because we already have WAL records for them.
For (3) we already updated clog from the WAL record, so we just need to
identify the proc and then set the xid.

(4) is completely removed by re-arranging subcommit so it is covered by
commit. (Atomic subxid patch)

(6) is a problem since this event can make transactions disappear
from WAL completely. If a backend dies immediately after xid
assignment, it may exit without ever writing a WAL record at all.
We handle this in two ways.
* First, we keep a PROC for each backendid. Notice that means we keep a
PROC for each slot in the master's procarray, not for each pid. So if a
backend explodes and then someone reconnects using that same procarray
slot we will know that the previous transaction on that slot has
completed. This is a subtle but important point: without the ability to
infer certain transactions are complete we would need to keep track of a
potentially unlimited number of xids. Tying transactions to proc slots
means we will never have more than a fixed number of missing xids to
deal with.
* The backend slot may not be reused for some time, so we should take
additional actions to keep state current and true. So we choose to log a
snapshot from the master into WAL after each checkpoint. This can then
be used to cleanup any unobserved xids. It also provides us with our
initial state data, see later.
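
As a sketch of that cleanup (type and helper names here are
hypothetical, not from the patch):

/*
 * When a snapshot record logged after a checkpoint on the master is
 * replayed, any unobserved xid the snapshot proves complete (older
 * than its xmax and not listed as running) can be discarded.
 */
static void
PruneUnobservedXids(MasterSnapshot *snap)   /* hypothetical type */
{
    int     i,
            keep = 0;

    for (i = 0; i < nUnobservedXids; i++)
    {
        TransactionId xid = UnobservedXids[i];

        /* keep only xids the snapshot still shows as possibly running */
        if (!TransactionIdPrecedes(xid, snap->xmax) ||
            XidIsListedAsRunning(snap, xid))    /* hypothetical helper */
            UnobservedXids[keep++] = xid;
    }
    nUnobservedXids = keep;
}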

(1), (2) xid assignment doesn't appear in WAL. Writing WAL records for
each xid assigned would have a catastrophic effect on performance,
especially if we realise that we would have to do that while holding
XidGenLock. So we have to do lots of press-ups to avoid it, in the
following ways:

We put a flag on the first WAL record written by a new transaction.
(Actually we mark the first WAL record containing the new xid, which
isn't always the first WAL record in the transaction. Weird, huh? Think
Oids, Storage files etc). We add an extra xid onto the WAL record to
hold the parent xid, and use that to maintain subtrans.
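
In outline, the extra header fields (named as in the "overheads" list
at the end of this mail) and their use during replay look something
like this; the flag name is made up:

typedef struct XLogRecord
{
    /* ... existing fields, including TransactionId xl_xid ... */
    uint16          xl_info2;   /* flag bits plus master slotid */
    TransactionId   xl_xid2;    /* parent of xl_xid, if any */
} XLogRecord;

/* during replay, roughly: */
if ((record->xl_info2 & XLR2_NEW_XID) &&    /* "first record for xid" */
    TransactionIdIsValid(record->xl_xid2))
    SubTransSetParent(record->xl_xid, record->xl_xid2);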

This works partially but not completely. It is possible for a
transaction to start a very large number of subtransactions before any
part of the transaction writes WAL. We only have space on the WAL record
for one additional xid. Each subxid must record its immediate parent's
xid in subtrans, so if we assign more than one *subtransaction* at a
time we *must* then write a WAL record for all the xids assigned apart
from the last one.

So that only affects transactions which use two or more subtransactions
in a transaction *and* which insist on starting subtransactions before
anything has been written, so not very common. So AssignTransactionId()
sometimes needs to write WAL records.
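
A sketch of that rule inside AssignTransactionId() (the bookkeeping
variables are illustrative; XLOG_XACT_ASSIGNMENT is the record type
discussed later in this thread):

    /*
     * If this is a subxact and earlier assignments in this transaction
     * have not reached WAL yet, log those (child, parent) xid pairs
     * now; only the current (last) assignment may stay unlogged.
     */
    if (isSubXact && nUnloggedAssignments > 0)
    {
        XLogRecData rdata;

        rdata.data = (char *) unloggedXidPairs;
        rdata.len = nUnloggedAssignments * 2 * sizeof(TransactionId);
        rdata.buffer = InvalidBuffer;
        rdata.next = NULL;
        (void) XLogInsert(RM_XACT_ID, XLOG_XACT_ASSIGNMENT, &rdata);
        nUnloggedAssignments = 0;
    }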

Another problem is that xids flagged on WAL records don't arrive in WAL
in the order they were assigned. So we must cope with out-of-order or
"unobserved xids". When we replay WAL, we keep track of UnobservedXids
in a shared memory array. These UnobservedXids are added onto any
recovery Snapshot taken iff they are earlier than latestCompletedXid. So
in the typical case, no xids will be added to the snapshots. For now, I
do all this work holding ProcArrayLock, but there seems scope to
optimise that also. Later.

UnobservedXids is maintained as a sorted array. This comes for free
since xids are always added in xid assignment order. This allows xids to
be removed via bsearch when WAL records arrive for the missing xids. It
also allows us to stop searching for xids once we reach
latestCompletedXid.
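
In rough code (names illustrative; the comparator assumes all tracked
xids lie within the normal wraparound window, which holds here):

static TransactionId *UnobservedXids;   /* shmem, 2 * MaxBackends slots */
static int  nUnobservedXids = 0;

static int
xidComparator(const void *a, const void *b)
{
    TransactionId xa = *(const TransactionId *) a;
    TransactionId xb = *(const TransactionId *) b;

    if (TransactionIdPrecedes(xa, xb))
        return -1;
    if (TransactionIdPrecedes(xb, xa))
        return 1;
    return 0;
}

static void
UnobservedXidAdd(TransactionId xid)
{
    /* xids arrive in assignment order, so appending keeps it sorted */
    UnobservedXids[nUnobservedXids++] = xid;
}

static void
UnobservedXidRemove(TransactionId xid)
{
    TransactionId *hit = bsearch(&xid, UnobservedXids, nUnobservedXids,
                                 sizeof(TransactionId), xidComparator);

    if (hit != NULL)
    {
        /* close the gap, preserving sort order */
        memmove(hit, hit + 1,
                (UnobservedXids + nUnobservedXids - hit - 1) *
                sizeof(TransactionId));
        nUnobservedXids--;
    }
}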

As a result of the AssignTransactionId WAL records we know that each
backend will only ever allocate at most 2 xids before notifying WAL in
some way, either by flagging a WAL entry it makes or by making an entry
when assigning the new xids. As a result the UnobservedXids array will
never overflow if it has 2 * MaxBackends entries. (I've added code, but
#ifdef'd it out).

Since UnobservedXids can never overflow, we also see that the Snapshot
can never overflow *because* of UnobservedXids. Each unobserved
top-level xid leaves space for 65 xids (one top-level slot plus its
64-entry subxid cache), yet we need at most 2 of those to add its
unobserved xids.

I've had to change the way XidInMVCCSnapshot() works. We search the
snapshot even if it has overflowed. This is actually a performance win
in cases where only a few xids have overflowed but most haven't. This is
essential because if we were forced to check in subtrans *and*
unobservedxids existed then the snapshot would be invalid. (I could have
made it this way *just* in recovery, but the change seems better both
ways).
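
The changed lookup order, in outline (a sketch of the idea, not the
actual tqual.c diff; field names are illustrative):

static bool
XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
    int     i;

    /* search the cached subxids first, even if the cache overflowed;
     * unobserved xids live here, and subtrans knows nothing of them */
    for (i = 0; i < snapshot->subxcnt; i++)
    {
        if (TransactionIdEquals(xid, snapshot->subxip[i]))
            return true;
    }

    /* only now fall back to subtrans for subxacts we lost track of */
    if (snapshot->suboverflowed)
    {
        xid = SubTransGetTopmostTransaction(xid);
        if (TransactionIdPrecedes(xid, snapshot->xmin))
            return false;
    }

    for (i = 0; i < snapshot->xcnt; i++)
    {
        if (TransactionIdEquals(xid, snapshot->xip[i]))
            return true;
    }

    return false;
}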

So that's how we maintain info required for Snapshots, but the next part
of the puzzle is how we handle the initial state. Again, subtransactions
are a pain because there can be an extremely large number of them. So
taking a snapshot and copying it to WAL is insufficient. We handle this
by taking a snapshot when we have performed pg_start_backup() (in the
checkpoint we already take) and then taking another snapshot after each
checkpoint. Doing it that way means wherever we restart from we always
have an initial state record close to hand. On the standby, if the first
snapshot we see has overflowed then we wait for a snapshot to arrive
which has not overflowed. (We could also wait for a snapshot whose xmin
is later than the xmax of our first snapshot.)
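
In outline (names illustrative):

/* on replaying a logged master snapshot during startup: */
if (!snap->suboverflowed)
    InitRecoverySnapshotState(snap);    /* usable immediately */
else if (haveFirstSnapshot &&
         TransactionIdFollows(snap->xmin, firstSnapshotXmax))
    InitRecoverySnapshotState(snap);    /* overflowed xacts all gone */
/* otherwise keep replaying and wait for a later snapshot record */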

This means that there could be a delay in starting Hot Standby mode *if*
we are heavily using subtransactions at the time we take backup.

So overheads of the patch are:
* WAL record extended to completely fill 8-byte alignment; extra 4 bytes
per record on 4-byte alignment. Additional items are:
uint16 xl_info2;
TransactionId xl_xid2;
This takes no additional space on 64-bit servers because of previous
wastage.

* AssignTransactionId must WAL log xid assignment when making multiple
assignments.

* bgwriter writes Snapshot data to WAL after each checkpoint.

* additional shared memory:
2 * MaxBackends * sizeof(TransactionId) for UnobservedXids
1 * MaxBackends * sizeof(PGPROC) for RecoveryProcs

* additional processing time during recovery to maintain snapshot info

In the current patch I've put the slotid and flag bits into a uint16. That
means we can manage up to 4096 connections without writing any
additional WAL data. Beyond that we need to write a WAL record for each
AssignTransactionId(), plus add 4 bytes onto each Commit/Abort record.
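
The packing itself is just mask-and-shift; the bit positions below are
illustrative, but 12 slotid bits are exactly what gives the 4096 figure
(2^12 = 4096):

#define XLR2_SLOTID_MASK    0x0FFF  /* low 12 bits: slotid */
#define XLR2_FLAG_MASK      0xF000  /* high 4 bits: flags */

/* flags assumed already shifted into the high 4 bits */
uint16  xl_info2 = flags | (slotid & XLR2_SLOTID_MASK);
int     slot = xl_info2 & XLR2_SLOTID_MASK;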

Note that because of the atomic subxids changes we actually write fewer
WAL records in most cases than we did before and they occupy the same
space they did before.

I'll post patch tomorrow and at least weekly after this.

Patch footprint looks like this prior to cleanup.

backend/access/transam/varsup.c | 52 -!
backend/access/transam/xact.c | 559 ++++++++++++++++++++++!!!!!
backend/access/transam/xlog.c | 49 +-
backend/postmaster/bgwriter.c | 11
backend/storage/ipc/procarray.c | 721 +++++++++++++++++++++++++++++!!!
backend/storage/lmgr/proc.c | 107 +++++
backend/utils/time/tqual.c | 27 !
include/access/heapam.h | 2
include/access/htup.h | 2
include/access/transam.h | 2
include/access/xact.h | 23 +
include/access/xlog.h | 44 +!
include/access/xlog_internal.h | 2
include/catalog/pg_control.h | 3
include/storage/proc.h | 4
include/storage/procarray.h | 14
include/utils/snapshot.h | 65 +++
17 files changed, 1432 insertions(+), 44 deletions(-), 211 mods(!)

Your comments are welcome, especially questions and thoughts around the
correctness of the approach. Lots more comments in patch.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#2 Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#1)
Re: Deriving Recovery Snapshots

On Tue, 2008-10-14 at 18:50 +0100, Simon Riggs wrote:

I've worked out what I think is a workable, efficient process for
deriving snapshots during recovery. I will be posting a patch to show
how this works tomorrow [Wed 15 Oct], just doing cleanup now.

How will this interact with an idea like this?:
http://archives.postgresql.org/pgsql-hackers/2008-01/msg00400.php

I've had to change the way XidInMVCCSnapshot() works. We search the
snapshot even if it has overflowed. This is actually a performance win
in cases where only a few xids have overflowed but most haven't. This is
essential because if we were forced to check in subtrans *and*
unobservedxids existed then the snapshot would be invalid. (I could have
made it this way *just* in recovery, but the change seems better both
ways).

I don't entirely understand this. Can you explain the situation that
would result in an invalid snapshot?

Regards,
Jeff Davis

#3 Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#2)
Re: Deriving Recovery Snapshots

On Wed, 2008-10-15 at 12:58 -0700, Jeff Davis wrote:

On Tue, 2008-10-14 at 18:50 +0100, Simon Riggs wrote:

I've worked out what I think is a workable, efficient process for
deriving snapshots during recovery. I will be posting a patch to show
how this works tomorrow [Wed 15 Oct], just doing cleanup now.

How will this interact with an idea like this?:
http://archives.postgresql.org/pgsql-hackers/2008-01/msg00400.php

pg_snapclone should work fine, since it is orthogonal to this work.

I've had to change the way XidInMVCCSnapshot() works. We search the
snapshot even if it has overflowed. This is actually a performance win
in cases where only a few xids have overflowed but most haven't. This is
essential because if we were forced to check in subtrans *and*
unobservedxids existed then the snapshot would be invalid. (I could have
made it this way *just* in recovery, but the change seems better both
ways).

I don't entirely understand this. Can you explain the situation that
would result in an invalid snapshot?

In recovery the snapshot consists of two sets of xids:
* ones we have seen as running e.g. xid=43
* ones we know exist, but haven't seen yet (e.g. xid=42)
(I call this latter kind Unobserved Transactions).

Both kinds of xids *must* be in the snapshot for MVCC to work.

The current way of checking snapshots is to say "if *any* of the running
transactions has overflowed, check subtrans".

Unobserved transactions are not in subtrans, so if you checked for them
there you would fail to find them. Currently we assume that means it is
a top-level transaction and then check the top-level xids.

Why are unobserved transactions not in subtrans? Because they are
unobserved, so we can't assign their parent xid. (By definition, because
they are unobserved).

There isn't always enough space in the snapshot to allow all the
unobserved xids to be added as if they were top-level transactions, so
we put them into the snapshot's subxid cache as a secondary location and then
change the algorithm in XidInMVCCSnapshot(). We don't want to increase
the size of the snapshot because it already contains wasted space in
the subxid cache, nor do we wish to throw errors when people try to take
snapshots.

The XidInMVCCSnapshot() changes make sense by themselves for most cases,
since we don't want one transaction to cause us to thrash subtrans, as
happened in 8.1.

This took me some time to think through...

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#4 Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#1)
Re: Deriving Recovery Snapshots

On Tue, 2008-10-14 at 18:50 +0100, Simon Riggs wrote:

I've worked out what I think is a workable, efficient process for
deriving snapshots during recovery. I will be posting a patch to show
how this works tomorrow [Wed 15 Oct], just doing cleanup now.

OK, here's the latest patch. Found a bug late last night, fixed now.

This patch derives snapshot data during recovery. It contains all code
to write and read back WAL changes etc.. It is the "main patch" for Hot
Standby. Everything else hangs around this, enhances it or puts in
additional checks.

Having said that, this patch doesn't let you connect and run queries.
I've written this part as a standalone patch, for easier reviewing. So
it is still a "WIP" patch.

The patch can be tested by running a workload, then stopping the
server with -m immediate and watching the replay log (or gdb).
Successfully tested with some custom subtransaction scripts and lots of
tests using make installcheck-parallel, then crash recovery.

diffstat
backend/access/transam/slru.c | 16
backend/access/transam/twophase.c | 2
backend/access/transam/xact.c | 664 +++++++++++++++++++!!!!!!!!
backend/access/transam/xlog.c | 58 +-
backend/storage/ipc/procarray.c | 781 +++++++++++++++++++++++++++!!!
backend/storage/lmgr/proc.c | 107 +++++
backend/utils/time/tqual.c | 27 !
include/access/xact.h | 26 +
include/access/xlog.h | 44 +!
include/access/xlog_internal.h | 2
include/catalog/pg_control.h | 3
include/storage/proc.h | 4
include/storage/procarray.h | 17
include/utils/snapshot.h | 65 +++
14 files changed, 1541 insertions(+), 19 deletions(-), 256 mods(!)

Prepared transactions do not yet work correctly, but the principles are
the same so I expect this to be done in a similar way. Any input
welcome.

Other related patches are
* recovery_infrastruc.v9.patch
* atomic_subxids.v7.patch
They don't all apply cleanly together, but the changes are unrelated, so
those patches can still be reviewed without wasting energy.

Next phase is connecting and running queries, next few days. That will
probably shake out a few more bugs from this code.

Comments welcome.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

Attachments:

hot_standby_snapshot.v1.patch (text/x-patch; +1797 -275)
#5 Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#4)
Re: Deriving Recovery Snapshots

On Thu, 2008-10-16 at 13:55 +0100, Simon Riggs wrote:

Other related patches are
* recovery_infrastruc.v9.patch
* atomic_subxids.v7.patch
They don't all apply cleanly together, but the changes are unrelated, so
those patches can still be reviewed without wasting energy.

Next phase is connecting and running queries, next few days. That will
probably shake out a few more bugs from this code.

I've integrated my five patches together into one now:
* recovery_infrastruc.v9.patch
* atomic_subxids.v7.patch
* hs_connect
* hs_checks
* hs_snapshot

Seems positive that it all integrated so quickly and tests OK.
More later.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#6 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#1)
Re: Deriving Recovery Snapshots

Simon Riggs wrote:

Each backend that existed on the master is represented by a PROC
structure in the ProcArray. These are known as "recovery procs" and are
similar to the dummy procs used for prepared transactions. All recovery
procs are "owned by" the Startup process. So there is no process for
which MyProc points to one of the recovery procs. This structure allows
us to record the top-level transactions and then put subtransactions in
the proc's subtransaction cache. A fixed one-to-one correspondence
allows efficient maintenance of the structures. We emulate all
transactional backends, including autovacuum.

We'll need to know the max_connections setting in the master, in order
to size the array correctly. Not a show-stopper, but would be nicer if
we didn't need to.

* The backend slot may not be reused for some time, so we should take
additional actions to keep state current and true. So we choose to log a
snapshot from the master into WAL after each checkpoint. This can then
be used to cleanup any unobserved xids. It also provides us with our
initial state data, see later.

We don't need to log a complete snapshot, do we? Just oldestxmin should
be enough.

UnobservedXids is maintained as a sorted array. This comes for free
since xids are always added in xid assignment order. This allows xids to
be removed via bsearch when WAL records arrive for the missing xids. It
also allows us to stop searching for xids once we reach
latestCompletedXid.

If we're going to have an UnobservedXids array, why don't we just treat
all in-progress transactions as Unobserved, and forget about the dummy
PROC entries?

Also, I can't help thinking that this would be a lot simpler if we just
treated all subtransactions the same as top-level transactions. The only
problem with that is that there can be a lot of subtransactions, which
means that we'd need a large UnobservedXids array to handle the worst
case, but maybe it would still be acceptable?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#7 Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#5)
Re: Deriving Recovery Snapshots

On Thu, 2008-10-16 at 15:20 +0100, Simon Riggs wrote:

I've integrated my five patches together into one now:
* recovery_infrastruc.v9.patch
* atomic_subxids.v7.patch
* hs_connect
* hs_checks
* hs_snapshot

Seems positive that it all integrated so quickly and tests OK.
More later.

Wahoo! Worked first time.

postgres=# select count(*) from branches;
count
-------
1
(1 row)

That's with archive recovery waiting for next WAL file, so not applying
new changes or holding locks.

So that's the good news. Still have much more work to make everything
work safely while WAL apply happens. And the shooting hasn't started yet
either. So there's light, but nowhere near the end of the tunnel.

So looks like I'll be shipping first combined patch tomorrow.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#8 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#6)
Re: Deriving Recovery Snapshots

On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:

Simon Riggs wrote:

Each backend that existed on the master is represented by a PROC
structure in the ProcArray. These are known as "recovery procs" and are
similar to the dummy procs used for prepared transactions. All recovery
procs are "owned by" the Startup process. So there is no process for
which MyProc points to one of the recovery procs. This structure allows
us to record the top-level transactions and then put subtransactions in
the proc's subtransaction cache. A fixed one-to-one correspondence
allows efficient maintenance of the structures. We emulate all
transactional backends, including autovacuum.

We'll need to know the max_connections setting in the master, in order
to size the array correctly. Not a show-stopper, but would be nicer if
we didn't need to.

Yes. We'll probably need to add checks/configurability later. Unless you
have a way...

* The backend slot may not be reused for some time, so we should take
additional actions to keep state current and true. So we choose to log a
snapshot from the master into WAL after each checkpoint. This can then
be used to cleanup any unobserved xids. It also provides us with our
initial state data, see later.

We don't need to log a complete snapshot, do we? Just oldestxmin should
be enough.

Possibly, but you're thinking that once we're up and running we can use
less info.

Trouble is, you don't know when/if the standby will crash/be shutdown.
So we need regular full snapshots to allow it to re-establish full
information at regular points. So we may as well drop the whole snapshot
to WAL every checkpoint. To do otherwise would mean more code and less
flexibility.

With default settings that is at most 25600 bytes for subxid cache, plus
maybe 2000 bytes for other info. For most cases, we will use less than
1 wal buffer.
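
(That 25600-byte figure checks out if we assume the defaults of the
day: max_connections = 100 backends, each with a 64-entry subxid cache
of 4-byte xids, gives 100 * 64 * 4 = 25600 bytes.)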

UnobservedXids is maintained as a sorted array. This comes for free
since xids are always added in xid assignment order. This allows xids to
be removed via bsearch when WAL records arrive for the missing xids. It
also allows us to stop searching for xids once we reach
latestCompletedXid.

If we're going to have an UnobservedXids array, why don't we just treat
all in-progress transactions as Unobserved, and forget about the dummy
PROC entries?

That's a good question and I expected some debate on that.

The main problem is fatal errors that don't write abort records. By
reusing the PROC entries we can keep those to a manageable limit. If we
don't have that, the number of fatal errors could cause that list to
grow uncontrollably and we might overflow any setting, causing snapshots
to stall and new queries to hang. We really must have a way to place an
upper bound on the number of unobserved xacts. So we really need the
proc approach. But we also need the UnobservedXids array.

It's definitely more code to have both, so I would not have chosen that
route if there was another way. The simple approach just doesn't cover
all possible cases, and we need to cover them all.

Having only an UnobservedXid array was my first thought and I said
earlier I would do it without using procs. Bad idea. Using the
UnobservedXids array means every xact removal requires a bsearch,
whereas with procs we can do a direct lookup, removing all xids in one
stroke. Much better for typical cases. Also, if we have procs we can use
the "no locks" approach in some cases, as per current practice on new
xid insertions.

Also, I can't help thinking that this would be a lot simpler if we just
treated all subtransactions the same as top-level transactions. The only
problem with that is that there can be a lot of subtransactions, which
means that we'd need a large UnobservedXids array to handle the worst
case, but maybe it would still be acceptable?

Yes, you see the problem. Without subtransactions, this would be a
simple issue to solve.

In one sense, I do as you say. When we make a snapshot we stuff the
UnobservedXids into the snapshot *somewhere*. We don't know whether they
are top level or subxacts. But we need a solution for when we run out of
top-level xid places in the snapshot. Which has now been provided,
luckily.

If we have no upper bound on snapshot size then *all* backends would
need a variable size snapshot. We must solve that problem or accept
having people wait maybe minutes for a snapshot in worst case. I've
found one way of placing a bound on the number of xids we need to keep
in the snapshot. If there is another, better way of keeping it bounded I
will happily adopt it. I spent about 2 weeks sweating this issue...

I'm available tomorrow to talk in real time if people in the Dev room
at PGday want to discuss this, or have me explain the patch(es).

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#9 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#8)
Re: Deriving Recovery Snapshots

Simon Riggs wrote:

On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:

Simon Riggs wrote:

* The backend slot may not be reused for some time, so we should take
additional actions to keep state current and true. So we choose to log a
snapshot from the master into WAL after each checkpoint. This can then
be used to cleanup any unobserved xids. It also provides us with our
initial state data, see later.

We don't need to log a complete snapshot, do we? Just oldestxmin should
be enough.

Possibly, but you're thinking that once we're up and running we can use
less info.

Trouble is, you don't know when/if the standby will crash/be shutdown.
So we need regular full snapshots to allow it to re-establish full
information at regular points. So we may as well drop the whole snapshot
to WAL every checkpoint. To do otherwise would mean more code and less
flexibility.

Surely it's less code to write the OldestXmin to the checkpoint record,
rather than a full snapshot, no? And to read it off the checkpoint record.

UnobservedXids is maintained as a sorted array. This comes for free
since xids are always added in xid assignment order. This allows xids to
be removed via bsearch when WAL records arrive for the missing xids. It
also allows us to stop searching for xids once we reach
latestCompletedXid.

If we're going to have an UnobservedXids array, why don't we just treat
all in-progress transactions as Unobserved, and forget about the dummy
PROC entries?

That's a good question and I expected some debate on that.

The main problem is fatal errors that don't write abort records. By
reusing the PROC entries we can keep those to a manageable limit. If we
don't have that, the number of fatal errors could cause that list to
grow uncontrollably and we might overflow any setting, causing snapshots
to stall and new queries to hang. We really must have a way to place an
upper bound on the number of unobserved xacts. So we really need the
proc approach. But we also need the UnobservedXids array.

If you write the oldestxmin (or a full snapshot, including the
oldestxmin) to each checkpoint record, you can crop out any unobserved
xids older than that, when you replay the checkpoint record.

Having only an UnobservedXid array was my first thought and I said
earlier I would do it without using procs. Bad idea. Using the
UnobservedXids array means every xact removal requires a bsearch,
whereas with procs we can do a direct lookup, removing all xids in one
stroke. Much better for typical cases.

How much does that really matter? Under normal circumstances, the array
would be quite small anyway. A bsearch of a relatively small array isn't
that expensive. Or a hash table, so that removing/inserting items
doesn't need to shift all the following entries.

Also, I can't help thinking that this would be a lot simpler if we just
treated all subtransactions the same as top-level transactions. The only
problem with that is that there can be a lot of subtransactions, which
means that we'd need a large UnobservedXids array to handle the worst
case, but maybe it would still be acceptable?

Yes, you see the problem. Without subtransactions, this would be a
simple issue to solve.

In one sense, I do as you say. When we make a snapshot we stuff the
UnobservedXids into the snapshot *somewhere*. We don't know whether they
are top level or subxacts. But we need a solution for when we run out of
top-level xid places in the snapshot. Which has now been provided,
luckily.

If we have no upper bound on snapshot size then *all* backends would
need a variable size snapshot. We must solve that problem or accept
having people wait maybe minutes for a snapshot in worst case. I've
found one way of placing a bound on the number of xids we need to keep
in the snapshot. If there is another, better way of keeping it bounded I
will happily adopt it. I spent about 2 weeks sweating this issue...

How about:

1. Keep all transactions and subtransactions in UnobservedXids.
2. If it fills up, remove from it all xids that the startup process
knows to be subtransactions and whose parents it knows, and update
subtrans. Mark the array as overflowed.

To take a snapshot, a backend simply copies the UnobservedXids array and
the flag. If it hasn't overflowed, a transaction is considered to be in
progress if it's in the array. If it has overflowed, and the xid is not
in the array, check subtrans.

Note that the startup process sees all WAL records, so it can do
arbitrarily complex bookkeeping in backend-private memory, and only
expose the necessary parts in shared mem. For example, it can keep track
of the parent-child relationships of the xids in UnobservedXids, but the
backends taking snapshots don't need to know about that. For step 2 to
work, that's exactly what the startup process needs to keep track of.
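
A sketch of step 2 under that scheme (LookupKnownParent stands in for
whatever private map the startup process keeps; it is not an existing
function):

static void
UnobservedXidsOverflow(void)
{
    int     i,
            keep = 0;

    for (i = 0; i < nUnobservedXids; i++)
    {
        TransactionId xid = UnobservedXids[i];
        TransactionId parent = LookupKnownParent(xid);

        if (TransactionIdIsValid(parent))
            SubTransSetParent(xid, parent);     /* move to subtrans */
        else
            UnobservedXids[keep++] = xid;       /* parent unknown: keep */
    }
    nUnobservedXids = keep;
    UnobservedXidsOverflowed = true;    /* snapshots now check subtrans */
}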

For the startup process to know about the parent-child relationships,
we'll need something like WAL changes you suggested. I'm not too
thrilled about adding a new field to all WAL records. Seems simpler to
just rely on the new WAL records on AssignTransactionId(), and we can
only do it, say, every 100 subtransactions, if we make the
UnobservedXids array big enough (100*max_connections).

This isn't actually that different from your proposal. The big
difference is that instead of PROC entries and UnobservedXids, all
transactions are tracked in UnobservedXids, and instead of caching
subtransactions in the subxids array in PROC entries, they're cached in
UnobservedXids as well.

Another, completely different approach would be to forget about xid
arrays altogether, and change the way snapshots are taken: just do a
full memcpy of the clog between xmin and xmax. That might be pretty slow
if xmax-xmin is big, though.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#10 Hannu Krosing
hannu@tm.ee
In reply to: Heikki Linnakangas (#6)
Re: Deriving Recovery Snapshots

On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:

Also, I can't help thinking that this would be a lot simpler if we just
treated all subtransactions the same as top-level transactions. The only
problem with that is that there can be a lot of subtransactions, which
means that we'd need a large UnobservedXids array to handle the worst
case, but maybe it would still be acceptable?

I remember cases on this list where long transactions did run out of
subtransaction ids. To accommodate something approaching that we need an
array for storing (max_connections * 4G) UnobservedXids instead of just
max_connections.

-----------------
Hannu

#11 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#9)
Re: Deriving Recovery Snapshots

On Wed, 2008-10-22 at 12:29 +0300, Heikki Linnakangas wrote:

Simon Riggs wrote:

On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:

Simon Riggs wrote:

* The backend slot may not be reused for some time, so we should take
additional actions to keep state current and true. So we choose to log a
snapshot from the master into WAL after each checkpoint. This can then
be used to cleanup any unobserved xids. It also provides us with our
initial state data, see later.

We don't need to log a complete snapshot, do we? Just oldestxmin should
be enough.

Possibly, but you're thinking that once we're up and running we can use
less info.

Trouble is, you don't know when/if the standby will crash/be shutdown.
So we need regular full snapshots to allow it to re-establish full
information at regular points. So we may as well drop the whole snapshot
to WAL every checkpoint. To do otherwise would mean more code and less
flexibility.

Surely it's less code to write the OldestXmin to the checkpoint record,
rather than a full snapshot, no? And to read it off the checkpoint record.

You may be missing my point.

We need an initial state to work from.

I am proposing we write a full snapshot after each checkpoint because it
allows us to start recovery again from that point. If we wrote only
OldestXmin as you suggest it would optimise the size of the WAL record
but it would prevent us from restarting at that point.

Also, passing OldestXmin only would not work in the presence of long
running statements. Passing the snapshot allows us to see that FATAL
errors have occurred much sooner.

BTW, the way I have coded it means that if we skip writing a checkpoint
on a quiet system then we would also skip writing the snapshot.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#12 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#9)
Re: Deriving Recovery Snapshots

On Wed, 2008-10-22 at 12:29 +0300, Heikki Linnakangas wrote:

How about:

1. Keep all transactions and subtransactions in UnobservedXids.
2. If it fills up, remove from it all xids that the startup process
knows to be subtransactions and whose parents it knows, and update
subtrans. Mark the array as overflowed.

To take a snapshot, a backend simply copies the UnobservedXids array and
the flag. If it hasn't overflowed, a transaction is considered to be in
progress if it's in the array. If it has overflowed, and the xid is not
in the array, check subtrans.

We can't check subtrans. We do not have any record of what the parent is
for an unobserved transaction id. So the complete list of unobserved
xids *must* be added to the snapshot. If that makes snapshot overflow,
we have a big problem: we would be forced to say "sorry snapshot cannot
be issued at this time, please wait". Ugh!

Note that the startup process sees all WAL records, so it can do
arbitrarily complex bookkeeping in backend-private memory, and only
expose the necessary parts in shared mem. For example, it can keep track
of the parent-child relationships of the xids in UnobservedXids, but the
backends taking snapshots don't need to know about that. For step 2 to
work, that's exactly what the startup process needs to keep track of.

For the startup process to know about the parent-child relationships,
we'll need something like WAL changes you suggested. I'm not too
thrilled about adding a new field to all WAL records. Seems simpler to
just rely on the new WAL records on AssignTransactionId(), and we can
only do it, say, every 100 subtransactions, if we make the
UnobservedXids array big enough (100*max_connections).

Yes, we can make the UnobservedXids array bigger, but only to the point
where it will all fit within a snapshot.

The WAL changes proposed use space that was previously wasted, so there
is no increase in the amount of data going to disk. The additional time to
derive that data is very quick when those fields are unused and that
logic is executed before we take WALInsertLock. So overall, very low
overhead.

Every new subxid needs to specify its parent's xid. We must supply that
information somehow: either via an XLOG_XACT_ASSIGNMENT, or as I have
done in most cases, tuck that into the wasted space on the xlrec.
Writing a WAL record every 100 subtransactions will not work: we need to
write to subtrans *before* that xid appears anywhere on disk, so that
visibility tests can determine the status of the transaction.

The approach I have come up with is very finely balanced. It's the
*only* approach that I've come up with that covers all requirements;
there were very few technical choices to make. If it wasn't for
subtransactions, disappearing transactions because of FATAL errors and
unobserved xids it would be much simpler. But having said that, the code
isn't excessively complex, I wrote it in about 3 days.

This isn't actually that different from your proposal. The big
difference is that instead of PROC entries and UnobservedXids, all
transactions are tracked in UnobservedXids, and instead of caching
subtransactions in the subxids array in PROC entries, they're cached in
UnobservedXids as well.

Another, completely different approach would be to forget about xid
arrays altogether, and change the way snapshots are taken: just do a
full memcpy of the clog between xmin and xmax. That might be pretty slow
if xmax-xmin is big, though.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#13 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#12)
Re: Deriving Recovery Snapshots

Simon Riggs wrote:

On Wed, 2008-10-22 at 12:29 +0300, Heikki Linnakangas wrote:

How about:

1. Keep all transactions and subtransactions in UnobservedXids.
2. If it fills up, remove from it all xids that the startup process
knows to be subtransactions and whose parents it knows, and update
subtrans. Mark the array as overflowed.

To take a snapshot, a backend simply copies the UnobservedXids array and
the flag. If it hasn't overflowed, a transaction is considered to be in
progress if it's in the array. If it has overflowed, and the xid is not
in the array, check subtrans.

We can't check subtrans. We do not have any record of what the parent is
for an unobserved transaction id. So the complete list of unobserved
xids *must* be added to the snapshot. If that makes snapshot overflow,
we have a big problem: we would be forced to say "sorry snapshot cannot
be issued at this time, please wait". Ugh!

That's why we still need the occasional WAL logging in
AssignTransactionId(). To log the parent-child relationships of the
subtransactions.

For the startup process to know about the parent-child relationships,
we'll need something like WAL changes you suggested. I'm not too
thrilled about adding a new field to all WAL records. Seems simpler to
just rely on the new WAL records on AssignTransactionId(), and we can
only do it, say, every 100 subtransactions, if we make the
UnobservedXids array big enough (100*max_connections).

Yes, we can make the UnobservedXids array bigger, but only to the point
where it will all fit within a snapshot.

The list of xids in a snapshot is just a palloc'd array, in
backend-local memory, so we can easily make it as large as we need to.

Every new subxid needs to specify its parent's xid. We must supply that
information somehow: either via an XLOG_XACT_ASSIGNMENT, or as I have
done in most cases, tuck that into the wasted space on the xlrec.
Writing a WAL record every 100 subtransactions will not work: we need to
write to subtrans *before* that xid appears anywhere on disk, so that
visibility tests can determine the status of the transaction.

I don't follow. It doesn't need to be in subtrans before it appears on
disk, AFAICS. It can be stored in UnobservedXids at first, and when it
overflows, we can update subtrans and remove the entries from
UnobservedXids. A snapshot taken before the overflow will have the
subxid in its copy of UnobservedXids, and one taken after overflow will
find it in subtrans.

If UnobservedXids is large enough to hold, say 100 * max_connections
xids, by writing a WAL record containing the parent-child relationships
every 100 assigned subtransactions within a top-level transaction, the
top-level transactions and those subtransactions that we don't know the
parent of will always fit into UnobservedXids.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#14 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#11)
Re: Deriving Recovery Snapshots

Simon Riggs wrote:

On Wed, 2008-10-22 at 12:29 +0300, Heikki Linnakangas wrote:

Simon Riggs wrote:

On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:

Simon Riggs wrote:

* The backend slot may not be reused for some time, so we should take
additional actions to keep state current and true. So we choose to log a
snapshot from the master into WAL after each checkpoint. This can then
be used to cleanup any unobserved xids. It also provides us with our
initial state data, see later.

We don't need to log a complete snapshot, do we? Just oldestxmin should
be enough.

Possibly, but you're thinking that once we're up and running we can use
less info.

Trouble is, you don't know when/if the standby will crash/be shutdown.
So we need regular full snapshots to allow it to re-establish full
information at regular points. So we may as well drop the whole snapshot
to WAL every checkpoint. To do otherwise would mean more code and less
flexibility.

Surely it's less code to write the OldestXmin to the checkpoint record,
rather than a full snapshot, no? And to read it off the checkpoint record.

You may be missing my point.

We need an initial state to work from.

I am proposing we write a full snapshot after each checkpoint because it
allows us to start recovery again from that point. If we wrote only
OldestXmin as you suggest it would optimise the size of the WAL record
but it would prevent us from restarting at that point.

Well, you'd just need to treat anything > oldestxmin, and not marked as
finished in clog, as unobserved. Which doesn't seem too bad. Not that
storing the full list of in-progress xids is that bad either, though.

Hmm. What about in-progress subtransactions that have overflowed the
shared mem cache? Can we rely that subtrans is up-to-date, up to the
checkpoint record that recovery starts from?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#15 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Hannu Krosing (#10)
Re: Deriving Recovery Snapshots

Hannu Krosing wrote:

On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:

Also, I can't help thinking that this would be a lot simpler if we just
treated all subtransactions the same as top-level transactions. The only
problem with that is that there can be a lot of subtransactions, which
means that we'd need a large UnobservedXids array to handle the worst
case, but maybe it would still be acceptable?

I remember cases on this list where long transactions did run out of
subtransaction ids. To accommodate something approaching that we need an
array for storing (max_connections * 4G) UnobservedXids instead of just
max_connections.

You can't have more than 4G (or 2G?) active subtransactions running in a
system, because you will simply run out of transaction ids and hit xid
wrap-around after that. So in the worst-case, you don't need space for
(max_connections * 4G) xids, just 4G. That's still a large number, of
course.

In situations like that, a bitmap, like clog, instead of an array, would
be more space efficient. But that's less efficient in the more common
case where there are few in-progress transactions, but some of them are
very old.
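
(For scale, assuming clog's 2 bits per transaction: 4G xids occupy
about 1 GB as a clog-style bitmap versus about 16 GB as an array of
4-byte xids, while a more typical xmax - xmin of one million xids means
copying only ~250 kB of clog.)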

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#16 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#13)
Re: Deriving Recovery Snapshots

On Wed, 2008-10-22 at 17:18 +0300, Heikki Linnakangas wrote:

It doesn't need to be in subtrans before it appears on
disk, AFAICS.

I don't make these rules. Read comments in AssignTransactionId().

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#17 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#12)
Re: Deriving Recovery Snapshots

Simon Riggs wrote:

If that makes snapshot overflow,
we have a big problem: we would be forced to say "sorry snapshot cannot
be issued at this time, please wait". Ugh!

BTW, we'll need to do that anyway, if we guess the max_connections
setting in the master incorrectly. We should avoid having to do that
under other circumstances, of course.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#18 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#13)
Re: Deriving Recovery Snapshots

On Wed, 2008-10-22 at 17:18 +0300, Heikki Linnakangas wrote:

The list of xids in a snapshot is just a palloc'd array, in
backend-local memory, so we can easily make it as large as we need to.

It's malloc'd before we hold the lock. Yes, we can make it as large as
we need to, but that would change the way it is allocated significantly
and that will require either extended lock hold durations or multiple
lock acquisitions. That doesn't sound good.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#19 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#14)
Re: Deriving Recovery Snapshots

On Wed, 2008-10-22 at 17:29 +0300, Heikki Linnakangas wrote:

Simon Riggs wrote:

You may be missing my point.

We need an initial state to work from.

I am proposing we write a full snapshot after each checkpoint because it
allows us to start recovery again from that point. If we wrote only
OldestXmin as you suggest it would optimise the size of the WAL record
but it would prevent us from restarting at that point.

Well, you'd just need to treat anything > oldestxmin, and not marked as
finished in clog, as unobserved. Which doesn't seem too bad. Not that
storing the full list of in-progress xids is that bad either, though.

As I said, I'm storing the whole list to give me an initial state.

Hmm. What about in-progress subtransactions that have overflowed the
shared mem cache? Can we rely that subtrans is up-to-date, up to the
checkpoint record that recovery starts from?

Yes, we can't use snapshots as an initial state if they have
overflowed. In that case we could avoid dropping the full snapshot to
WAL; that's probably trivial to implement.

Currently, the patch can start from a non-overflowed snapshot, but it
could also start from an overflowed snapshot and wait until we receive a
snapshot with an xmin > the xmax of the first snapshot. That seemed too
much icing for the first patch. I think we need to all agree on the
correctness aspects before we attempt too much tuning and tweaking.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#20 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#17)
Re: Deriving Recovery Snapshots

On Wed, 2008-10-22 at 17:40 +0300, Heikki Linnakangas wrote:

Simon Riggs wrote:

If that makes snapshot overflow,
we have a big problem: we would be forced to say "sorry snapshot cannot
be issued at this time, please wait". Ugh!

BTW, we'll need to do that anyway, if we guess the max_connections
setting in the master incorrectly. We should avoid having to do that
under other circumstances, of course.

I'm sure we can handle that in simpler ways. I imagined we would
just issue a message saying "must increase max_connections" and then
shutdown. It's easy enough to document that if you change the master you
must also change the slave - it's not a very common change after all.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#21 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#16)
#22 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#21)
#23 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#22)
#24 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#23)
#25 Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#24)
#26 Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#24)
#27 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#26)
#28 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#27)
#29 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#25)
#30 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#29)