cheaper snapshots
On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wonder whether we could do something involving WAL properties --- the
current tuple visibility logic was designed before WAL existed, so it's
not exploiting that resource at all. I'm imagining that the kernel of a
snapshot is just a WAL position, ie the end of WAL as of the time you
take the snapshot (easy to get in O(1) time). Visibility tests then
reduce to "did this transaction commit with a WAL record located before
the specified position?". You'd need some index datastructure that made
it reasonably cheap to find out the commit locations of recently
committed transactions, where "recent" means "back to recentGlobalXmin".
That seems possibly do-able, though I don't have a concrete design in
mind.
I was mulling this idea over some more (the same ideas keep floating
back to the top...). I don't think an LSN can actually work, because
there's no guarantee that the order in which the WAL records are
emitted is the same order in which the effects of the transactions
become visible to new snapshots. For example:
1. Transaction A inserts its commit record, flushes WAL, and begins
waiting for sync rep.
2. A moment later, transaction B sets synchronous_commit=off, inserts
its commit record, requests a background WAL flush, and removes itself
from the ProcArray.
3. Transaction C takes a snapshot.
C's snapshot will see B as committed (B has left the ProcArray) but not
A (A is still in the ProcArray, waiting for sync rep), even though A's
commit record precedes B's in the WAL. Sync rep doesn't create this
problem; there's a race anyway, because the order of acquisition for
WALInsertLock needn't match that for ProcArrayLock. This has the
more-than-slightly-odd characteristic that you could end up with a
snapshot on the master that can see B but not A and a snapshot on the
slave that can see A but not B.
But having said that an LSN can't work, I don't see why we can't just
use a 64-bit counter. In fact, the predicate locking code already
does something much like this, using an SLRU, for serializable
transactions only. In more detail, what I'm imagining is an array
with 4 billion entries, one per XID, probably broken up into files of
say 16MB each with 2 million entries per file. Each entry is a 64-bit
value. It is 0 if the XID has not yet started, is still running, or
has aborted. Otherwise, it is the commit sequence number of the
transaction. For reasons I'll explain below, I'm imagining starting
the commit sequence number counter at some very large value and having
it count down from there. So the basic operations are:
- To take a snapshot, you just read the counter.
- To commit a transaction which has an XID, you read the counter,
stamp all your XIDs with that value, and decrement the counter.
- To find out whether an XID is visible to your snapshot, you look up
the XID in the array and get the counter value. If the value you read
is greater than your snapshot value, it's visible. If it's less, it's
not.
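A minimal sketch of the scheme just described (Python, single-process; all names are invented for illustration, and the real array would of course live in shared memory, one 64-bit slot per XID):

```python
# Sketch of the commit-sequence-number (CSN) idea above: a map from
# XID to CSN, plus a counter that starts very high and counts down.

START = 2**62          # arbitrary large starting value for the counter
counter = START
csn_of = {}            # xid -> CSN; missing/0 = not started, running, or aborted

def take_snapshot():
    # O(1): a snapshot is just the current counter value
    return counter

def commit(xids):
    # stamp every XID of the committing transaction (including any
    # subtransactions) with the current counter, then decrement it
    global counter
    for xid in xids:
        csn_of[xid] = counter
    counter -= 1

def xid_visible(xid, snapshot):
    # visible iff the XID committed before the snapshot was taken,
    # i.e. its CSN is greater than the snapshot's counter value
    return csn_of.get(xid, 0) > snapshot

commit([100])              # commits before the snapshot: visible
snap = take_snapshot()
commit([101])              # commits after the snapshot: not visible
assert xid_visible(100, snap)
assert not xid_visible(101, snap)
assert not xid_visible(999, snap)   # no CSN recorded: not visible
```

The single-threaded sketch sidesteps exactly the locking and memory-ordering questions discussed next.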
Now, is this algorithm any good, and how little locking can we get away with?
It seems to me that if we used an SLRU to store the array, the lock
contention would be even worse than it is under our current system,
wherein everybody fights over ProcArrayLock. A system like this is
going to involve lots and lots of probes into the array (even if we
build a per-backend cache of some kind) and an SLRU will require at
least one LWLock acquire and release per probe. Some kind of locking
is pretty much unavoidable, because you have to worry about pages
getting evicted from shared memory. However, what if we used a set of
files (like SLRU) but mapped them separately into each backend's
address space? I think this would allow both loads and stores from
the array to be done unlocked. One fly in the ointment is that 8-byte
stores are apparently done as two 4-byte stores on some platforms.
But if the counter runs backward, I think even that is OK. If you
happen to read an 8 byte value as it's being written, you'll get 4
bytes of the intended value and 4 bytes of zeros. The value will
therefore appear to be less than what it should be. However, if the
value was in the midst of being written, then it's still in the midst
of committing, which means that that XID wasn't going to be visible
anyway. Accidentally reading a smaller value doesn't change the
answer.
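The torn-read argument can be checked concretely. The slot for an uncommitted XID holds zero, so a torn 8-byte read sees one 4-byte half of the stamped CSN and zeros in the other half, whichever half lands first (illustrative Python):

```python
# A torn read of a slot being stamped yields one half of the new CSN
# and zeros in the other half, depending on which 4-byte store the
# reader catches.
def possible_torn_reads(csn):
    high, low = csn >> 32, csn & 0xFFFFFFFF
    return [high << 32,   # only the high half written so far
            low]          # only the low half written so far

csn = 0x7FFF_FFFF_1234_5678   # some in-flight commit's CSN
for seen in possible_torn_reads(csn):
    # the torn value never exceeds the real CSN, so the XID can only
    # look "not yet visible" -- the right answer while its commit is
    # still in progress
    assert seen <= csn
```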
Committing will require a lock on the counter.
Taking a snapshot can be done unlocked if (1) 8-byte reads are atomic
and either (2a) the architecture has strong memory ordering (no
store/store reordering) or (2b) you insert a memory fence between
stamping the XIDs and decrementing the counter. Otherwise, taking a
snapshot will also require a lock on the counter.
Once a particular XID precedes RecentGlobalXmin, you no longer care
about the associated counter value. You just need to know that it
committed; the order no longer matters. So after a crash, assuming
that you have the CLOG bits available, you can just throw away all the
array contents and start the counter over at the highest possible
value. And, as RecentGlobalXmin advances, you can prune away (or
recycle) files that are no longer needed. In fact, you could put all
of these files on a ramfs, or maybe use shared memory segments for
them.
All that having been said, even if I haven't made any severe
conceptual errors in the above, I'm not sure how well it will work in
practice. On the plus side, taking a snapshot becomes O(1) rather
than O(MaxBackends) - that's good. On the further plus side, you can
check both whether an XID has committed and whether it's visible to
your snapshot in a single, atomic action with no lock - that seems
really good. On the minus side, checking an xid against your snapshot
now has less locality of reference. And, rolling over into a new
segment of the array is going to require everyone to map it, and maybe
cause some disk I/O as a new file gets created.
Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 28, 2011 at 3:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
All that having been said, even if I haven't made any severe
conceptual errors in the above, I'm not sure how well it will work in
practice. On the plus side, taking a snapshot becomes O(1) rather
than O(MaxBackends) - that's good. On the further plus side, you can
check both whether an XID has committed and whether it's visible to
your snapshot in a single, atomic action with no lock - that seems
really good. On the minus side, checking an xid against your snapshot
now has less locality of reference. And, rolling over into a new
segment of the array is going to require everyone to map it, and maybe
cause some disk I/O as a new file gets created.
Sounds like the right set of thoughts to be having.
If you do this, you must cover subtransactions and Hot Standby. Work
in this area takes longer than you think when you take the
complexities into account, as you must.
I think you should take the premise of making snapshots O(1) and look
at all the ways of doing that. If you grab too early at a solution you
may grab the wrong one.
For example, another approach would be to use a shared hash table.
Snapshots are O(1), committing is O(k), using the snapshot is O(logN).
N can be kept small by regularly pruning the hash table. If we crash
we lose the hash table - no matter. (I'm not suggesting this is
better, just a different approach that should be judged across
others).
What I'm not sure in any of these ideas is how to derive a snapshot xmin.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Jul 28, 2011, at 04:51, Robert Haas wrote:
One fly in the ointment is that 8-byte
stores are apparently done as two 4-byte stores on some platforms.
But if the counter runs backward, I think even that is OK. If you
happen to read an 8 byte value as it's being written, you'll get 4
bytes of the intended value and 4 bytes of zeros. The value will
therefore appear to be less than what it should be. However, if the
value was in the midst of being written, then it's still in the midst
of committing, which means that that XID wasn't going to be visible
anyway. Accidentally reading a smaller value doesn't change the
answer.
That only works if the update of the most-significant word is guaranteed
to be visible before the update to the least-significant one. Which
I think you can only enforce if you update the words individually
(and use a fence on e.g. PPC32). Otherwise you're at the mercy of the
compiler.
Otherwise, the following might happen (with a 2-byte value instead of an
8-byte one, and the assumption that 1-byte stores are atomic while 2-byte
ones aren't, just to keep the numbers smaller; the machine is assumed to be
big-endian).
The counter is at 0xff00.
Backend 1 decrements it to 0xfeff, i.e. does
(1) STORE [counter+1], 0xff
(2) STORE [counter], 0xfe
Backend 2 reads
(1') LOAD [counter+1]
(2') LOAD [counter]
If the sequence of events is (1), (1'), (2'), (2), backend 2 will read
0xffff which is higher than it should be.
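The interleaving is easy to replay in code (illustrative Python; big-endian byte order and a 2-byte counter, as in the example):

```python
# Big-endian memory: mem[0] is the high byte, mem[1] the low byte.
mem = [0xff, 0x00]        # counter = 0xff00; decrementing gives 0xfeff

mem[1] = 0xff             # (1)  writer stores the new low byte
lo = mem[1]               # (1') reader loads the low byte (new)
hi = mem[0]               # (2') reader loads the high byte (still old)
mem[0] = 0xfe             # (2)  writer stores the new high byte

seen = (hi << 8) | lo
assert seen == 0xffff                    # torn read
assert seen > 0xff00 and seen > 0xfeff   # higher than old AND new value
```

Because the old high byte is nonzero, the torn value comes out *higher* than either the old or the new counter value, which is exactly the failure mode the backward-counting trick cannot excuse here.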
But we could simply use a spin-lock to protect the read on machines where
we don't know for sure that 64-bit reads and writes are atomic. That'll
only really hurt on machines with 16+ cores or so, and the number of
architectures which support those isn't that high anyway. If we supported
spinlock-less operation on SPARC, x86-64, PPC64 and maybe Itanium, would we
miss any important one?
best regards,
Florian Pflug
On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wonder whether we could do something involving WAL properties --- the
current tuple visibility logic was designed before WAL existed, so it's
not exploiting that resource at all. I'm imagining that the kernel of a
snapshot is just a WAL position, ie the end of WAL as of the time you
take the snapshot (easy to get in O(1) time). Visibility tests then
reduce to "did this transaction commit with a WAL record located before
the specified position?".
Why not just cache a "reference snapshot" near the WAL writer, and maybe
also save it at some interval in WAL, in case you ever need to restore an
old snapshot at some WAL position for things like time travel?
It may be cheaper lock-wise not to update the reference snapshot at each
commit, but to keep the latest saved snapshot and a chain of transactions
committed/aborted since. This means that when reading the snapshot you
read the current "saved snapshot" and then apply the list of commits.
When moving to a new saved snapshot you really generate a new one and
keep the old snapshot + commit chain around for a little while for those
who may still be processing it. Seems like this is something that can be
done with no locking.
You'd need some index datastructure that made
it reasonably cheap to find out the commit locations of recently
committed transactions, where "recent" means "back to recentGlobalXmin".
That seems possibly do-able, though I don't have a concrete design in
mind.
Snapshot + chain of commits is likely as cheap as it gets, unless you
additionally cache the commits in a tighter data structure. This is
because you will need them all anyway to compute the difference from the
reference snapshot.
--
-------
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/
On Thu, Jul 28, 2011 at 3:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Sounds like the right set of thoughts to be having.
Thanks.
If you do this, you must cover subtransactions and Hot Standby. Work
in this area takes longer than you think when you take the
complexities into account, as you must.
Right. This would replace the KnownAssignedXids stuff (a non-trivial
project, I am sure).
I think you should take the premise of making snapshots O(1) and look
at all the ways of doing that. If you grab too early at a solution you
may grab the wrong one.
Yeah, I'm just brainstorming at this point. This is, I think, the
best idea of what I've come up with so far, but it's definitely not
the only approach.
For example, another approach would be to use a shared hash table.
Snapshots are O(1), committing is O(k), using the snapshot is O(logN).
N can be kept small by regularly pruning the hash table. If we crash
we lose the hash table - no matter. (I'm not suggesting this is
better, just a different approach that should be judged across
others).
Sorry, I'm having a hard time understanding what you are describing
here. What would the keys and values in this hash table be, and what
do k and N refer to here?
What I'm not sure in any of these ideas is how to derive a snapshot xmin.
That is a problem. If we have to scan the ProcArray every time we
take a snapshot just to derive an xmin, we are kind of hosed. One
thought I had is that we might be able to use a sort of sloppy xmin.
In other words, we keep a cached xmin, and have some heuristic where
we occasionally try to update it. A snapshot with a too-old xmin
isn't wrong, just possibly slower. But if xmin is only slightly stale
and xids can be tested relatively quickly, it might not matter very
much.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 28, 2011 at 4:16 AM, Florian Pflug <fgp@phlo.org> wrote:
On Jul 28, 2011, at 04:51, Robert Haas wrote:
One fly in the ointment is that 8-byte
stores are apparently done as two 4-byte stores on some platforms.
But if the counter runs backward, I think even that is OK. If you
happen to read an 8 byte value as it's being written, you'll get 4
bytes of the intended value and 4 bytes of zeros. The value will
therefore appear to be less than what it should be. However, if the
value was in the midst of being written, then it's still in the midst
of committing, which means that that XID wasn't going to be visible
anyway. Accidentally reading a smaller value doesn't change the
answer.
That only works if the update of the most-significant word is guaranteed
to be visible before the update to the least-significant one. Which
I think you can only enforce if you update the words individually
(and use a fence on e.g. PPC32). Otherwise you're at the mercy of the
compiler.
Otherwise, the following might happen (with a 2-byte value instead of an
8-byte one, and the assumption that 1-byte stores are atomic while 2-byte
ones aren't, just to keep the numbers smaller; the machine is assumed to be
big-endian).
The counter is at 0xff00.
Backend 1 decrements it to 0xfeff, i.e. does
(1) STORE [counter+1], 0xff
(2) STORE [counter], 0xfe
Backend 2 reads
(1') LOAD [counter+1]
(2') LOAD [counter]
If the sequence of events is (1), (1'), (2'), (2), backend 2 will read
0xffff, which is higher than it should be.
You're confusing two different things - I agree that you need a
spinlock around reading the counter, unless 8-byte loads and stores
are atomic.
What I'm saying can be done without a lock is reading the commit-order
value for a given XID. If that's in the middle of being updated, then
the old value was zero, so the scenario you describe can't occur.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 28, 2011 at 6:50 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wonder whether we could do something involving WAL properties --- the
current tuple visibility logic was designed before WAL existed, so it's
not exploiting that resource at all. I'm imagining that the kernel of a
snapshot is just a WAL position, ie the end of WAL as of the time you
take the snapshot (easy to get in O(1) time). Visibility tests then
reduce to "did this transaction commit with a WAL record located before
the specified position?".
Why not just cache a "reference snapshot" near the WAL writer, and maybe
also save it at some interval in WAL, in case you ever need to restore an
old snapshot at some WAL position for things like time travel?
It may be cheaper lock-wise not to update the reference snapshot at each
commit, but to keep the latest saved snapshot and a chain of transactions
committed/aborted since. This means that when reading the snapshot you
read the current "saved snapshot" and then apply the list of commits.
Yeah, interesting idea. I thought about that. You'd need not only
the list of commits but also the list of XIDs that had been published,
since the commits have to be removed from the snapshot and the
newly-published XIDs have to be added to it (in case they commit later
while the snapshot is still in use).
You can imagine doing this with a pair of buffers. You write a
snapshot into the beginning of the first buffer and then write each
XID that is published or commits into the next slot in the array.
When the buffer is filled up, the next process that wants to publish
an XID or commit scans through the array and constructs a new snapshot
that compacts away all the begin/commit pairs and writes it into the
second buffer, and all new snapshots are taken there. When that
buffer fills up you flip back to the first one. Of course, you need
some kind of synchronization to make sure that you don't flip back to
the first buffer while some laggard is still using it to construct a
snapshot that he started taking before you flipped to the second one,
but maybe that could be made light-weight enough not to matter.
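A single-threaded sketch of this two-buffer idea (Python; all names invented, and the synchronization worries just mentioned are deliberately ignored):

```python
# Each buffer holds a base snapshot (set of running XIDs) plus a log
# of events: ('begin', xid) or ('commit', xid). When the active
# buffer's log fills up, the next writer compacts it into a fresh
# base snapshot in the other buffer and flips.
BUFSZ = 4

class SnapshotBuffers:
    def __init__(self):
        self.bufs = [{'base': set(), 'events': []}, None]
        self.active = 0

    def record(self, kind, xid):
        buf = self.bufs[self.active]
        if len(buf['events']) == BUFSZ:
            # compact away begin/commit pairs into a new base snapshot
            self.bufs[1 - self.active] = {'base': self.snapshot(),
                                          'events': []}
            self.active = 1 - self.active
            buf = self.bufs[self.active]
        buf['events'].append((kind, xid))

    def snapshot(self):
        # running XIDs = base, plus begins, minus commits
        buf = self.bufs[self.active]
        running = set(buf['base'])
        for kind, xid in buf['events']:
            running.add(xid) if kind == 'begin' else running.discard(xid)
        return running

b = SnapshotBuffers()
b.record('begin', 1); b.record('begin', 2); b.record('commit', 1)
assert b.snapshot() == {2}
b.record('begin', 3)       # log now full
b.record('commit', 2)      # forces compaction into the other buffer
assert b.snapshot() == {3}
```

Even in this toy form, taking a snapshot is O(buffer length), which previews the sizing dilemma raised below.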
I am somewhat concerned that this approach might lead to a lot of
contention over the snapshot buffers. In particular, the fact that
you have to touch shared cache lines both to advertise a new XID and
when it gets committed seems less than ideal. One thing that's kind
of interesting about the "commit sequence number" approach is that -
as far as I can tell - it doesn't require new XIDs to be advertised
anywhere at all. You don't have to worry about overflowing the
subxids[] array because it goes away altogether. The commit sequence
number itself is going to be a contention hotspot, but at least it's
small and fixed-size.
Another concern I have with this approach is - how large do you make
the buffers? If you make them too small, then you're going to have to
regenerate the snapshot frequently, which will lead to the same sort
of lock contention we have today - no one can commit while the
snapshot is being regenerated. On the other hand, if you make them
too big, then deriving a snapshot gets slow. Maybe there's some way
to make it work, but I'm afraid it might end up being yet another
arcane thing the tuning of which will become a black art among
hackers...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 2011-07-28 at 09:38 -0400, Robert Haas wrote:
On Thu, Jul 28, 2011 at 6:50 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wonder whether we could do something involving WAL properties --- the
current tuple visibility logic was designed before WAL existed, so it's
not exploiting that resource at all. I'm imagining that the kernel of a
snapshot is just a WAL position, ie the end of WAL as of the time you
take the snapshot (easy to get in O(1) time). Visibility tests then
reduce to "did this transaction commit with a WAL record located before
the specified position?".
Why not just cache a "reference snapshot" near the WAL writer, and maybe
also save it at some interval in WAL in case you ever need to restore an
old snapshot at some WAL position for things like time travel?
It may be cheaper lock-wise not to update the reference snapshot at each commit,
but to keep latest saved snapshot and a chain of transactions
committed / aborted since. This means that when reading the snapshot you
read the current "saved snapshot" and then apply the list of commits.
Yeah, interesting idea. I thought about that. You'd need not only
the list of commits but also the list of XIDs that had been published,
since the commits have to be removed from the snapshot and the
newly-published XIDs have to be added to it (in case they commit later
while the snapshot is still in use).
You can imagine doing this with a pair of buffers. You write a
snapshot into the beginning of the first buffer and then write each
XID that is published or commits into the next slot in the array.
When the buffer is filled up, the next process that wants to publish
an XID or commit scans through the array and constructs a new snapshot
that compacts away all the begin/commit pairs and writes it into the
second buffer, and all new snapshots are taken there. When that
buffer fills up you flip back to the first one. Of course, you need
some kind of synchronization to make sure that you don't flip back to
the first buffer while some laggard is still using it to construct a
snapshot that he started taking before you flipped to the second one,
but maybe that could be made light-weight enough not to matter.
I am somewhat concerned that this approach might lead to a lot of
contention over the snapshot buffers.
My hope was that this contention would be the same as simply writing
the WAL buffers currently, and thus largely hidden by the current
WAL-writing sync mechanism.
It really covers just the part which writes commit records to WAL, as
non-commit WAL records don't participate in snapshot updates.
Writing WAL is already a single point which needs locks or other kind of
synchronization. This will stay with us at least until we start
supporting multiple WAL streams, and even then we will need some
synchronisation between those.
In particular, the fact that
you have to touch shared cache lines both to advertise a new XID and
when it gets committed seems less than ideal.
Every commit record writer should do this as part of writing the commit
record. And as mostly you still want the latest snapshot, why not just
update the snapshot as part of the commit/abort?
Do we need the ability for fast "recent snapshots" at all?
One thing that's kind
of interesting about the "commit sequence number" approach is that -
as far as I can tell - it doesn't require new XIDs to be advertised
anywhere at all. You don't have to worry about overflowing the
subxids[] array because it goes away altogether. The commit sequence
number itself is going to be a contention hotspot, but at least it's
small and fixed-size.Another concern I have with this approach is - how large do you make
the buffers? If you make them too small, then you're going to have to
regenerate the snapshot frequently, which will lead to the same sort
of lock contention we have today - no one can commit while the
snapshot is being regenerated. On the other hand, if you make them
too big, then deriving a snapshot gets slow. Maybe there's some way
to make it work, but I'm afraid it might end up being yet another
arcane thing the tuning of which will become a black art among
hackers...
--
-------
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/
On Thu, Jul 28, 2011 at 10:17 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
My hope was that this contention would be the same as simply writing
the WAL buffers currently, and thus largely hidden by the current
WAL-writing sync mechanism.
It really covers just the part which writes commit records to WAL, as
non-commit WAL records don't participate in snapshot updates.
I'm confused by this, because I don't think any of this can be done
when we insert the commit record into the WAL stream. It has to be
done later, at the time we currently remove ourselves from the
ProcArray. Those things need not happen in the same order, as I noted
in my original post.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Jul 28, 2011 at 10:17 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
My hope was that this contention would be the same as simply writing
the WAL buffers currently, and thus largely hidden by the current
WAL-writing sync mechanism.
It really covers just the part which writes commit records to WAL, as
non-commit WAL records don't participate in snapshot updates.
I'm confused by this, because I don't think any of this can be done
when we insert the commit record into the WAL stream. It has to be
done later, at the time we currently remove ourselves from the
ProcArray. Those things need not happen in the same order, as I noted
in my original post.
But should we rethink that? Your point that hot standby transactions on
a slave could see snapshots that were impossible on the parent was
disturbing. Should we look for a way to tie "transaction becomes
visible" to its creation of a commit WAL record? I think the fact that
they are not an indivisible operation is an implementation artifact, and
not a particularly nice one.
regards, tom lane
On Thu, 2011-07-28 at 10:23 -0400, Robert Haas wrote:
On Thu, Jul 28, 2011 at 10:17 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
My hope was that this contention would be the same as simply writing
the WAL buffers currently, and thus largely hidden by the current
WAL-writing sync mechanism.
It really covers just the part which writes commit records to WAL, as
non-commit WAL records don't participate in snapshot updates.
I'm confused by this, because I don't think any of this can be done
when we insert the commit record into the WAL stream. It has to be
done later, at the time we currently remove ourselves from the
ProcArray. Those things need not happen in the same order, as I noted
in my original post.
The update to the stored snapshot needs to happen at the moment when the
WAL record is considered to be "on stable storage", so the "current
snapshot" update presumably can be done by the same process which forces
it to stable storage, with the same contention pattern that applies to
writing WAL records, no?
If the problem is with a backend which requested an "async commit", then
it is free to apply its additional local commit changes from its own
memory if the global latest snapshot disagrees with it.
--
-------
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/
Hannu Krosing <hannu@2ndQuadrant.com> writes:
On Thu, 2011-07-28 at 10:23 -0400, Robert Haas wrote:
I'm confused by this, because I don't think any of this can be done
when we insert the commit record into the WAL stream.
The update to the stored snapshot needs to happen at the moment when the
WAL record is considered to be "on stable storage", so the "current
snapshot" update presumably can be done by the same process which forces
it to stable storage, with the same contention pattern that applies to
writing WAL records, no?
No. There is no reason to tie this to fsyncing WAL. For purposes of
other currently-running transactions, the commit can be considered to
occur at the instant the commit record is inserted into WAL buffers.
If we crash before that makes it to disk, no problem, because nothing
those other transactions did will have made it to disk either. The
advantage of defining it that way is you don't have weirdly different
behaviors for sync and async transactions.
regards, tom lane
On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Jul 28, 2011 at 10:17 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
My hope was that this contention would be the same as simply writing
the WAL buffers currently, and thus largely hidden by the current
WAL-writing sync mechanism.
It really covers just the part which writes commit records to WAL, as
non-commit WAL records don't participate in snapshot updates.
I'm confused by this, because I don't think any of this can be done
when we insert the commit record into the WAL stream. It has to be
done later, at the time we currently remove ourselves from the
ProcArray. Those things need not happen in the same order, as I noted
in my original post.
But should we rethink that? Your point that hot standby transactions on
a slave could see snapshots that were impossible on the parent was
disturbing. Should we look for a way to tie "transaction becomes
visible" to its creation of a commit WAL record? I think the fact that
they are not an indivisible operation is an implementation artifact, and
not a particularly nice one.
Well, I agree with you that it isn't especially nice, but it seems
like a fairly intractable problem. Currently, the standby has no way
of knowing in what order the transactions became visible on the
master. Unless we want to allow only SR and not log shipping, the
only way to communicate that information would be to WAL log it.
Aside from the expense, what do we do if XLogInsert() fails, given
that we've already committed?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 2011-07-28 at 10:45 -0400, Tom Lane wrote:
Hannu Krosing <hannu@2ndQuadrant.com> writes:
On Thu, 2011-07-28 at 10:23 -0400, Robert Haas wrote:
I'm confused by this, because I don't think any of this can be done
when we insert the commit record into the WAL stream.
The update to the stored snapshot needs to happen at the moment when the WAL
record is considered to be "on stable storage", so the "current
snapshot" update presumably can be done by the same process which forces
it to stable storage, with the same contention pattern that applies to
writing WAL records, no?
No. There is no reason to tie this to fsyncing WAL. For purposes of
other currently-running transactions, the commit can be considered to
occur at the instant the commit record is inserted into WAL buffers.
If we crash before that makes it to disk, no problem, because nothing
those other transactions did will have made it to disk either.
Agreed. Actually figured it out right after pushing send :)
The
advantage of defining it that way is you don't have weirdly different
behaviors for sync and async transactions.
My main point was that we already do synchronization when writing WAL;
why not piggyback on this to also update the latest snapshot?
--
-------
Hannu Krosing
PostgreSQL (Infinite) Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/
On Thu, Jul 28, 2011 at 11:10 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
My main point was that we already do synchronization when writing WAL;
why not piggyback on this to also update the latest snapshot?
Well, one problem is that it would break sync rep.
Another problem is that pretty much the last thing I want to do is
push more work under WALInsertLock. Based on the testing I've done so
far, it seems like WALInsertLock, ProcArrayLock, and CLogControlLock
are the main bottlenecks here. I'm focusing on ProcArrayLock and
CLogControlLock right now, but I am pretty well convinced that
WALInsertLock is going to be the hardest nut to crack, so putting
anything more under there seems like it's going in the wrong
direction. IMHO, anyway.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 2011-07-28 at 17:10 +0200, Hannu Krosing wrote:
On Thu, 2011-07-28 at 10:45 -0400, Tom Lane wrote:
Hannu Krosing <hannu@2ndQuadrant.com> writes:
On Thu, 2011-07-28 at 10:23 -0400, Robert Haas wrote:
I'm confused by this, because I don't think any of this can be done
when we insert the commit record into the WAL stream.
The update to the stored snapshot needs to happen at the moment when the WAL
record is considered to be "on stable storage", so the "current
snapshot" update presumably can be done by the same process which forces
it to stable storage, with the same contention pattern that applies to
writing WAL records, no?
No. There is no reason to tie this to fsyncing WAL. For purposes of
other currently-running transactions, the commit can be considered to
occur at the instant the commit record is inserted into WAL buffers.
If we crash before that makes it to disk, no problem, because nothing
those other transactions did will have made it to disk either.
Agreed. Actually figured it out right after pushing send :)
The
advantage of defining it that way is you don't have weirdly different
behaviors for sync and async transactions.

My main point was that we already do synchronization when writing WAL,
why not piggyback on this to also update the latest snapshot.
So the basic design could be a "sparse snapshot", consisting of xmin,
xmax, and running_txids[numbackends], where each backend manages its own
slot in running_txids - it sets a txid when acquiring one and nulls it at
commit, possibly advancing xmin if xmin == mytxid. As an xmin update
requires a full scan of running_txids, it is also a good time to update
xmax - there is no need to advance xmax when "inserting" your next txid,
so you don't need to lock anything at insert time.

The valid xmax is still computed when getting the snapshot.

Hmm, probably no need to store xmin and xmax at all.

It needs some further analysis to figure out whether doing it this way
without any locks can produce any relevantly bad snapshots.

Maybe you still need one spinlock + a memcpy of running_txids to local
memory to get a snapshot.

Also, as the running_txids array is global, it may need to be made even
sparser to minimise cache-line collisions. That needs to be a tuning
decision between cache conflicts and speed of the memcpy.
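The cache-line concern above can be illustrated by padding each backend's slot out to a full cache line, so that one backend's insert/clear never invalidates another backend's line. Everything below (names, sizes, the 64-byte line assumption) is a hypothetical single-process sketch, not PostgreSQL code:

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE   64   /* assumed cache-line size */
#define MAX_BACKENDS 8    /* stand-in for numbackends */

/* One slot per backend, padded so two backends never share a cache
 * line; trades memory (and memcpy cost at snapshot time) for fewer
 * cross-CPU invalidations when a slot is set or cleared. */
typedef struct
{
    volatile uint32_t xid;                      /* 0 = no transaction running */
    char pad[CACHE_LINE - sizeof(uint32_t)];
} TxidSlot;

static TxidSlot running_txids[MAX_BACKENDS];

static void slot_set(int backend, uint32_t xid) { running_txids[backend].xid = xid; }
static void slot_clear(int backend)             { running_txids[backend].xid = 0; }

/* Snapshot: copy just the live xids out of the padded array. */
static int snapshot(uint32_t *out)
{
    int n = 0;
    for (int i = 0; i < MAX_BACKENDS; i++)
        if (running_txids[i].xid != 0)
            out[n++] = running_txids[i].xid;
    return n;
}
```

The padding makes the array CACHE_LINE times larger, which is exactly the "cache conflicts vs. speed of the memcpy" tuning trade-off mentioned above.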
--
-------
Hannu Krosing
PostgreSQL (Infinite) Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/
On Thu, 2011-07-28 at 11:15 -0400, Robert Haas wrote:
On Thu, Jul 28, 2011 at 11:10 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
My main point was that we already do synchronization when writing WAL,
why not piggyback on this to also update the latest snapshot.

Well, one problem is that it would break sync rep.

Can you elaborate in what way it "breaks" sync rep?
Another problem is that pretty much the last thing I want to do is
push more work under WALInsertLock. Based on the testing I've done so
far, it seems like WALInsertLock, ProcArrayLock, and CLogControlLock
are the main bottlenecks here. I'm focusing on ProcArrayLock and
CLogControlLock right now, but I am pretty well convinced that
WALInsertLock is going to be the hardest nut to crack, so putting
anything more under there seems like it's going in the wrong
direction.
Probably it is not just WALInsertLock, but the fact that we have just
one WAL. It can become a bottleneck once we have a significant number of
processors fighting to write to a single WAL.
IMHO, anyway.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
But should we rethink that? Your point that hot standby transactions on
a slave could see snapshots that were impossible on the parent was
disturbing. Should we look for a way to tie "transaction becomes
visible" to its creation of a commit WAL record? I think the fact that
they are not an indivisible operation is an implementation artifact, and
not a particularly nice one.
Well, I agree with you that it isn't especially nice, but it seems
like a fairly intractable problem. Currently, the standby has no way
of knowing in what order the transactions became visible on the
master.
Right, but if the visibility order were *defined* as the order in which
commit records appear in WAL, that problem neatly goes away. It's only
because we have the implementation artifact that "set my xid to 0 in the
ProcArray" is decoupled from inserting the commit record that there's
any difference.
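Defining the visibility order as the order of commit records in WAL amounts to stamping each commit with a monotonically increasing commit sequence number (CSN) at WAL-insert time, and making a snapshot just the current CSN - close to the 64-bit counter idea discussed earlier in the thread. A minimal single-threaded sketch, with all names hypothetical (this is not the actual PostgreSQL API):

```c
#include <assert.h>
#include <stdint.h>

#define MAX_XID 1024  /* toy bound; a real design needs an SLRU-like structure */

static uint64_t next_csn = 1;           /* advanced while holding WALInsertLock */
static uint64_t commit_csn[MAX_XID];    /* 0 = not committed */

/* Called at the moment the commit record is inserted into WAL buffers. */
static uint64_t record_commit(uint32_t xid)
{
    commit_csn[xid] = next_csn++;
    return commit_csn[xid];
}

/* Taking a snapshot is O(1): just read the current counter. */
static uint64_t take_snapshot(void)
{
    return next_csn;
}

/* A transaction is visible iff its commit record precedes the snapshot. */
static int xid_visible(uint32_t xid, uint64_t snapshot_csn)
{
    return commit_csn[xid] != 0 && commit_csn[xid] < snapshot_csn;
}
```

Since a standby replays the same WAL, master and standby would agree on which of two commits a given snapshot can see, which is exactly the property the current ProcArray-based scheme lacks.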
regards, tom lane
On Wed, 2011-07-27 at 22:51 -0400, Robert Haas wrote:
On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wonder whether we could do something involving WAL properties --- the
current tuple visibility logic was designed before WAL existed, so it's
not exploiting that resource at all. I'm imagining that the kernel of a
snapshot is just a WAL position, ie the end of WAL as of the time you
take the snapshot (easy to get in O(1) time). Visibility tests then
reduce to "did this transaction commit with a WAL record located before
the specified position?". You'd need some index datastructure that made
it reasonably cheap to find out the commit locations of recently
committed transactions, where "recent" means "back to recentGlobalXmin".
That seems possibly do-able, though I don't have a concrete design in
mind.

I was mulling this idea over some more (the same ideas keep floating
back to the top...). I don't think an LSN can actually work, because
there's no guarantee that the order in which the WAL records are
emitted is the same order in which the effects of the transactions
become visible to new snapshots. For example:

1. Transaction A inserts its commit record, flushes WAL, and begins
waiting for sync rep.
2. A moment later, transaction B sets synchronous_commit=off, inserts
its commit record, requests a background WAL flush, and removes itself
from the ProcArray.
3. Transaction C takes a snapshot.
It is transaction A here which is acting badly - it should also remove
itself from the ProcArray right after it inserts its commit record, as
for everybody else except the client app of transaction A it is
committed at this point. It just can't report back to the client before
getting confirmation that it is actually sync-repped (or locally written
to stable storage).
At least at the point of consistent snapshots the right sequence should
be:

1) insert the commit record into WAL
2) remove yourself from the ProcArray (or use some other means to
declare that your transaction is no longer running)
3) if so configured, wait for the WAL flush to stable storage and/or
Sync Rep confirmation
Based on this, let me suggest a simple snapshot cache mechanism.

A simple snapshot cache mechanism
=================================

Have an array of running transactions, with one slot per backend:

txid running_transactions[max_connections];

There are exactly 3 operations on this array:
1. insert backend's running transaction id
------------------------------------------

This is done at the moment of acquiring your transaction id from the
system, and synchronized by the same mechanism as getting the
transaction id:

running_transactions[my_backend] = current_transaction_id
2. remove backend's running transaction id
------------------------------------------

This is done at the moment of committing or aborting the transaction,
again synchronized by the write-commit-record mechanism:

running_transactions[my_backend] = NULL

This should be the first thing after inserting the WAL commit record.
3. getting a snapshot
---------------------

memcpy() running_transactions to local memory, then construct a
snapshot.

It may be that you need to protect all 3 operations with a single
spinlock; if so, then I'd propose the same spinlock used when getting
your transaction id (and placing the array near where the latest
transaction id is stored, so they share a cache line).

But it is also possible that you can get logically consistent snapshots
by protecting only some of the ops. For example, if you protect only
insert and get-snapshot, then the worst that can happen is that you get
a snapshot that is a few commits older than what you'd get with full
locking, and that may well be OK for all real uses.
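The three operations could be sketched like this in C, with a C11 atomic_flag standing in for the spinlock. Names and sizes are hypothetical, and this is a single-process illustration rather than shared-memory PostgreSQL code; 0 plays the role of the mail's NULL:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define MAX_CONNECTIONS 16

static uint32_t running_transactions[MAX_CONNECTIONS]; /* 0 = no xact running */
static atomic_flag snap_lock = ATOMIC_FLAG_INIT;

static void spin_lock(void)   { while (atomic_flag_test_and_set(&snap_lock)) /* spin */; }
static void spin_unlock(void) { atomic_flag_clear(&snap_lock); }

/* 1. insert: done when the backend acquires its transaction id. */
static void txid_insert(int my_backend, uint32_t xid)
{
    spin_lock();
    running_transactions[my_backend] = xid;
    spin_unlock();
}

/* 2. remove: first thing after inserting the WAL commit record.
 * Left unlocked here - the "protect only some ops" variant above. */
static void txid_remove(int my_backend)
{
    running_transactions[my_backend] = 0;
}

/* 3. snapshot: memcpy the array under the lock, then compute xmin
 * locally from the copy.  Returns UINT32_MAX if nothing is running. */
static uint32_t get_snapshot(uint32_t *snap)
{
    uint32_t xmin = UINT32_MAX;

    spin_lock();
    memcpy(snap, running_transactions, sizeof(running_transactions));
    spin_unlock();

    for (int i = 0; i < MAX_CONNECTIONS; i++)
        if (snap[i] != 0 && snap[i] < xmin)
            xmin = snap[i];
    return xmin;
}
```

With remove left outside the lock, a concurrent snapshot can at worst include a transaction that committed an instant earlier, i.e. it sees a snapshot "a few commits older" than full locking would give - which is the trade-off described above.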
--
-------
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/
On Thu, 2011-07-28 at 11:57 -0400, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
But should we rethink that? Your point that hot standby transactions on
a slave could see snapshots that were impossible on the parent was
disturbing. Should we look for a way to tie "transaction becomes
visible" to its creation of a commit WAL record? I think the fact that
they are not an indivisible operation is an implementation artifact, and
not a particularly nice one.

Well, I agree with you that it isn't especially nice, but it seems
like a fairly intractable problem. Currently, the standby has no way
of knowing in what order the transactions became visible on the
master.

Right, but if the visibility order were *defined* as the order in which
commit records appear in WAL, that problem neatly goes away. It's only
because we have the implementation artifact that "set my xid to 0 in the
ProcArray" is decoupled from inserting the commit record that there's
any difference.
Yes, as I explained in another e-mail, the _only_ one for whom the
transaction is not yet committed is the waiting backend itself. For all
others it should show as committed the moment after the WAL record is
written.

It's kind of a "local two-phase commit" thing :)
regards, tom lane