Proposal for CSN based snapshots
Given the recent ideas being thrown around about changing how freezing
and the clog are handled, and about MVCC catalog access, I thought I
would write out the ideas I have had about speeding up snapshots, in
case there is an interesting tie-in with the current discussions.
To refresh your memory, the basic idea is to change visibility
determination to be based on a commit sequence number (CSN for short)
- an 8 byte number incremented on every commit, representing the total
ordering of commits. To take a snapshot in this scheme you only need
to know the value of the last assigned CSN: all transactions with a
CSN less than or equal to that number were committed at the time of
the snapshot, everything above wasn't committed. Besides speeding up
snapshot taking, this scheme can also be a building block for
consistent snapshots in a multi-master environment with minimal
communication. Google's Spanner database uses snapshots based on a
similar scheme.
The main tricky part about this scheme is finding the CSN that was
assigned to each XID in the face of arbitrarily long transactions and
snapshots, using only a bounded amount of shared memory. The secondary
tricky part is doing this in a way that doesn't need locks for
visibility determination, as that would kill any hope of a performance
gain.
We need to keep around CSN slots for all currently running
transactions and CSN slots of transactions that are concurrent with
any active CSN based snapshot (xid.csn > min(snapshot.csn)). To do
this I propose the following data structures for the XID-to-CSN
mapping. For the most recently assigned XIDs there is a ringbuffer of
slots that contain the CSN values of the XIDs, or special CSN values
for transactions that haven't completed yet, aborted transactions or
subtransactions. I call this the dense CSN map. Looking up the CSN of
an XID in the ringbuffer is just trivial direct indexing into the
ring buffer.
For long running transactions the ringbuffer may do a full circle
before a transaction commits. Such CSN slots, along with slots that
are needed for older snapshots, are evicted from the dense buffer into
a sorted array of XID-CSN pairs, the sparse mapping. For locking
purposes there are two sparse buffers, one active and the other
inactive (more on that later). Looking up the CSN value of an XID
that has been evicted into the sparse mapping is a matter of
performing a binary search to find the slot and reading the CSN value.
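As a rough illustration of the two lookup paths just described, here is a C sketch; the names (denseCSNMap, XidCsnPair) and the toy sizes are assumptions, not from an actual patch:

```c
#include <stdint.h>
#include <stddef.h>

#define DENSE_MAP_SIZE 16		/* toy size; see the sizing section */

typedef struct
{
	uint32_t	xid;
	uint64_t	csn;
} XidCsnPair;

static uint64_t denseCSNMap[DENSE_MAP_SIZE];

/* Dense map lookup: trivial direct indexing into the ring buffer. */
static uint64_t
dense_lookup(uint32_t xid)
{
	return denseCSNMap[xid % DENSE_MAP_SIZE];
}

/*
 * Sparse map lookup: binary search in a sorted array of xid-csn
 * pairs.  Returns 0 when the xid has no slot, in which case the
 * caller would fall back to the clog.
 */
static uint64_t
sparse_lookup(const XidCsnPair *map, size_t n, uint32_t xid)
{
	size_t		lo = 0,
				hi = n;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (map[mid].xid < xid)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (lo < n && map[lo].xid == xid)
		return map[lo].csn;
	return 0;
}
```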
Because snapshots can remain active for an unbounded amount of time
and there can be an unbounded number of active snapshots, even the
sparse mapping can fill up. To handle that case, each backend
advertises its lowest snapshot number, csn_min. When values need to be
evicted from the sparse mapping, they are evicted in CSN order and
written into the CSN log - a series of CSN-XID pairs. Backends that
may still be concerned about those values are then notified that
values they might need have been evicted. Backends with invalidated
snapshots can then convert their snapshots to regular
list-of-concurrent-XIDs snapshots at their leisure.
To convert a CSN based snapshot to an XID based one, a backend would
first scan the shared memory structures for XIDs up to snapshot.xmax
whose CSNs are concurrent with the snapshot and insert those XIDs into
the snapshot, then read in the CSN log starting from the snapshot's
CSN, looking for XIDs less than the snapshot's xmax. After this the
snapshot can be handled like current snapshots are handled.
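The conversion filter described above can be sketched as follows; the snapshot layout and all names here are illustrative assumptions. The same filter serves both the shared-memory scan and the CSN log scan: keep any xid below xmax whose commit CSN is later than the snapshot's CSN (i.e. the transaction is concurrent with the snapshot).

```c
#include <stdint.h>
#include <stddef.h>

typedef struct
{
	uint32_t	xid;
	uint64_t	csn;
} XidCsnEntry;

typedef struct
{
	uint64_t	csn;		/* CSN the snapshot was taken at */
	uint32_t	xmax;
	uint32_t	xip[64];	/* collected concurrent xids */
	int			xcnt;
} ConvertedSnapshot;

static void
collect_concurrent(ConvertedSnapshot *snap,
				   const XidCsnEntry *entries, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		/* no bounds check on xip for brevity */
		if (entries[i].xid < snap->xmax && entries[i].csn > snap->csn)
			snap->xip[snap->xcnt++] = entries[i].xid;
	}
}
```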
A more detailed view of synchronization primitives required for common
operations follows.
Taking a new snapshot
---------------------
Taking a CSN based snapshot under this scheme would consist of reading
xmin, csn and xmax from global variables, unlocked and in that order,
with read barriers between each load. If this is our oldest snapshot
we write our csn_min into pgproc, issue a full memory barrier and
check from a global variable that the CSN we used is still guaranteed
not to be evicted (exceedingly unlikely, but it cannot be ruled out).
The read barrier between reading xmin and csn is needed so that the
guarantee holds that no transaction with tx.xid < ss.xmin could have
committed with tx.csn >= ss.csn, so xmin can be used to safely exclude
CSN lookups. The read barrier between reading csn and xmax is needed
to guarantee that if tx.xid >= ss.xmax, then it's known that tx.csn >=
ss.csn. On the write side, there needs to be at least one full memory
barrier between GetNewTransactionId updating nextXid and
CommitTransaction updating nextCsn, which is quite easily satisfied.
Updating the global xmin without races is slightly trickier but doable.
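A minimal C11-atomics sketch of this read sequence, with acquire loads standing in for the read barriers; all variable names and the recheck condition are assumptions, and the "is this our oldest snapshot" test is omitted:

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t globalXmin;
static _Atomic uint64_t nextCsn;
static _Atomic uint32_t nextXid;
static _Atomic uint64_t csnEvictionHorizon;	/* oldest CSN still safe */

typedef struct
{
	uint32_t	xmin;
	uint64_t	csn;
	uint32_t	xmax;
} CsnSnapshot;

/* Returns 1 on success, 0 when the CSN may have been evicted (retry). */
static int
take_csn_snapshot(CsnSnapshot *snap, _Atomic uint64_t *my_csn_min)
{
	/* read xmin, then csn, then xmax, ordered by acquire loads */
	snap->xmin = atomic_load_explicit(&globalXmin, memory_order_acquire);
	snap->csn = atomic_load_explicit(&nextCsn, memory_order_acquire);
	snap->xmax = atomic_load_explicit(&nextXid, memory_order_acquire);

	/* advertise our oldest snapshot, full barrier, then recheck */
	atomic_store_explicit(my_csn_min, snap->csn, memory_order_seq_cst);
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load(&csnEvictionHorizon) > snap->csn)
		return 0;				/* exceedingly unlikely: must retry */
	return 1;
}
```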
Checking visibility
-------------------
XidInMVCCSnapshot will need to switch on the type of snapshot used
after checking xmin and xmax. For list-of-XIDs snapshots the current
logic applies. CSN based snapshots need to first do an unlocked read
of a backend specific global variable to check if we have been
instructed to convert our snapshots to XID based ones; if so, the
snapshot is converted (more on that separately). Otherwise we look up
the CSN of the xid.
To map an xid to a csn, the value in the csn slot corresponding to the
xid is first read from the dense map (just a plain denseCSNMap[xid %
denseMapSize]). Then, after issuing a read barrier, the dense map
eviction horizon is checked to verify that the value we read was in
fact valid. If it was, it can be compared to the snapshot's csn value
to get the result; if not, we continue to the sparse map.
To check the sparse map, we read the sparse map version counter and
use it to determine which of the two maps is currently active (an
optimistic locking scheme). We use linear or binary search to look up
the slot for the XID. If the XID is not found we know that it wasn't
committed after the snapshot was taken; we then have to check the clog
to see whether it was committed at all. Otherwise we compare the value
against our snapshot's csn to determine visibility. Finally, after
issuing a read barrier, we check the sparse map version counter to see
if the result is valid.
Checking against the dense map can omit clog checks, as we can use
special values to signify aborted and uncommitted transactions. Better
yet, we can defer clog updates until the corresponding slots are
evicted from the dense map.
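The dense-then-sparse lookup with the optimistic version check might look roughly like this; every name here is an illustrative assumption, and the special-value handling for aborted/uncommitted slots is left out:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DENSE_SIZE 16
#define SPARSE_SIZE 8

typedef struct
{
	uint32_t	xid;
	uint64_t	csn;
} SparseSlot;

static _Atomic uint64_t denseMap[DENSE_SIZE];
static _Atomic uint32_t denseEvictionHorizon;	/* oldest xid still dense */
static _Atomic uint32_t sparseVersion;	/* low bit picks the active buffer */
static SparseSlot sparseMaps[2][SPARSE_SIZE];
static int	sparseCounts[2];

/* Returns false when the xid has no slot and the caller must check clog. */
static bool
xid_to_csn(uint32_t xid, uint64_t *csn)
{
	/* optimistically read the dense slot ... */
	uint64_t	val = atomic_load_explicit(&denseMap[xid % DENSE_SIZE],
										   memory_order_acquire);

	/* ... then verify it had not been evicted underneath us */
	if (xid >= atomic_load(&denseEvictionHorizon))
	{
		*csn = val;
		return true;
	}

	/* optimistic read of the active sparse map, retry on concurrent swap */
	for (;;)
	{
		uint32_t	ver = atomic_load_explicit(&sparseVersion,
											   memory_order_acquire);
		const SparseSlot *map = sparseMaps[ver % 2];
		int			n = sparseCounts[ver % 2];
		bool		found = false;

		for (int i = 0; i < n; i++)	/* linear search for brevity */
		{
			if (map[i].xid == xid)
			{
				*csn = map[i].csn;
				found = true;
				break;
			}
		}
		atomic_thread_fence(memory_order_acquire);
		if (atomic_load(&sparseVersion) == ver)
			return found;		/* nobody swapped buffers: result valid */
	}
}
```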
Assigning XIDs and evicting from CSN buffers
--------------------------------------------
XID assignment will need an additional step to allocate a CSN slot for
the transaction. If the dense map ring buffer has filled up, this
might require evicting some entries from the dense CSN map. For less
contention and better efficiency it would be a good idea to evict
larger blocks at a time; one eighth or one sixteenth of the dense map
at a time might be a good balance here. Eviction will be done under a
new CSNEvictionLock.
First we publish the point up to which we are evicting from the dense
map, to notify committing backends of the hazard. We then read the
current nextCSN value and publish it as the largest possible
global_csn_min value we can arrive at, so that a backend in the middle
of taking a snapshot, one that has fetched a CSN but hasn't yet
updated its procarray entry, will notice that it is at risk and will
retry. Using nextCSN is very conservative, but as far as I can see the
invalidations will be rare and cheap. We have to issue a full memory
barrier here so that either the snapshot taker sees our value or we
see its csn_min.
If there is enough space, we just use global_csn_min as the sparse map
eviction horizon. If there is not enough free space in the currently
active sparse map to guarantee that the dense map block will fit, we
scan both the active sparse array and the to-be-evicted block,
collecting the missing space's worth of xid-csn pairs with the
smallest CSN values into a heap, shrinking the heap for every xid-csn
pair we can omit because the transaction was aborted or because it is
visible to all active snapshots. When we finish scanning, if the heap
isn't empty, the largest value in the heap is the sparse map eviction
horizon.
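A sketch of the heap-based horizon selection, under the simplifying assumptions that no candidates get filtered out and that there are at least k candidates (n >= k); the function names are mine, purely for illustration:

```c
#include <stdint.h>
#include <stddef.h>

static void
sift_down(uint64_t *heap, size_t n, size_t i)
{
	for (;;)
	{
		size_t		l = 2 * i + 1,
					r = 2 * i + 2,
					big = i;

		if (l < n && heap[l] > heap[big])
			big = l;
		if (r < n && heap[r] > heap[big])
			big = r;
		if (big == i)
			break;
		uint64_t	tmp = heap[i];

		heap[i] = heap[big];
		heap[big] = tmp;
		i = big;
	}
}

/*
 * Keep the k smallest CSNs seen so far in a max-heap; after scanning
 * all candidates the heap's root (the largest of the k smallest) is
 * the eviction horizon.
 */
static uint64_t
eviction_horizon(const uint64_t *csns, size_t n, size_t k, uint64_t *heap)
{
	/* seed the heap with the first k candidates and heapify */
	for (size_t i = 0; i < k; i++)
		heap[i] = csns[i];
	for (size_t j = k / 2; j-- > 0;)
		sift_down(heap, k, j);

	/* replace the root whenever a smaller CSN comes along */
	for (size_t i = k; i < n; i++)
	{
		if (csns[i] < heap[0])
		{
			heap[0] = csns[i];
			sift_down(heap, k, 0);
		}
	}
	return heap[0];
}
```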
After arriving at the correct sparse map eviction horizon, we iterate
through the sparse map and the dense map block to be evicted, copying
all active or not-all-visible CSN slots to the inactive sparse map. We
also update clog for every committed transaction that we found in the
dense map. If we decided to evict some values, we write them to the
CSN log here and update the sparse map eviction horizon with the CSN
we arrived at. At this point the state in the current active sparse
map and the evictable dense map block are duplicated into the inactive
sparse map and CSN log. We now need to make the new state visible to
visibility checkers in a safe order, issuing a write barrier before
each step so the previous changes are visible:
* Notify all backends that have csn_min <= sparse map eviction horizon
that their snapshots are invalid and at what CSN log location they can
start to read to find concurrent xids.
* Increment sparse map version counter to switch sparse maps.
* Raise the dense map eviction horizon, to free up the dense map block.
* Overwrite the dense map block with special UncommittedCSN values
that are tagged with their XID (more on that later).
At this point we can consider the block cleared up for further use.
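The publication sequence above can be sketched as follows, with release stores standing in for the write barriers; the XID-tagged UncommittedCSN encoding and all names are assumptions, and the backend notification step is omitted:

```c
#include <stdatomic.h>
#include <stdint.h>

#define DENSE_SIZE 16
#define UNCOMMITTED_CSN(xid) ((UINT64_C(1) << 63) | (uint64_t) (xid))

static _Atomic uint32_t sparseVersion;
static _Atomic uint32_t denseEvictionHorizon;
static _Atomic uint64_t denseMap[DENSE_SIZE];

static void
publish_eviction(uint32_t new_horizon,
				 uint32_t block_start, uint32_t block_end)
{
	/* (step 0, not shown: notify backends with invalidated snapshots) */

	/* switch the active sparse map */
	atomic_fetch_add_explicit(&sparseVersion, 1, memory_order_release);

	/* raise the dense map eviction horizon, freeing the evicted block */
	atomic_store_explicit(&denseEvictionHorizon, new_horizon,
						  memory_order_release);

	/* clear the block with XID-tagged UncommittedCSN values for reuse */
	for (uint32_t xid = block_start; xid < block_end; xid++)
		atomic_store_explicit(&denseMap[xid % DENSE_SIZE],
							  UNCOMMITTED_CSN(xid), memory_order_release);
}
```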
Because we don't want to lock shared structures for snapshotting, we
need to maintain a global xmin value. To do this we acquire a spinlock
on the global xmin value and check if it's empty; if no other
transaction is running we replace it with our xid.
At this point we know the minimum CSN of any unconverted snapshots, so
it's also a good time to clean up unnecessary CSN log.
Finally we are done, so we can release CSNEvictionLock and XIDGenLock.
Committing a transaction
------------------------
Most of the commit sequence remains the same, but where we currently
do ProcArrayEndTransaction, we will acquire a new LWLock protecting
the nextCSN value; I'll call it CSNGenLock for now. We first read the
dense array eviction horizon, then stamp all of our subtransaction
slots that are non-overflowed or still in the dense map with the
special SubcommittedCSN value, then stamp the top level xid with
nextCSN. We then issue a full memory barrier and check the
soon-to-be-evicted pointer into the dense map. If it overlaps with any
of the XIDs we just tagged, then the backend doing the eviction might
have missed our update; we have to wait for CSNEvictionLock to become
free, then go and restamp the relevant XIDs in the sparse map.
To stamp an XID with the commit CSN, we compare the XID to the dense
map eviction horizon. If the XID still maps to the dense array, we use
CAS to swap out UncommittedCSN(xid) for the value we need. If the CAS
fails, then between reading the dense array horizon and the actual
stamping, an eviction process replaced our slot with a newer one. (If
we didn't tag the slots with the XID value, we might accidentally
stamp another transaction's slot.) If the XID maps to the sparse
array, we have to take CSNEvictionLock so the sparse array doesn't get
swapped out underneath us, then look up the slot, stamp it and update
the clog before releasing the lock. Lock contention shouldn't be a
problem here as only long running transactions map to the sparse
array.
When done with the stamping we check the global xmin value; if it's
our xid, we grab the spinlock, scan the procarray for the next xmin
value, update it and release the lock.
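The CAS-based stamping might look like this; the XID-tagged UncommittedCSN encoding is an assumption, but it shows why the tagging matters: the CAS can only succeed against our own slot, so a concurrent eviction that recycled the slot makes the CAS fail instead of corrupting another transaction's slot.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DENSE_SIZE 16
#define UNCOMMITTED_CSN(xid) ((UINT64_C(1) << 63) | (uint64_t) (xid))

static _Atomic uint64_t denseMap[DENSE_SIZE];

/*
 * Returns false when an eviction won the race; the caller must then
 * restamp via the sparse map under CSNEvictionLock.
 */
static bool
stamp_commit_csn(uint32_t xid, uint64_t csn)
{
	uint64_t	expected = UNCOMMITTED_CSN(xid);

	return atomic_compare_exchange_strong(&denseMap[xid % DENSE_SIZE],
										  &expected, csn);
}
```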
Rolling back a transaction
--------------------------
Rolling back a transaction is basically the same as committing, but
the CSN slots need to be stamped with an AbortedCSN.
Subtransactions
---------------
Because of the limited size of the sparse array, we cannot keep open
CSN slots for all of the potentially unbounded number of
subtransactions there. I propose something similar to what is
currently done with PGPROC_MAX_CACHED_SUBXIDS. When we assign xids to
subtransactions above this limit, we tag them in the dense array with
a special OverflowedSubxidCSN value. When evicting subtransactions
from the dense array, non-overflowed subtransaction slots are handled
like regular slots; overflowed slots are discarded. We also keep track
of the lowest and the highest subxid that overflowed; the lowest
overflowed subxid is reset when, before an eviction, the highest
overflowed subxid is lower than the smallest xid in the sparse array
(i.e. we know that the XID region covered by the sparse array doesn't
contain any overflowed subxids). When constructing a regular snapshot
we can then detect that we don't have the full information about
subtransactions and correctly set the overflowed flag on the snapshot.
Similarly, visibility checks can omit subxid lookups for xids missing
from the sparse array when they know the xid can't be overflowed.
Prepared transactions
---------------------
Prepared transactions are handled basically like regular transactions:
when starting up with prepared transactions, they are inserted into
the sparse array, and when they are committed they get stamped with
CSNs and become visible as usual. We just need to account for them
when sizing the sparse array.
Crash safety and checkpoints
----------------------------
Because clog updating is delayed for transactions in the dense map,
checkpoints need to flush the dense array before writing out the clog.
Otherwise the data structures are fully transient and don't need any
special treatment.
Hot standby
-----------
I haven't yet worked out how CSN based snapshots best integrate with
hot standby. As far as I can tell, we could just use the current
KnownAssignedXidsGetAndSetXmin mechanism and get regular snapshots.
But I think there is an opportunity here to get rid of most of or all
of the KnownAssignedXids mechanism if we WAL log the CSNs that are
assigned to commits (or use a side channel as proposed by Robert). The
extra write is not great, but might not be a large problem after the
WAL insertion scaling work is done. Another option would be to use the
LSN of commit record as the next CSN value. The locking in that case
requires some further thought to guarantee that commits are stamped in
the same order as WAL records are inserted, without holding
WALInsertLock for too long. That seems doable by inserting committing
backends into a queue under WALInsertLock and then having them wait
for the transaction in front of them to commit once WALInsertLock has
been released.
Serializable transactions
-------------------------
I won't pretend to be familiar with SSI code, but as far as I can tell
serializable transactions don't need any modifications to work with
the CSN based snapshot scheme. There actually already is a commit
sequence number in the SSI code that could be replaced by the actual
CSN. IIRC one of the problems with serializable transactions on hot
standby was that the transaction visibility order on the standby is
different from the master. If we use CSNs for visibility on the
standby then we can actually provide an identical visibility order.
Required atomic primitives
--------------------------
Besides the copious number of memory barriers required for
correctness, we will need the following lockless primitives:
* 64bit atomic read
* 64bit atomic write
* 64bit CAS
Are there any supported platforms where it would be impossible to have
those? AFAIK everything from 32bit x86 through POWER and MIPS to ARM
can do it. If there are any platforms that can't handle 64bit atomics,
would it be ok to run on them with reduced concurrency/performance?
Sizing the CSN maps
-------------------
CSN maps need to be sized to accommodate the number of backends.
The dense array size should be picked so that most xids commit before
being evicted from the dense map, while the sparse array contains the
slots necessary for long running transactions and for long snapshots
not yet converted to XID based snapshots. I did a few quick
simulations to measure the dynamics. If we assume a power law
distribution of transaction lengths, and snapshots held for the full
length of transactions with no think time, then 16 slots per backend
is enough to make the probability of eviction before commit less than
1e-6, and the probability of being needed at eviction due to a
snapshot about 1:10000. In the real world very long transactions are
more likely than the model predicts, but at least the normal variation
is mostly buffered by this size. 16 slots = 128 bytes per backend,
which ends up as a 12.5KB buffer for the default value of 100
backends, or 125KB for 1000 backends.
The sparse buffer needs to be at least big enough to fit CSN slots for
the xids of all active transactions and non-overflowed
subtransactions. At the current PGPROC_MAX_CACHED_SUBXIDS=64, the
minimum comes out at 16 bytes * (64 + 1) slots * 100 backends =
101.6KB per buffer, or 203KB total in the default configuration.
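The sizing arithmetic above can be double-checked with a couple of helper functions mirroring the text's constants (16 dense CSN slots of 8 bytes per backend; 16-byte alignment-padded sparse slots, one per xid plus cached subxids):

```c
#include <stddef.h>
#include <stdint.h>

static size_t
dense_map_bytes(int backends)
{
	return (size_t) backends * 16 * sizeof(uint64_t);
}

static size_t
sparse_buffer_bytes(int backends, int cached_subxids)
{
	return (size_t) backends * (cached_subxids + 1) * 16;
}
```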
Performance discussion
----------------------
Taking a snapshot is extremely cheap in this scheme; I think the cost
will mostly be in publishing the csn_min and rechecking validity after
it. Taking snapshots in the shadow of another snapshot (often the case
for the proposed MVCC catalog access) will be even cheaper, as we
don't have to advertise the new snapshot. The delayed XID based
snapshot construction should be rare, but even when it isn't, the
costs are similar to GetSnapshotData, and we don't need to hold a
lock. Visibility checking will also be simpler, as for the cases where
the snapshot is covered by the dense array it only requires two memory
lookups and comparisons.
The main extra work for writing transactions is the eviction process.
The amortized worst case extra work per xid is dominated by copying
the sparse buffer back and forth and spooling out the CSN log. We need
to write out 16 bytes per xid and copy (sparse buffer size / eviction
block size) sparse buffer slots. If we evict 1/8th of the dense map at
each eviction, this works out to 520 bytes copied per xid assigned -
about the same ballpark as GetSnapshotData is now.
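The 520 bytes/xid figure checks out with the same constants as in the sizing section, assuming 100 backends: a 104000-byte sparse buffer copied once per eviction, and 1/8th of the 1600-slot dense map (200 xids) freed per eviction:

```c
#include <stddef.h>

/* amortized sparse-buffer bytes copied per assigned xid */
static size_t
bytes_copied_per_xid(size_t sparse_buffer_size, size_t eviction_block_xids)
{
	return sparse_buffer_size / eviction_block_xids;
}
```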
With the described scheme, long snapshots will cause the sparse buffer
to quickly fill up and then spool out into the CSN log until the
backend wakes up, converts its snapshots and releases the eviction
process from writing the log. It would be more efficient to be
slightly more proactive and tell backends to convert their snapshots
earlier, so that if they manage to convert in time we can avoid
writing any CSN log.
I'm not particularly pleased about the fact that both xid assignment
and committing can block behind the eviction lock. On the other hand,
plugging in some numbers, with 100 backends doing 100k xid
assignments/s the lock will be acquired 1000 times per second for less
than 100us at a time. The contention might not be bad enough to
warrant extra complexity to deal with it. If it does happen to be a
problem, then I have some ideas how to cope with it.
Having to do CSN log writes while holding an LWLock might not be the
best of ideas. To combat that, we can either add one more buffer so
that the actual write syscall can happen after we release
CSNEvictionLock, or we can reuse the SLRU machinery to handle this.
Overall it looks to be a big win for typical workloads. Workloads
using large amounts of subtransactions might not be as well off, but I
doubt there will be a regression.
At this point I don't see any major issues with this approach. If the
ensuing discussion doesn't find any major showstoppers then I will
start to work on a patch bit-by-bit. It might take a while though as
my free hacking time has been severely cut down since we have a small
one in the family.
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Ants,
On 06/07/2013 12:42 AM, Ants Aasma wrote:
Given the recent ideas being thrown about changing how freezing and
clog is handled and MVCC catalog access I thought I would write out
the ideas that I have had about speeding up snapshots in case there is
an interesting tie in with the current discussions.
Thanks for this write-up.
To refresh your memory the basic idea is to change visibility
determination to be based on a commit sequence number (CSN for short)
- a 8 byte number incremented on every commit representing the total
ordering of commits. To take a snapshot in this scheme you only need
to know the value of last assigned CSN, all transactions with XID less
You mean with a *CSN* less than or equal to that number? There's
certainly no point in comparing a CSN with an XID.
than or equal to that number were commited at the time of the
snapshots, everything above wasn't committed. Besides speeding up
snapshot taking, this scheme can also be a building block for
consistent snapshots in a multi-master environment with minimal
communication.
Agreed. Postgres-R uses a CommitOrderId, which is very similar in
concept, for example.
The main tricky part about this scheme is finding the CSN that was
assigned to each XIDs in face of arbitrarily long transactions and
snapshots using only a bounded amount of shared memory. The secondary
tricky part is doing this in a way that doesn't need locks for
visibility determination as that would kill any hope of a performance
gain.
Agreed. Para-phrasing, you can also say that CSNs can only ever identify
completed transactions, where XIDs can be used to identify transactions
in progress as well.
[ We cannot possibly get rid of XIDs entirely for that reason. And the
mapping between CSNs and XIDs has some additional cost. ]
We need to keep around CSN slots for all currently running
transactions and CSN slots of transactions that are concurrent with
any active CSN based snapshot (xid.csn > min(snapshot.csn)). To do
this I propose the following datastructures to do the XID-to-CSN
mapping. For most recently assigned XIDs there is a ringbuffer of
slots that contain the CSN values of the XIDs or special CSN values
for transactions that haven't completed yet, aborted transactions or
subtransactions. I call this the dense CSN map. Looking up a CSN of a
XID from the ringbuffer is just a trivial direct indexing into the
ring buffer.
For long running transactions the ringbuffer may do a full circle
before a transaction commits. Such CSN slots along with slots that are
needed for older snapshots are evicted from the dense buffer into a
sorted array of XID-CSN pairs, or the sparse mapping. For locking
purposes there are two sparse buffers, one of them being active the
other inactive, more on that later. Looking up the CSN value of a XID
that has been evicted into the sparse mapping is a matter of
performing a binary search to find the slot and reading the CSN value.
I like this idea of dense and sparse "areas". Seems like a simple
enough, yet reasonably compact representation that might work well in
practice.
Because snapshots can remain active for an unbounded amount of time
and there can be unbounded amount of active snapshots, even the sparse
mapping can fill up.
I don't think the number of active snapshots matters - after all, they
could all refer the same CSN. So that number shouldn't have any
influence on the size of the sparse mapping.
What does matter is the number of transactions referenced by such a
sparse map. You are of course correct in that this number is equally
unbounded.
To handle that case, each backend advertises its
lowest snapshot number csn_min. When values need to be evicted from
the sparse mapping, they are evicted in CSN order and written into the
CSN log - a series of CSN-XID pairs. Backends that may still be
concerned about those values are then notified that values that they
might need to use have been evicted. Backends with invalidated
snapshots can then convert their snapshots to regular list of
concurrent XIDs snapshots at their leisure.
To convert a CSN based snapshot to XID based, a backend would first
scan the shared memory structures for xids up to snapshot.xmax for
CSNs that are concurrent to the snapshot and insert the XIDs into the
snapshot, then read in the CSN log starting from snapshots CSN,
looking for xid's less than the snapshots xmax. After this the
snapshot can be handled like current snapshots are handled.
Hm, I dislike the requirement to maintain two different snapshot formats.
Also mind that snapshot conversions - however unlikely you choose to
make them - may well result in bursts as multiple processes may need to
do such a conversion, all starting at the same point in time.
Rolling back a transaction
--------------------------
Rolling back a transaction is basically the same as committing, but
the CSN slots need to be stamped with a AbortedCSN.
Is that really necessary? After all, an aborted transaction behaves
pretty much the same as a transaction in progress WRT visibility: it's
simply not visible.
Or why do you need to tell apart aborted from in-progress transactions
by CSN?
Sizing the CSN maps
-------------------
CSN maps need to be sized to accommodate the number of backends.
Dense array size should be picked so that most xids commit before
being evicted from the dense map and sparse array will contain slots
necessary for either long running transactions or for long snapshots
not yet converted to XID based snapshots. I did a few quick
simulations to measure the dynamics. If we assume a power law
distribution of transaction lengths and snapshots for the full length
of transactions with no think time, then 16 slots per backend is
enough to make the probability of eviction before commit less than
1e-6 and being needed at eviction due to a snapshot about 1:10000. In
the real world very long transactions are more likely than predicted
model, but at least the normal variation is mostly buffered by this
size. 16 slots = 128bytes per backend ends up at a 12.5KB buffer for
the default value of 100 backends, or 125KB for 1000 backends.
Sounds reasonable to me.
Sparse buffer needs to be at least big enough to fit CSN slots for the
xids of all active transactions and non-overflowed subtransactions. At
the current level PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes out
at 16 bytes * (64 + 1) slots * 100 backends = 101.6KB per buffer,
or 203KB total in the default configuration.
A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.
Performance discussion
----------------------
Taking a snapshot is extremely cheap in this scheme, I think the cost
will be mostly for publishing the csnmin and rechecking validity after
it. Taking snapshots in the shadow of another snapshot (often the case
for the proposed MVCC catalog access) will be even cheaper as we don't
have to advertise the new snapshot. The delayed XID based snapshot
construction should be unlikely, but even if it isn't the costs are
similar to GetSnapshotData, but we don't need to hold a lock.
Visibility checking will also be simpler as for the cases where the
snapshot is covered by the dense array it only requires two memory
lookups and comparisons.
Keep in mind, though, that both of these lookups are into shared memory.
Especially the dense ring buffer may well turn into a point of
contention. Or at least the cache lines holding the most recent XIDs
within that ring buffer.
Whereas currently, the snapshot's xip array resides in process-local
memory. (Granted, often enough, the proc array also is a point of
contention.)
At this point I don't see any major issues with this approach. If the
ensuing discussion doesn't find any major showstoppers then I will
start to work on a patch bit-by-bit.
Bit-by-bit? Reminds me of punched cards. I don't think we accept patches
in that format. :-)
we have a small one in the family.
Congratulations on that one.
Regards
Markus Wanner
On Thu, Jun 6, 2013 at 11:42 PM, Ants Aasma <ants@cybertec.at> wrote:
To refresh your memory the basic idea is to change visibility
determination to be based on a commit sequence number (CSN for short)
- a 8 byte number incremented on every commit representing the total
ordering of commits
I think we would just use the LSN of the commit record which is
effectively the same but doesn't require a new counter.
I don't think this changes anything though.
--
greg
On Fri, Jun 7, 2013 at 2:59 PM, Markus Wanner <markus@bluegap.ch> wrote:
To refresh your memory the basic idea is to change visibility
determination to be based on a commit sequence number (CSN for short)
- a 8 byte number incremented on every commit representing the total
ordering of commits. To take a snapshot in this scheme you only need
to know the value of last assigned CSN, all transactions with XID less
You mean with a *CSN* less than or equal to that number? There's
certainly no point in comparing a CSN with an XID.
That was what I meant. I guess my coffee hadn't kicked in yet there.
than or equal to that number were commited at the time of the
snapshots, everything above wasn't committed. Besides speeding up
snapshot taking, this scheme can also be a building block for
consistent snapshots in a multi-master environment with minimal
communication.
Agreed. Postgres-R uses a CommitOrderId, which is very similar in
concept, for example.
Do you think having this snapshot scheme would be helpful for Postgres-R?
Because snapshots can remain active for an unbounded amount of time
and there can be unbounded amount of active snapshots, even the sparse
mapping can fill up.
I don't think the number of active snapshots matters - after all, they
could all refer the same CSN. So that number shouldn't have any
influence on the size of the sparse mapping.
What does matter is the number of transactions referenced by such a
sparse map. You are of course correct in that this number is equally
unbounded.
Yes, that is what I meant to say but for some reason didn't.
To handle that case, each backend advertises its
lowest snapshot number csn_min. When values need to be evicted from
the sparse mapping, they are evicted in CSN order and written into the
CSN log - a series of CSN-XID pairs. Backends that may still be
concerned about those values are then notified that values that they
might need to use have been evicted. Backends with invalidated
snapshots can then convert their snapshots to regular list of
concurrent XIDs snapshots at their leisure.
To convert a CSN based snapshot to XID based, a backend would first
scan the shared memory structures for xids up to snapshot.xmax for
CSNs that are concurrent to the snapshot and insert the XIDs into the
snapshot, then read in the CSN log starting from snapshots CSN,
looking for xid's less than the snapshots xmax. After this the
snapshot can be handled like current snapshots are handled.
Hm, I dislike the requirement to maintain two different snapshot formats.
Also mind that snapshot conversions - however unlikely you choose to
make them - may well result in bursts as multiple processes may need to
do such a conversion, all starting at the same point in time.
I agree that two snapshot formats is not great. On the other hand, the
additional logic is confined to XidInMVCCSnapshot and is reasonably
simple. If we didn't convert the snapshots we would have to keep
spooling out CSN log and look up XIDs for each visibility check. We
could add a cache for XIDs that were deemed concurrent, but that is in
effect just lazily constructing the same datastructure. The work
needed to convert is reasonably well bounded, can be done without
holding global locks and in most circumstances should only be
necessary for snapshots that are used for a long time and will
amortize the cost. I'm not worried about the bursts because the
conversion is done lock-free, and starting at the same point in time
leads to better cache utilization.
Rolling back a transaction
--------------------------

Rolling back a transaction is basically the same as committing, but
the CSN slots need to be stamped with an AbortedCSN.

Is that really necessary? After all, an aborted transaction behaves
pretty much the same as a transaction in progress WRT visibility: it's
simply not visible.

Or why do you need to tell apart aborted from in-progress transactions
by CSN?
I need to detect aborted transactions so they can be discarded during
the eviction process, otherwise the sparse array will fill up. They
could also be filtered out by cross-referencing uncommitted slots with
the procarray. Having the abort case do some additional work to make
XID assignment cheaper looks like a good tradeoff.
Sparse buffer needs to be at least big enough to fit CSN slots for the
xids of all active transactions and non-overflowed subtransactions. At
the current value of PGPROC_MAX_CACHED_SUBXIDS = 64, the minimum comes
out at 16 bytes * (64 + 1) slots * 100 backends = 101.6KB per buffer,
or 203KB total in the default configuration.

A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.
8 byte alignment for CSNs is needed for atomic access, if nothing else.
I think the size could be cut in half by using a base value for CSNs
if we assume that no xid is active for longer than 2B transactions as
is currently the case. I didn't want to include the complication in
the first iteration, so I didn't verify if that would have any
gotchas.
Performance discussion
----------------------

Taking a snapshot is extremely cheap in this scheme, I think the cost
will be mostly for publishing the csnmin and rechecking validity after
it. Taking snapshots in the shadow of another snapshot (often the case
for the proposed MVCC catalog access) will be even cheaper as we don't
have to advertise the new snapshot. The delayed XID based snapshot
construction should be unlikely, but even if it isn't the costs are
similar to GetSnapshotData, but we don't need to hold a lock.

Visibility checking will also be simpler: for the cases where the
snapshot is covered by the dense array it only requires two memory
lookups and comparisons.

Keep in mind, though, that both of these lookups are into shared memory.
Especially the dense ring buffer may well turn into a point of
contention. Or at least the cache lines holding the most recent XIDs
within that ring buffer.

Whereas currently, the snapshot's xip array resides in process-local
memory. (Granted, often enough, the proc array also is a point of
contention.)
Visibility checks are done lock-free so they don't cause any
contention. The number of times each cache line can be invalidated is
bounded by 8. Overall I think actual performance tests are needed to
see if there is a problem, or perhaps if having the data shared
actually helps with cache hit rates.
At this point I don't see any major issues with this approach. If the
ensuing discussion doesn't find any major showstoppers then I will
start to work on a patch bit-by-bit.

Bit-by-bit? Reminds me of punched cards. I don't think we accept patches
in that format. :-)

we have a small one in the family.
Congratulations on that one.
Thanks,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Jun 7, 2013 at 3:47 PM, Greg Stark <stark@mit.edu> wrote:
On Thu, Jun 6, 2013 at 11:42 PM, Ants Aasma <ants@cybertec.at> wrote:
To refresh your memory the basic idea is to change visibility
determination to be based on a commit sequence number (CSN for short)
- a 8 byte number incremented on every commit representing the total
ordering of commits.

I think we would just use the LSN of the commit record which is
effectively the same but doesn't require a new counter.
I don't think this changes anything though.
I briefly touched on that point in the Hot Standby section. This has
some consequences for locking in CommitTransaction, but otherwise LSN
is as good as any other monotonically increasing value.
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Ants Aasma <ants@cybertec.at> wrote:
Serializable transactions
-------------------------

I won't pretend to be familiar with SSI code, but as far as I can
tell serializable transactions don't need any modifications to
work with the CSN based snapshot scheme. There actually already
is a commit sequence number in the SSI code that could be
replaced by the actual CSN.
That seems quite likely to work, and may be good for performance.
IIRC one of the problems with serializable transactions on hot
standby was that transaction visibility order on the standby is
different from the master.
Pretty much. Technically what SSI does is to ensure that every
serializable transaction's view of the data is consistent with some
serial (one-at-a-time) execution of serializable transactions. That
"apparent order of execution" does not always match commit order.
The problem is that without further work a hot standby query could
see a state which would not have been possible for it to see on the
master. For example, a batch is closed but a transaction has not
yet committed which is part of that batch. For an example, see:
http://wiki.postgresql.org/wiki/SSI#Read_Only_Transactions
As that example demonstrates, as long as no serializable
transaction *sees* the incomplete batch with knowledge (or at least
potential knowledge) of the batch being closed, the pending
transaction affecting the batch is allowed to commit. If the
conflicting state is exposed by even a read-only query, the
transaction with the pending change to the batch is canceled.
A hot standby can't cancel the pending transaction -- at least not
without adding additional communications channels and latency. The
ideas which have been bandied about have had to do with allowing
serializable transactions on a hot standby to use snapshots which
are known to be "safe" -- that is, they cannot expose any such
state. It might be possible to identify known safe points in the
commit stream on the master and pass that information along in the
WAL stream. The down side is that the hot standby would need to
either remember the last such safe snapshot or wait for the next
one, and while these usually come up fairly quickly in most
workloads, there is no actual bounding on how long that could take.
A nicer solution, if we can manage it, is to allow snapshots on the
hot standby which are not based exclusively on the commit order,
but use the apparent order of execution from the master. It seems
non-trivial to get that right.
If we use CSNs for visibility on the slave then we can actually
provide identical visibility order.
As the above indicates, that's not really true without using
"apparent order of execution" instead of "commit order". In the
absence of serializable transactions those are always the same (I
think), but to provide a way to allow serializable transactions on
the hot standby the order would need to be subject to rearrangement
based on read-write conflicts among transactions on the master, or
snapshots which could expose serialization anomalies would need to
be prohibited.
--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Ants,
the more I think about this, the more I start to like it.
On 06/07/2013 02:50 PM, Ants Aasma wrote:
On Fri, Jun 7, 2013 at 2:59 PM, Markus Wanner <markus@bluegap.ch> wrote:
Agreed. Postgres-R uses a CommitOrderId, which is very similar in
concept, for example.

Do you think having this snapshot scheme would be helpful for Postgres-R?
Yeah, it could help to reduce patch size, after a rewrite to use such a CSN.
Or why do you need to tell apart aborted from in-progress transactions
by CSN?

I need to detect aborted transactions so they can be discarded during
the eviction process, otherwise the sparse array will fill up. They
could also be filtered out by cross-referencing uncommitted slots with
the procarray. Having the abort case do some additional work to make
XID assignment cheaper looks like a good tradeoff.
I see.
Sparse buffer needs to be at least big enough to fit CSN slots for the
xids of all active transactions and non-overflowed subtransactions. At
the current level PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes out
at 16 bytes * (64 + 1) slots * 100 backends = 101.6KB per buffer,
or 203KB total in the default configuration.

A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.

8 byte alignment for CSNs is needed for atomic access, if nothing else.
Oh, right, atomic writes.
I think the size could be cut in half by using a base value for CSNs
if we assume that no xid is active for longer than 2B transactions as
is currently the case. I didn't want to include the complication in
the first iteration, so I didn't verify if that would have any
gotchas.
In Postgres-R, I effectively used a 32-bit order id which wraps around.
In this case, I guess adjusting the base value will get tricky. Wrapping
could probably be used as well, instead.
The number of times each cache line can be invalidated is
bounded by 8.
Hm.. good point.
Regards
Markus Wanner
On 11th June 2013, Markus Wanner wrote:
Agreed. Postgres-R uses a CommitOrderId, which is very similar in
concept, for example.

Do you think having this snapshot scheme would be helpful for
Postgres-R?
Yeah, it could help to reduce patch size, after a rewrite to use such a
CSN.

Or why do you need to tell apart aborted from in-progress
transactions by CSN?

I need to detect aborted transactions so they can be discarded during
the eviction process, otherwise the sparse array will fill up. They
could also be filtered out by cross-referencing uncommitted slots with
the procarray. Having the abort case do some additional work to make
XID assignment cheaper looks like a good tradeoff.

I see.
Sparse buffer needs to be at least big enough to fit CSN slots for
the xids of all active transactions and non-overflowed
subtransactions. At the current level PGPROC_MAX_CACHED_SUBXIDS=64,
the minimum comes out at 16 bytes * (64 + 1) slots * 100 backends
= 101.6KB per buffer, or 203KB total in the default configuration.
A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I
guess the given 16 bytes includes alignment to 8 byte boundaries.
Sounds good.
8 byte alignment for CSNs is needed for atomic access, if nothing else.
Oh, right, atomic writes.
I think the size could be cut in half by using a base value for CSNs
if we assume that no xid is active for longer than 2B transactions as
is currently the case. I didn't want to include the complication in
the first iteration, so I didn't verify if that would have any
gotchas.

In Postgres-R, I effectively used a 32-bit order id which wraps around.
In this case, I guess adjusting the base value will get tricky.
Wrapping could probably be used as well, instead.

The number of times each cache line can be invalidated is bounded by
8.

Hm.. good point.
We are also planning to implement CSN based snapshot.
So I am curious to know whether any further development is happening on this.
If not then what is the reason?
Am I missing something?
Thanks and Regards,
Kumar Rajeev Rastogi
On 01/24/2014 02:10 PM, Rajeev rastogi wrote:
We are also planning to implement CSN based snapshot.
So I am curious to know whether any further development is happening on this.
I started looking into this, and plan to work on this for 9.5. It's a
big project, so any help is welcome. The design I have in mind is to use
the LSN of the commit record as the CSN (as Greg Stark suggested).
Some problems and solutions I have been thinking of:
The core of the design is to store the LSN of the commit record in
pg_clog. Currently, we only store 2 bits per transaction there,
indicating if the transaction committed or not, but the patch will
expand it to 64 bits, to store the LSN. To check the visibility of an
XID in a snapshot, the XID's commit LSN is looked up in pg_clog, and
compared with the snapshot's LSN.
Currently, before consulting the clog for an XID's status, it is
necessary to first check if the transaction is still in progress by
scanning the proc array. To get rid of that requirement, just before
writing the commit record in the WAL, the backend will mark the clog
slot with a magic value that says "I'm just about to commit". After
writing the commit record, it is replaced with the record's actual LSN.
If a backend sees the magic value in the clog, it will wait for the
transaction to finish the insertion, and then check again to get the
real LSN. I'm thinking of just using XactLockTableWait() for that. This
mechanism makes the insertion of a commit WAL record and updating the
clog appear atomic to the rest of the system.
With this mechanism, taking a snapshot is just a matter of reading the
current WAL insertion point. There is no need to scan the proc array,
which is good. However, it probably still makes sense to record an xmin
and an xmax in SnapshotData, for performance reasons. An xmax, in
particular, will allow us to skip checking the clog for transactions
that will surely not be visible. We will no longer track the latest
completed XID or the xmin like we do today, but we can use
SharedVariableCache->nextXid as a conservative value for xmax, and keep
a cached global xmin value in shared memory, updated when convenient,
that can be just copied to the snapshot.
In theory, we could use a snapshot LSN as the cutoff-point for
HeapTupleSatisfiesVisibility(). Maybe it's just because this is new, but
that makes me feel uneasy. In any case, I think we'll need a cut-off
point defined as an XID rather than an LSN for freezing purposes. In
particular, we need a cut-off XID to determine how far the pg_clog can
be truncated, and to store in relfrozenxid. So, we will still need the
concept of a global oldest xmin.
When a snapshot is just an LSN, taking a snapshot can no longer
calculate an xmin, like we currently do (there will be a snapshot LSN in
place of an xmin in the proc array). So we will need a new mechanism to
calculate the global oldest xmin. First scan the proc array to find the
oldest still in-progress XID. That - 1 will become the new oldest global
xmin, after all currently active snapshots have finished. We don't want
to sleep in GetOldestXmin(), waiting for the snapshots to finish, so we
should periodically advance a system-wide oldest xmin value, for example
whenever the walwriter process wakes up, so that when we need an
oldest-xmin value, we will always have a fairly recently calculated
value ready in shared memory.
- Heikki
On 2014-05-12 16:56:51 +0300, Heikki Linnakangas wrote:
On 01/24/2014 02:10 PM, Rajeev rastogi wrote:
We are also planning to implement CSN based snapshot.
So I am curious to know whether any further development is happening on this.

I started looking into this, and plan to work on this for 9.5. It's a big
project, so any help is welcome. The design I have in mind is to use the LSN
of the commit record as the CSN (as Greg Stark suggested).
Cool.
I haven't fully thought it through but I think it should make some of
the decoding code simpler. And it should greatly simplify the hot
standby code.
Some of the stuff in here will be influenced by whether your freezing
replacement patch gets in. Do you plan to further pursue that one?
The core of the design is to store the LSN of the commit record in pg_clog.
Currently, we only store 2 bits per transaction there, indicating if the
transaction committed or not, but the patch will expand it to 64 bits, to
store the LSN. To check the visibility of an XID in a snapshot, the XID's
commit LSN is looked up in pg_clog, and compared with the snapshot's LSN.
We'll continue to need some of the old states? You plan to use values
that can never be valid lsns for them? I.e. 0/0 IN_PROGRESS, 0/1 ABORTED
etc?
How do you plan to deal with subtransactions?
Currently, before consulting the clog for an XID's status, it is necessary
to first check if the transaction is still in progress by scanning the proc
array. To get rid of that requirement, just before writing the commit record
in the WAL, the backend will mark the clog slot with a magic value that says
"I'm just about to commit". After writing the commit record, it is replaced
with the record's actual LSN. If a backend sees the magic value in the clog,
it will wait for the transaction to finish the insertion, and then check
again to get the real LSN. I'm thinking of just using XactLockTableWait()
for that. This mechanism makes the insertion of a commit WAL record and
updating the clog appear atomic to the rest of the system.
So it's quite possible that clog will become more of a contention point
due to the doubled amount of writes.
In theory, we could use a snapshot LSN as the cutoff-point for
HeapTupleSatisfiesVisibility(). Maybe it's just because this is new, but
that makes me feel uneasy.
It'd possibly also end up being less efficient because you'd visit the
clog for potentially quite some transactions to get the LSN.
Greetings,
Andres Freund
On 05/12/2014 05:41 PM, Andres Freund wrote:
I haven't fully thought it through but I think it should make some of
the decoding code simpler. And it should greatly simplify the hot
standby code.
Cool. I was worried it might conflict with the logical decoding stuff in
some fundamental way, as I'm not really familiar with it.
Some of the stuff in here will be influence whether your freezing
replacement patch gets in. Do you plan to further pursue that one?
Not sure. I got to the point where it seemed to work, but I got a bit of
cold feet proceeding with it. I used the page header's LSN field to
define the "epoch" of the page, but I started to feel uneasy about it. I
would be much more comfortable with an extra field in the page header,
even though that uses more disk space. And requires dealing with pg_upgrade.
The core of the design is to store the LSN of the commit record in pg_clog.
Currently, we only store 2 bits per transaction there, indicating if the
transaction committed or not, but the patch will expand it to 64 bits, to
store the LSN. To check the visibility of an XID in a snapshot, the XID's
commit LSN is looked up in pg_clog, and compared with the snapshot's LSN.

We'll continue to need some of the old states? You plan to use values
that can never be valid lsns for them? I.e. 0/0 IN_PROGRESS, 0/1 ABORTED
etc?
Exactly.
Using 64 bits per XID instead of just 2 will obviously require a lot
more disk space, so we might actually want to still support the old clog
format too, as an "archive" format. The clog for old transactions could
be converted to the more compact 2-bits per XID format (or even just 1 bit).
How do you plan to deal with subtransactions?
pg_subtrans will stay unchanged. We could possibly merge it with
pg_clog, reserving some 32-bit chunk of values that are not valid LSNs
to mean an uncommitted subtransaction, with the parent XID. That assumes
that you never need to look up the parent of an already-committed
subtransaction. I thought that was true at first, but I think the SSI
code looks up the parent of a committed subtransaction, to find its
predicate locks. Perhaps it could be changed, but seems best to leave it
alone for now; there will be a lot of code churn anyway.
I think we can get rid of the sub-XID array in PGPROC. It's currently
used to speed up TransactionIdIsInProgress(), but with the patch it will
no longer be necessary to call TransactionIdIsInProgress() every time
you check the visibility of an XID, so it doesn't need to be so fast
anymore.
With the new "commit-in-progress" status in clog, we won't need the
sub-committed clog status anymore. The "commit-in-progress" status will
achieve the same thing.
Currently, before consulting the clog for an XID's status, it is necessary
to first check if the transaction is still in progress by scanning the proc
array. To get rid of that requirement, just before writing the commit record
in the WAL, the backend will mark the clog slot with a magic value that says
"I'm just about to commit". After writing the commit record, it is replaced
with the record's actual LSN. If a backend sees the magic value in the clog,
it will wait for the transaction to finish the insertion, and then check
again to get the real LSN. I'm thinking of just using XactLockTableWait()
for that. This mechanism makes the insertion of a commit WAL record and
updating the clog appear atomic to the rest of the system.

So it's quite possible that clog will become more of a contention point
due to the doubled amount of writes.
Yeah. OTOH, each transaction will take more space in the clog, which
will spread the contention across more pages. And I think there are ways
to mitigate contention in clog, if it becomes a problem. We could make
the locking more fine-grained than one lock per page, use atomic 64-bit
reads/writes on platforms that support it, etc.
In theory, we could use a snapshot LSN as the cutoff-point for
HeapTupleSatisfiesVisibility(). Maybe it's just because this is new, but
that makes me feel uneasy.

It'd possibly also end up being less efficient because you'd visit the
clog for potentially quite some transactions to get the LSN.
True.
- Heikki
On Mon, May 12, 2014 at 10:41 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-05-12 16:56:51 +0300, Heikki Linnakangas wrote:
On 01/24/2014 02:10 PM, Rajeev rastogi wrote:
We are also planning to implement CSN based snapshot.
So I am curious to know whether any further development is happening on this.

I started looking into this, and plan to work on this for 9.5. It's a big
project, so any help is welcome. The design I have in mind is to use the LSN
of the commit record as the CSN (as Greg Stark suggested).

Cool.
Yes, very cool. I remember having some concerns about using the LSN
of the commit record as the CSN. I think the biggest one was the need
to update clog with the CSN before the commit record had been written,
which your proposal to store a temporary sentinel value there until
the commit has completed might address. However, I wonder what
happens if you write the commit record and then the attempt to update
pg_clog fails. I think you'll have to PANIC, which kind of sucks. It
would be nice to pin the pg_clog page into the SLRU before writing the
commit record so that we don't have to fear needing to re-read it
afterwards, but the SLRU machinery doesn't currently have that notion.
Another thing to think about is that LSN = CSN will make things much
more difficult if we ever want to support multiple WAL streams with a
separate LSN sequence for each. Perhaps you'll say that's a pipe
dream anyway, and I agree it's probably 5 years out, but I think it
may be something we'll want eventually. With synthetic CSNs those
systems are more decoupled. OTOH, one advantage of LSN = CSN is that
the commit order as seen on the standby would always match the commit
order as seen on the master, which is currently not true, and would be
a very nice property to have.
I think we're likely to find that system performance is quite
sensitive to any latency in updating the global-xmin. One thing about
the present system is that if you take a snapshot while a very "old"
transaction is still running, you're going to use that as your
global-xmin for the entire lifetime of your transaction. It might be
possible, with some of the rejiggering you're thinking about, to
arrange things so that there are opportunities for processes to roll
forward their notion of the global-xmin, making HOT pruning more
efficient. Whether anything good happens there or not is sort of a
side issue, but we need to make sure the efficiency of HOT pruning
doesn't regress.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 12, 2014 at 4:56 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 01/24/2014 02:10 PM, Rajeev rastogi wrote:
We are also planning to implement CSN based snapshot.
So I am curious to know whether any further development is happening on
this.

I started looking into this, and plan to work on this for 9.5. It's a big
project, so any help is welcome. The design I have in mind is to use the LSN
of the commit record as the CSN (as Greg Stark suggested).
I did do some coding work on this, but the free time I used to work on
this basically disappeared with a child in the family. I guess what I
have has bitrotted beyond recognition. However I may still have some
insight that may be of use.
From your comments I presume that you are going with the original,
simpler approach proposed by Robert to simply keep the XID-CSN map
around for ever and probe it for all visibility lookups that lie
outside of the xmin-xmax range? As opposed to the more complex hybrid
approach I proposed that keeps a short term XID-CSN map and lazily
builds conventional list-of-concurrent-XIDs snapshots for long lived
snapshots. I think that would be prudent, as the simpler approach
needs mostly the same ground work and if it turns out to work well
enough, simpler is always better.
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On 2014-05-12 18:01:59 +0300, Heikki Linnakangas wrote:
On 05/12/2014 05:41 PM, Andres Freund wrote:
I haven't fully thought it through but I think it should make some of
the decoding code simpler. And it should greatly simplify the hot
standby code.

Cool. I was worried it might conflict with the logical decoding stuff in
some fundamental way, as I'm not really familiar with it.
My gut feeling is that it should be possible to make it work. I'm too
deep into the last week of a project to properly analyze it, but I am
sure we'll find a place to grab a drink and discuss it next
week.
Essentially all it needs is to be able to represent snapshots from the
past including (and that's the hard part) a snapshot from somewhere in
the midst of an xact. The latter is done by logging cmin/cmax for all
catalog tuples and building a lookup table when looking inside an
xact. That shouldn't change much for CSN based snapshots. I think.
Some of the stuff in here will be influenced by whether your freezing
replacement patch gets in. Do you plan to further pursue that one?

Not sure. I got to the point where it seemed to work, but I got a bit of
cold feet proceeding with it. I used the page header's LSN field to define
the "epoch" of the page, but I started to feel uneasy about it.
Yea. I don't think the approach is fundamentally broken but it touches a
*lot* of arcane places... Or at least it needs to touch many and the
trick is finding them all :)
I would be
much more comfortable with an extra field in the page header, even though
that uses more disk space. And requires dealing with pg_upgrade.
Maybe we can reclaim pagesize_version and prune_xid in some way? It
seems to me prune_xid could be represented as an LSN with CSN snapshots
combined with your freezing approach and we probably don't need the last
two bytes of the lsn for that purpose...
Using 64 bits per XID instead of just 2 will obviously require a lot more
disk space, so we might actually want to still support the old clog format
too, as an "archive" format. The clog for old transactions could be
converted to the more compact 2-bits per XID format (or even just 1 bit).
Wouldn't it make more sense to have two SLRUs then? An SLRU with dynamic
width doesn't seem easily doable.
How do you plan to deal with subtransactions?
pg_subtrans will stay unchanged. We could possibly merge it with pg_clog,
reserving some 32-bit chunk of values that are not valid LSNs to mean an
uncommitted subtransaction, with the parent XID. That assumes that you never
need to look up the parent of an already-committed subtransaction. I thought
that was true at first, but I think the SSI code looks up the parent of a
committed subtransaction, to find its predicate locks. Perhaps it could be
changed, but seems best to leave it alone for now; there will be a lot of
code churn anyway.

I think we can get rid of the sub-XID array in PGPROC. It's currently used
to speed up TransactionIdIsInProgress(), but with the patch it will no
longer be necessary to call TransactionIdIsInProgress() every time you check
the visibility of an XID, so it doesn't need to be so fast anymore.
Whether it can be removed depends on how the whole hot standby stuff is
dealt with... Also, there's some other callsites that do
TransactionIdIsInProgress() at some frequency. Just think about all the
multixact business :(
With the new "commit-in-progress" status in clog, we won't need the
sub-committed clog status anymore. The "commit-in-progress" status will
achieve the same thing.
Wouldn't that cause many spurious waits? Because commit-in-progress
needs to be waited on, but a sub-committed xact surely not?
So it's quite possible that clog will become more of a contention point
due to the doubled amount of writes.

Yeah. OTOH, each transaction will take more space in the clog, which will
spread the contention across more pages. And I think there are ways to
mitigate contention in clog, if it becomes a problem.
I am not opposed, more wondering if you'd thought about it.
I don't think spreading the contention works very well with the current
implementation of slru.c. It's already very prone to throwing away the
wrong page. Widening it will just make that worse.
We could make the
locking more fine-grained than one lock per page, use atomic 64-bit
reads/writes on platforms that support it, etc.
We *really* need an atomics abstraction layer... There's more and more
stuff coming that needs it.
This is going to be a *large* patch.
Greetings,
Andres Freund
On Mon, May 12, 2014 at 6:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
However, I wonder what
happens if you write the commit record and then the attempt to update
pg_clog fails. I think you'll have to PANIC, which kind of sucks.
CLOG IO error while committing is already a PANIC: SimpleLruReadPage()
does SlruReportIOError(), which in turn does ereport(ERROR), while
inside a critical section initiated in RecordTransactionCommit().
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On Mon, May 12, 2014 at 2:56 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Currently, before consulting the clog for an XID's status, it is necessary
to first check if the transaction is still in progress by scanning the proc
array. To get rid of that requirement, just before writing the commit record
in the WAL, the backend will mark the clog slot with a magic value that says
"I'm just about to commit". After writing the commit record, it is replaced
with the record's actual LSN. If a backend sees the magic value in the clog,
it will wait for the transaction to finish the insertion, and then check
again to get the real LSN. I'm thinking of just using XactLockTableWait()
for that. This mechanism makes the insertion of a commit WAL record and
updating the clog appear atomic to the rest of the system.
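To make the protocol concrete, here is a minimal single-process sketch. All names here (`COMMIT_IN_PROGRESS`, the `clog` dict, the `wal` list) are illustrative stand-ins for the real on-disk SLRU and WAL, not PostgreSQL code:

```python
# Illustrative simulation of the proposed "commit-in-progress" clog protocol.
# The real clog is an on-disk SLRU; this sketch only shows the ordering.

COMMIT_IN_PROGRESS = object()  # magic value written before the commit record

clog = {}   # xid -> commit LSN, COMMIT_IN_PROGRESS, or absent (running/aborted)
wal = []    # simulated WAL; len(wal) after an append serves as the LSN

def commit(xid):
    # 1. Mark the slot so readers know a commit record is being inserted.
    clog[xid] = COMMIT_IN_PROGRESS
    # 2. Insert the commit WAL record (XLogInsert in the real system).
    wal.append(("commit", xid))
    lsn = len(wal)
    # 3. Replace the magic value with the record's actual LSN.
    clog[xid] = lsn
    return lsn

def xid_visible(xid, snapshot_lsn, wait=lambda: None):
    """Is xid's commit visible to a snapshot taken at snapshot_lsn?"""
    status = clog.get(xid)
    while status is COMMIT_IN_PROGRESS:
        wait()                  # real code would use XactLockTableWait()
        status = clog.get(xid)
    if status is None:
        return False            # still running or aborted
    return status <= snapshot_lsn

commit(100)
snap = len(wal)        # snapshot: the current WAL insert position
commit(101)            # commits after the snapshot was taken
print(xid_visible(100, snap), xid_visible(101, snap))  # True False
```

The point of the sketch is step ordering: the magic value is in place before the WAL record exists, so a reader either sees the final LSN or knows to wait, never a half-done state.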
Would it be useful to store the current WAL insertion point along with
the "about to commit" flag so it's effectively a promise that this
transaction will commit no earlier than XXX. That should allow most
transactions to decide if those records are visible or not unless
they're very recent transactions which started in that short window
while the committing transaction was in the process of committing.
--
greg
On 05/12/2014 06:26 PM, Andres Freund wrote:
With the new "commit-in-progress" status in clog, we won't need the
sub-committed clog status anymore. The "commit-in-progress" status will
achieve the same thing.
Wouldn't that cause many spurious waits? Because commit-in-progress
needs to be waited on, but a sub-committed xact surely not?
Ah, no. Even today, a subxid isn't marked as sub-committed, until you
commit the top-level transaction. The sub-commit state is a very
transient state during the commit process, used to make the commit of
the sub-transactions and the top-level transaction appear atomic. The
commit-in-progress state would be a similarly short-lived state. You
mark the subxids and the top xid as commit-in-progress just before the
XLogInsert() of the commit record, and you replace them with the real
LSNs right after XLogInsert().
- Heikki
On 2014-05-12 19:14:55 +0300, Heikki Linnakangas wrote:
On 05/12/2014 06:26 PM, Andres Freund wrote:
With the new "commit-in-progress" status in clog, we won't need the
sub-committed clog status anymore. The "commit-in-progress" status will
achieve the same thing.
Wouldn't that cause many spurious waits? Because commit-in-progress
needs to be waited on, but a sub-committed xact surely not?
Ah, no. Even today, a subxid isn't marked as sub-committed, until you commit
the top-level transaction. The sub-commit state is a very transient state
during the commit process, used to make the commit of the sub-transactions
and the top-level transaction appear atomic. The commit-in-progress state
would be a similarly short-lived state. You mark the subxids and the top xid
as commit-in-progress just before the XLogInsert() of the commit record, and
you replace them with the real LSNs right after XLogInsert().
Ah, right. Forgot that detail...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, May 12, 2014 at 7:10 PM, Greg Stark <stark@mit.edu> wrote:
Would it be useful to store the current WAL insertion point along with
the "about to commit" flag so it's effectively a promise that this
transaction will commit no earlier than XXX. That should allow most
transactions to decide if those records are visible or not unless
they're very recent transactions which started in that short window
while the committing transaction was in the process of committing.
I don't believe this is worth the complexity. The contention window is
extremely short here.
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On 12 May 2014 19:27, Heikki Linnakangas wrote:
On 01/24/2014 02:10 PM, Rajeev rastogi wrote:
We are also planning to implement CSN based snapshot.
So I am curious to know whether any further development is happening on this.
I started looking into this, and plan to work on this for 9.5. It's a
big project, so any help is welcome. The design I have in mind is to
use the LSN of the commit record as the CSN (as Greg Stark suggested).
Great !
Some problems and solutions I have been thinking of:
The core of the design is to store the LSN of the commit record in
pg_clog. Currently, we only store 2 bits per transaction there,
indicating if the transaction committed or not, but the patch will
expand it to 64 bits, to store the LSN. To check the visibility of an
XID in a snapshot, the XID's commit LSN is looked up in pg_clog, and
compared with the snapshot's LSN.
Won't it be a bit inefficient to look into pg_clog to read an XID's commit
LSN for every visibility check?
With this mechanism, taking a snapshot is just a matter of reading the
current WAL insertion point. There is no need to scan the proc array,
which is good. However, it probably still makes sense to record an xmin
and an xmax in SnapshotData, for performance reasons. An xmax, in
particular, will allow us to skip checking the clog for transactions
that will surely not be visible. We will no longer track the latest
completed XID or the xmin like we do today, but we can use
SharedVariableCache->nextXid as a conservative value for xmax, and keep
a cached global xmin value in shared memory, updated when convenient,
that can be just copied to the snapshot.
I think we can update xmin whenever the transaction whose XID equals
xmin commits (i.e. in ProcArrayEndTransaction).
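The xmin/xmax fast paths in the quoted design can be sketched as follows. This is a hedged illustration only: the `clog_lsn` dict and the snapshot fields are hypothetical stand-ins for the real clog and SnapshotData:

```python
# Sketch: visibility check with xmin/xmax fast paths in front of the clog lookup.
# clog_lsn maps xid -> commit LSN (None = not committed); names are illustrative.

def xid_committed_before(xid, snapshot, clog_lsn):
    # Fast path 1: older than every transaction that could have been running
    # at snapshot time -> surely visible, without touching the clog.
    if xid < snapshot["xmin"]:
        return True
    # Fast path 2: xmax is a conservative bound (e.g. nextXid at snapshot time);
    # anything at or beyond it surely committed after the snapshot.
    if xid >= snapshot["xmax"]:
        return False
    # Slow path: consult the clog and compare the commit LSN with the
    # snapshot's LSN (the WAL insert position when the snapshot was taken).
    lsn = clog_lsn.get(xid)
    return lsn is not None and lsn <= snapshot["lsn"]

clog_lsn = {50: 900, 60: 1100}
snapshot = {"xmin": 55, "xmax": 70, "lsn": 1000}
print(xid_committed_before(50, snapshot, clog_lsn))  # True: below xmin
print(xid_committed_before(60, snapshot, clog_lsn))  # False: LSN 1100 > 1000
print(xid_committed_before(80, snapshot, clog_lsn))  # False: at/above xmax
```

Only the middle band xmin <= XID < xmax pays for a clog lookup, which is why keeping xmin and xmax in SnapshotData still matters under this design.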
Thanks and Regards,
Kumar Rajeev Rastogi
On Mon, May 12, 2014 at 7:26 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
In theory, we could use a snapshot LSN as the cutoff-point for
HeapTupleSatisfiesVisibility(). Maybe it's just because this is new, but
that makes me feel uneasy.
To accomplish this, won't an XID-to-CSN map table be required, and how will
it be maintained (i.e., when are entries added to and cleared from that map table)?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 05/13/2014 09:44 AM, Amit Kapila wrote:
On Mon, May 12, 2014 at 7:26 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
In theory, we could use a snapshot LSN as the cutoff-point for
HeapTupleSatisfiesVisibility(). Maybe it's just because this is new, but
that makes me feel uneasy.
To accomplish this, won't an XID-to-CSN map table be required, and how will
it be maintained (i.e., when are entries added to and cleared from that map table)?
Not sure I understand. The clog is a mapping from XID to CSN. What
vacuum needs to know is whether the xmin and/or xmax is visible to
everyone (and whether they committed or aborted). To determine that, it
needs the oldest still active snapshot LSN. That can be found by
scanning the proc array. It's pretty much the same as a regular MVCC
visibility check, but using the oldest still-active snapshot.
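The vacuum-side check described above can be sketched as follows; the function and argument names are illustrative assumptions, not PostgreSQL's actual API:

```python
# Sketch: deciding whether a deleted tuple is dead to *everyone*, by comparing
# the deleter's commit LSN against the oldest still-active snapshot LSN.

def dead_to_all(xmax_commit_lsn, active_snapshot_lsns):
    """A deleted tuple can be vacuumed once the deleter committed and its
    commit LSN precedes every active snapshot's LSN."""
    if xmax_commit_lsn is None:          # deleter aborted or still running
        return False
    oldest = min(active_snapshot_lsns, default=float("inf"))
    return xmax_commit_lsn <= oldest

print(dead_to_all(500, [900, 1200]))   # True: 500 precedes every snapshot
print(dead_to_all(1000, [900, 1200]))  # False: the snapshot at 900 may see it
```

As Heikki notes, this is just the regular MVCC visibility rule applied with the oldest still-active snapshot.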
- Heikki
On 05/13/2014 08:08 AM, Rajeev rastogi wrote:
The core of the design is to store the LSN of the commit record in
pg_clog. Currently, we only store 2 bits per transaction there,
indicating if the transaction committed or not, but the patch will
expand it to 64 bits, to store the LSN. To check the visibility of an
XID in a snapshot, the XID's commit LSN is looked up in pg_clog, and
compared with the snapshot's LSN.
Won't it be a bit inefficient to look into pg_clog to read an XID's commit
LSN for every visibility check?
Maybe. If no hint bit is set on the tuple, you have to check the clog
anyway to determine if the tuple is committed. And for XIDs older
than xmin or newer than xmax, you don't need to check pg_clog. But it's
true that for tuples with hint bit set, and xmin < XID < xmax, you have
to check the pg_clog in the new system, when currently you only need to
do a binary search of the local array in the snapshot. My gut feeling is
that it won't be significantly slower in practice. If it becomes a
problem, some rearrangement of the pg_clog code might help, or you could build
a cache of XID->CSN mappings that you've already looked up in
SnapshotData. So I don't think that's going to be a show-stopper.
- Heikki
On Tue, May 13, 2014 at 1:59 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 05/13/2014 09:44 AM, Amit Kapila wrote:
To accomplish this, won't an XID-to-CSN map table be required, and how will
it be maintained (i.e., when are entries added to and cleared from that map table)?
Not sure I understand. The clog is a mapping from XID to CSN.
The case I was referring to is xmin < XID < xmax, for which you have
mentioned below that you are planning to refer directly to pg_clog. This
is, I think, one of the main places where the new design can have an impact
on performance, but as you said it is better to first do the implementation
based on pg_clog rather than jumping directly to optimizing by maintaining
an XID-to-CSN mapping.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 13 May 2014 14:06, Heikki Linnakangas
The core of the design is to store the LSN of the commit record in
pg_clog. Currently, we only store 2 bits per transaction there,
indicating if the transaction committed or not, but the patch will
expand it to 64 bits, to store the LSN. To check the visibility of
an XID in a snapshot, the XID's commit LSN is looked up in pg_clog,
and compared with the snapshot's LSN.
Won't it be a bit inefficient to look into pg_clog to read an XID's
commit LSN for every visibility check?
Maybe. If no hint bit is set on the tuple, you have to check the clog
anyway to determine if the tuple is committed. And for XIDs older
than xmin or newer than xmax, you don't need to check pg_clog. But it's
true that for tuples with hint bit set, and xmin < XID < xmax, you have
to check the pg_clog in the new system, when currently you only need to
do a binary search of the local array in the snapshot. My gut feeling
is that it won't be significantly slower in practice. If it becomes a
problem, some rearrangement of the pg_clog code might help, or you could build
a cache of XID->CSN mappings that you've already looked up in
SnapshotData. So I don't think that's going to be a show-stopper.
Yes, definitely it should not be a show-stopper. This can be optimized later
by the methods you mentioned, and also by some cut-off technique based on
which we can decide that an XID beyond a certain range will always be
visible, thereby avoiding the look-up in pg_clog.
Thanks and Regards,
Kumar Rajeev Rastogi
On Mon, May 12, 2014 at 06:01:59PM +0300, Heikki Linnakangas wrote:
Some of the stuff in here will influence whether your freezing
replacement patch gets in. Do you plan to further pursue that one?
Not sure. I got to the point where it seemed to work, but I got a
bit of cold feet proceeding with it. I used the page header's LSN
field to define the "epoch" of the page, but I started to feel
uneasy about it. I would be much more comfortable with an extra
field in the page header, even though that uses more disk space. And
requires dealing with pg_upgrade.
FYI, pg_upgrade copies pg_clog from the old cluster, so there will be a
pg_upgrade issue anyway.
I am not excited about a 32x increase in clog size, especially since we
already do freezing at 200M transactions to allow for more aggressive
clog trimming. Extrapolating that out, it means we would freeze every
6.25M transactions.
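The arithmetic behind the 32x and 6.25M figures, as a quick check (a sketch of the extrapolation, assuming the per-XID entry grows from 2 bits to 64 bits and the clog disk budget stays fixed):

```python
# Quick check of the clog-growth arithmetic above.
bits_now, bits_proposed = 2, 64
growth = bits_proposed // bits_now        # 32x larger clog per transaction
freeze_now = 200_000_000                  # current default freeze horizon
freeze_scaled = freeze_now // growth      # same disk budget -> freeze sooner
print(growth, freeze_scaled)              # 32 6250000
```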
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Thu, May 15, 2014 at 2:34 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Mon, May 12, 2014 at 06:01:59PM +0300, Heikki Linnakangas wrote:
Some of the stuff in here will influence whether your freezing
replacement patch gets in. Do you plan to further pursue that one?
Not sure. I got to the point where it seemed to work, but I got a
bit of cold feet proceeding with it. I used the page header's LSN
field to define the "epoch" of the page, but I started to feel
uneasy about it. I would be much more comfortable with an extra
field in the page header, even though that uses more disk space. And
requires dealing with pg_upgrade.
FYI, pg_upgrade copies pg_clog from the old cluster, so there will be a
pg_upgrade issue anyway.
I am not excited about a 32x increase in clog size, especially since we
already do freezing at 200M transactions to allow for more aggressive
clog trimming. Extrapolating that out, it means we would freeze every
6.25M transactions.
It seems better to allow clog to grow larger than to force
more-frequent freezing.
If the larger clog size is a show-stopper (and I'm not sure I have an
intelligent opinion on that just yet), one way to get around the
problem would be to summarize CLOG entries after-the-fact. Once an
XID precedes the xmin of every snapshot, we don't need to know the
commit LSN any more. So we could read the old pg_clog files and write
new summary files. Since we don't need to care about subcommitted
transactions either, we could get by with just 1 bit per transaction,
1 = committed, 0 = aborted. Once we've written and fsync'd the
summary files, we could throw away the original files. That might
leave us with a smaller pg_clog than what we have today.
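The summarization step could be sketched as follows; the in-memory list/bitmap layout is an illustration, not the actual SLRU segment format:

```python
# Sketch of the summarization idea: once every XID in a clog range precedes
# the global xmin, collapse per-XID commit LSNs to one bit per transaction.

def summarize(segment):
    """segment: list of commit LSNs (None = aborted). Returns a bytes bitmap,
    1 bit per transaction: 1 = committed, 0 = aborted."""
    bitmap = bytearray((len(segment) + 7) // 8)
    for i, lsn in enumerate(segment):
        if lsn is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)

def committed(bitmap, i):
    return bool(bitmap[i // 8] & (1 << (i % 8)))

old = [100, None, 105, 110, None, 120, 130, 140]  # 8 XIDs at 64 bits each
new = summarize(old)                              # 1 byte after summarization
print(len(new), committed(new, 0), committed(new, 1))  # 1 True False
```

This mirrors the proposal's ordering: write and fsync the summary first, and only then throw away the wide original files.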
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-05-15 15:40:06 -0400, Robert Haas wrote:
On Thu, May 15, 2014 at 2:34 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Mon, May 12, 2014 at 06:01:59PM +0300, Heikki Linnakangas wrote:
Some of the stuff in here will influence whether your freezing
replacement patch gets in. Do you plan to further pursue that one?
Not sure. I got to the point where it seemed to work, but I got a
bit of cold feet proceeding with it. I used the page header's LSN
field to define the "epoch" of the page, but I started to feel
uneasy about it. I would be much more comfortable with an extra
field in the page header, even though that uses more disk space. And
requires dealing with pg_upgrade.
FYI, pg_upgrade copies pg_clog from the old cluster, so there will be a
pg_upgrade issue anyway.
I am not excited about a 32x increase in clog size, especially since we
already do freezing at 200M transactions to allow for more aggressive
clog trimming. Extrapolating that out, it means we would freeze every
6.25M transactions.
The default setting imo is far too low for a database of any relevant
activity. If I had the stomach for the fight around it I'd suggest
increasing it significantly by default. People with small databases
won't be hurt significantly because they simply don't have that many
transactions, and autovacuum will get around to cleanup long before the
limit is reached anyway.
It seems better to allow clog to grow larger than to force
more-frequent freezing.
Yes.
If the larger clog size is a show-stopper (and I'm not sure I have an
intelligent opinion on that just yet), one way to get around the
problem would be to summarize CLOG entries after-the-fact. Once an
XID precedes the xmin of every snapshot, we don't need to know the
commit LSN any more. So we could read the old pg_clog files and write
new summary files. Since we don't need to care about subcommitted
transactions either, we could get by with just 1 bit per transaction,
1 = committed, 0 = aborted. Once we've written and fsync'd the
summary files, we could throw away the original files. That might
leave us with a smaller pg_clog than what we have today.
I think the easiest way for now would be to have pg_clog with the same
format as today and a rangewise much smaller pg_csn storing the lsns
that are needed. That'll leave us with pg_upgrade'ability without
needing to rewrite pg_clog during the upgrade.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, May 15, 2014 at 10:06:32PM +0200, Andres Freund wrote:
If the larger clog size is a show-stopper (and I'm not sure I have an
intelligent opinion on that just yet), one way to get around the
problem would be to summarize CLOG entries after-the-fact. Once an
XID precedes the xmin of every snapshot, we don't need to know the
commit LSN any more. So we could read the old pg_clog files and write
new summary files. Since we don't need to care about subcommitted
transactions either, we could get by with just 1 bit per transaction,
1 = committed, 0 = aborted. Once we've written and fsync'd the
summary files, we could throw away the original files. That might
leave us with a smaller pg_clog than what we have today.
I think the easiest way for now would be to have pg_clog with the same
format as today and a rangewise much smaller pg_csn storing the lsns
that are needed. That'll leave us with pg_upgrade'ability without
needing to rewrite pg_clog during the upgrade.
Yes, I like the idea of storing the CSN separately. One reason the
2-bit clog is so good is that we know we have atomic 1-byte writes on
all platforms. Can we assume atomic 64-bit writes?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 2014-05-15 16:13:49 -0400, Bruce Momjian wrote:
On Thu, May 15, 2014 at 10:06:32PM +0200, Andres Freund wrote:
If the larger clog size is a show-stopper (and I'm not sure I have an
intelligent opinion on that just yet), one way to get around the
problem would be to summarize CLOG entries after-the-fact. Once an
XID precedes the xmin of every snapshot, we don't need to know the
commit LSN any more. So we could read the old pg_clog files and write
new summary files. Since we don't need to care about subcommitted
transactions either, we could get by with just 1 bit per transaction,
1 = committed, 0 = aborted. Once we've written and fsync'd the
summary files, we could throw away the original files. That might
leave us with a smaller pg_clog than what we have today.
I think the easiest way for now would be to have pg_clog with the same
format as today and a rangewise much smaller pg_csn storing the lsns
that are needed. That'll leave us with pg_upgrade'ability without
needing to rewrite pg_clog during the upgrade.
Yes, I like the idea of storing the CSN separately. One reason the
2-bit clog is so good is that we know we have atomic 1-byte writes on
all platforms.
I don't think we rely on that anywhere. And in fact we don't have the
ability to do so for arbitrary bytes - lots of platforms can do that
only on specifically aligned bytes.
We rely on being able to atomically (as in either before/after no torn
value) write/read TransactionIds, but that's it I think?
Can we assume atomic 64-bit writes?
Not on 32bit platforms.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
On 2014-05-15 15:40:06 -0400, Robert Haas wrote:
On Thu, May 15, 2014 at 2:34 PM, Bruce Momjian <bruce@momjian.us> wrote:
If the larger clog size is a show-stopper (and I'm not sure I have an
intelligent opinion on that just yet), one way to get around the
problem would be to summarize CLOG entries after-the-fact. Once an
XID precedes the xmin of every snapshot, we don't need to know the
commit LSN any more. So we could read the old pg_clog files and write
new summary files. Since we don't need to care about subcommitted
transactions either, we could get by with just 1 bit per transaction,
1 = committed, 0 = aborted. Once we've written and fsync'd the
summary files, we could throw away the original files. That might
leave us with a smaller pg_clog than what we have today.
I think the easiest way for now would be to have pg_clog with the same
format as today and a rangewise much smaller pg_csn storing the lsns
that are needed. That'll leave us with pg_upgrade'ability without
needing to rewrite pg_clog during the upgrade.
Err, we're proposing a patch to add timestamps to each commit,
/messages/by-id/20131022221600.GE4987@eldon.alvh.no-ip.org
which does so in precisely this way.
The idea that pg_csn or pg_committs can be truncated much earlier than
pg_clog has its merit, no doubt. If we can make sure that the atomicity
is sane, +1 from me.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2014-05-15 17:37:14 -0400, Alvaro Herrera wrote:
Andres Freund wrote:
On 2014-05-15 15:40:06 -0400, Robert Haas wrote:
On Thu, May 15, 2014 at 2:34 PM, Bruce Momjian <bruce@momjian.us> wrote:
If the larger clog size is a show-stopper (and I'm not sure I have an
intelligent opinion on that just yet), one way to get around the
problem would be to summarize CLOG entries after-the-fact. Once an
XID precedes the xmin of every snapshot, we don't need to know the
commit LSN any more. So we could read the old pg_clog files and write
new summary files. Since we don't need to care about subcommitted
transactions either, we could get by with just 1 bit per transaction,
1 = committed, 0 = aborted. Once we've written and fsync'd the
summary files, we could throw away the original files. That might
leave us with a smaller pg_clog than what we have today.
I think the easiest way for now would be to have pg_clog with the same
format as today and a rangewise much smaller pg_csn storing the lsns
that are needed. That'll leave us with pg_upgrade'ability without
needing to rewrite pg_clog during the upgrade.
Err, we're proposing a patch to add timestamps to each commit,
/messages/by-id/20131022221600.GE4987@eldon.alvh.no-ip.org
which does so in precisely this way.
I am not sure where my statements above conflict with committs?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
On 2014-05-15 17:37:14 -0400, Alvaro Herrera wrote:
Andres Freund wrote:
On 2014-05-15 15:40:06 -0400, Robert Haas wrote:
On Thu, May 15, 2014 at 2:34 PM, Bruce Momjian <bruce@momjian.us> wrote:
If the larger clog size is a show-stopper (and I'm not sure I have an
intelligent opinion on that just yet), one way to get around the
problem would be to summarize CLOG entries after-the-fact. Once an
XID precedes the xmin of every snapshot, we don't need to know the
commit LSN any more. So we could read the old pg_clog files and write
new summary files. Since we don't need to care about subcommitted
transactions either, we could get by with just 1 bit per transaction,
1 = committed, 0 = aborted. Once we've written and fsync'd the
summary files, we could throw away the original files. That might
leave us with a smaller pg_clog than what we have today.
I think the easiest way for now would be to have pg_clog with the same
format as today and a rangewise much smaller pg_csn storing the lsns
that are needed. That'll leave us with pg_upgrade'ability without
needing to rewrite pg_clog during the upgrade.
Err, we're proposing a patch to add timestamps to each commit,
/messages/by-id/20131022221600.GE4987@eldon.alvh.no-ip.org
which does so in precisely this way.
I am not sure where my statements above conflict with committs?
I didn't say it did ...
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
So, here's a first version of the patch. Still very much WIP.
One thorny issue came up in discussions with other hackers on this in PGCon:
When a transaction is committed asynchronously, it becomes visible to
other backends before the commit WAL record is flushed. With CSN-based
snapshots, the order that transactions become visible is always based on
the LSNs of the WAL records. This is a problem when there is a mix of
synchronous and asynchronous commits:
If transaction A commits synchronously with commit LSN 1, and
transaction B commits asynchronously with commit LSN 2, B cannot become
visible before A. And we cannot acknowledge B as committed to the client
until it's visible to other transactions. That means that B will have to
wait for A's commit record to be flushed to disk, before it can return,
even though it was an asynchronous commit.
I personally think that's annoying, but we can live with it. The most
common usage of synchronous_commit=off is to run a lot of transactions
in that mode, setting it in postgresql.conf. And it wouldn't completely
defeat the purpose of mixing synchronous and asynchronous commits
either: an asynchronous commit still only needs to wait for any
already-logged synchronous commits to be flushed to disk, not the commit
record of the asynchronous transaction itself.
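The constraint described above can be sketched numerically. A hedged illustration: the function name and the "return the largest earlier synchronous commit LSN" framing are mine, not the patch's actual logic:

```python
# Sketch of the async-commit constraint: with CSN == commit LSN, commits become
# visible strictly in LSN order, so an async commit cannot be acknowledged
# until WAL is flushed past every earlier synchronous commit's LSN (though
# not past its own commit record).

def ack_lsn_required(my_commit_lsn, sync_commit_lsns):
    """Flush LSN the async commit must wait for before acknowledging: the
    largest synchronous commit LSN that precedes its own; 0 means no wait."""
    earlier_sync = [l for l in sync_commit_lsns if l < my_commit_lsn]
    return max(earlier_sync, default=0)

# A commits synchronously at LSN 1, B asynchronously at LSN 2:
print(ack_lsn_required(2, [1]))   # B must wait for the flush to reach A's record
print(ack_lsn_required(2, []))    # no earlier sync commits: 0, no wait needed
```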
Ants' original design with a separate commit-sequence-number that's
different from the commit LSN would not have this problem, because that
would allow the commits to become visible to others out of WAL order.
However, the WAL order == commit order is a nice and simple property,
with other advantages.
Some bigger TODO items:
* Logical decoding is broken. I hacked on it enough that it looks
roughly sane and it compiles, but didn't spend more time to debug.
* I expanded pg_clog to 64-bits per XID, but people suggested keeping
pg_clog as is, with two bits per commit, and adding a new SLRU for the
commit LSNs beside it. Probably will need to do something like that to
avoid bloating the clog.
* Add some kind of backend-private caching of clog, to make it faster to
access. The visibility checks are now hitting the clog a lot more
heavily than before, as you need to check the clog even if the hint bits
are set, if the XID falls between xmin and xmax of the snapshot.
* Transactions currently become visible immediately when a WAL record is
inserted, before it's flushed. That's wrong, but shouldn't be difficult
to fix (except for the async commit issue explained above).
- Heikki
Attachments:
csn-1.patch.gz (application/gzip)
��&4���eA���6A�"R��t�5����#M��wiI �$,�t��z�-3^c����i�b���P/;�n��$���U�g���&���&�����Q������g���v��
�����>,��G��}�?��XL���'c�6�n�F���[V�����9*
�n�75F�kt�o�K9���hS����4V������i�W�ym����qD��B�Td ?k��jL�y�C�<Jc�aN�����d�P[P���c�H�n&t���DL��i�]4� n*5A�A&9�!�����O�`��/�3��O.H���(�0�H�DE�(8x�C���bJv��
?��eY�r��I��� ��_��+��#��<���/{��D������"�}d4�y��~�qQ�
6�������IK);M���]�u>���b�A���g��}������k�����ct��-�P����tG�f��� �JO�I�%�2�TfD,�o��P�4F�����M�tp~|t|�����h@"U��$[�vX���|_i}q/�4����/���I:Q���0��m���Y8���V�HG���)�b���Y/�{I���X�e@M�<3���R��>�w
qdF!B��&�G�8��s�^�G���U�}�{s
����jo�R�U�C������/G����@�hw��bJ ����*d�|�
����c��*����36�P5�TO�e�8�PB�����<���-I��������+�-�<���v�e���`^�GL�~��Y�<g#*s��I'\C�0���)0���cl8��Sw��E��X��$[����v# ��a�%6��dt�Ew��8�5
������CH�Q|e�LD���$���/2!����Q�(�c~���Tnk��l�W2p�O���'�b��
Y}�������X�LNH����r��p6�����_T�sv69>gg����_pA��z����m">YvY(�#������ye��{E�"�����M�1�9�K���Z����Z�������BGf6���2�Hls;[���n����\n���B�\��J�\�!�O;�h�?��m�*���)��v�O��FD���SW���8:<m�r|�������m�����������x`����_#Y_y�����\,��-\�H(���\�hk�S^��}4�P���",$�!%o>97�����
���o�o���v[��������@�a'��6c����������W���� �5����cye`V�}t �F���H��KC�>/�����3����4s�L��m�as��
d��_F��5I��g�����Z'��pt�C]�w��}]�dr5(�8�Z��������M<��R-�|�[3�wL+�^q�V�P�;�8������Q��6�)��HA���{�2�d#s��4�}73���F�`?�J��6�6wW-���������������V#�2����������5�c��;2��-S��/{ ����`��V�{���A���m���/�=��E'��$�>����Mm8!�U��H�<Z���%�0z�`k_'{�5c�aB9�b��u YiG.J�>��}�M`� �Q�-vV�g�)��u���F;3Hw��M�z3���+����n$='R�*�/����~@����U������p��BM��b�l?5j;7�c/1������0@#�QV���x�p7��+�|@n)���)h��%��R�+z!�*�[���h��a4k�X%g��6���o�k����\��]�f��F�������Y~���{;?��2*q�|�� d��������q����2��L����GOl��f��&�s�7 W���/�We5��� ��Sr�DR��P�weK�k�C������`�F�Br(�y3��F=���n46����;-���}�f<}�'%��A2.b��85��k�b��a'��?���k@���!����5����������0oI���S����c��K��x�����U��[z��K\��a��G�� )���H������+%"��d6���a2
���Y_"���$�R�!�E�\���)��A�2�4�N���;�����B�����K�L������#:F���!!cM�zi��a���0�x�%>��r�]�G��V�EM�|�4,t�������f�x��C��Jp:����b��-!��[b�{�e��.�}b]�9�A��
f�E�0w!8��z�r\{�A+A��]
�������5�a�n�5M����v��E�O5��^�N�-0~[PSI's#�0v��\b@/�'�������wX��� @,J���`�y�NE�w[��9�s��zC���������d H�*�t�h��h}WF���l6H$�����H/q�|���\��b��#\�j��}��p��y��Fse&���j�J�t=u=��u��(e_����=�����=@V��2
Z+B��Cp���KS.����.}���*w-���PC��� k8�*Z�Qo��7���v#d-��W�M�h"��f�&��I7�:i�_�l%
�����w7vw[�������������*-e�9%t��mi�l���v1�6 %��1���Y_r�6"d���.�L���4�2�����{S����Z>�'�q����,2��� 8��=�����^�?=����>�(���;���Dv/�����N���zxKO��������\��C�D/^D�U������UD @�o�^����l�|~|����w���yO�J�}��3��n�<��r�bn<�4 ����^�
������\��|#[�4�����U�V�|�w:{�N�����w���<�
{[��������D���dxC+�%�h[L�m
},�!���o�j:3
�*h�,�����>5{7&�|�.�s~�'�������(d>�\)=$',�����CX���m�(�/����9g�Y�c7��{��r?<��UrL���O~�A����U�� �G���k�u��;��f�K�}�B�?�Y+0�f���M�G�����>B��R��5�������n���lu����]�e]->9^c:6�;d"�O���ZDtN��I���&���K������]��,���!�o�=�&����3l ���n�[WP�P:r[���cBm 6�-v������_olP��������<o�x���}3SyG��������U,�|X�(�p>�o^\��>}{|�����m�k���\��7��(�`x��.P�j�#*��lz(�,����)A���<3����O�)�V��S��&]��~���
8
0�G�3������6�b+�&o��f��01{,=4rRx;���/���H|8�����Sag��X6�2�>~�g����iU��x{w�������=�^�-�����������+(W�=" j������W��J��e7[c��t�f�0/�y�Hve��,�^�_������s�p6z�`�6�[bz|�`��T7�F_
�W�u2=���V�]3t �Ca�2�&j�����i��t_
N����'��o�??�|:~�~�����h�RzgnLH�F�������k.���Fg�\*��>�K����q2�vc{���-b�~���)�E����
u;dk�Bs�\u�>@�F�Q����H�����.0ehBp'6w_"=��LF�>���6���i��S���cpxq~YI����_�%l�[mj���Y�����5U:�����F�����������W�NpE����\R���$����;�H(�F��=��,#pU���Db���*5��f��5�a��AH����25����8�F����9�fjsBxr�NP�>�-$�e�������x�l�>�1u����7�2%��e�0���g�Y/��~�(�`����wWH���a��#��&����k-��Uc���k#�g�@���m�-�j�DG�b��frUA��������������@L3$�>uQ��
���7��8.�g��V/�����fGG@�s2��p2������ �{���|��I��]���L�7I2Q��,��V�8��U�>���O^y�#����2�L�xD�ZJJ�]mr�����&�,��&�FM;�i���� �Lj8���;���D�5�|E��8H;���
@y�y[4��,��O�j]7Xo��ac�d�w,�i�Ob({����g0E'�q�915��t�� ��fDX�����2�����/��d�b$_g�4� ����K�T����5�Hx�]Y�`�5�?L���!V�����h+u�����W
V��y���*3��&��`<�vJ�*�a�!9�k�jf��Y
hH����o�]��Ci���(>��9q���a��4]MF��r�h� k��?���,�����Du�����H��$��N��^��S���v]o��GH{����*����)�Y����U�G�����t���c �"R�����`q����99��H��2~�M���0Hl�/�w]H��� h4��<���Yg�e��^���.-d�`�1>
3u~����#�����=oFWWDd�j�\r_�a��������G���3�� �
.8<9��0�����[�1���".4��&�%�x)��V��*�.O:��]��_���^5��OP{�����Bi5l���e?=�1��{H��MX[�+����������c+�������l��@��&M�hN-�~�A���D�(_Y�g����G��P��6dy�>r�{Op������9���w�8di�P(3e�P��o��Z����r���W�O��z��8�0���[cc5!�"<e���c6t����
�7��� ���T4)*��:�E7�6�|��4{���c��.!Fp&]W��Fh���k�6V�����qf��
�hxO9.�g� �����a%�\|q�J#rU�������Vt"���;IW��7���y��5��y��6g� �1^��Ah�*td�(� Gis 0��ec�(z����������Y ��O�!SE@J���-s�{���y�����fq��j]�T&������]�������NM��`/��pgT$�;���D��c8%w]Lwi�)1,�����aS3���z�����dF$C��$�z�czG�@����@��l��lf�o���l���|}���S�4 ����_3QB����&t��$,hB�0I�h!�V������;�Rt��� )�~�N�?�\d�!aX���S���%&�*�GJ�S��Cu�Vq����L�ei�kA�aU�����[kV��=T T�����3"���zc{+�o��469Yg��e�1C,�q�yL�����jr% �fV�'��i�2����`��'Md��� ����� �1��Y��iX�v*��8r�\�S�p
@���n1��B���r'�k�K��b��X�HCY���f99��aX�
�S�gA9��3b"z1��1j�=�����sh�2�X��{�K�Ef����k�4�����E� 6���a��#���~O&#N����L��(����Zz��lu�+Y���W�U���h�S�s�I?����E�W^G���)���5����P#3K�)HR�P�!s��t��������T$n2N��9Y�.B���O��l$P�QA���M�S��K��E�$!G�G����76;���&����<g������4jZ�R������G���0�����B�����O_�59bnl���c�\���\�����xNw�u"���$u��
x+lj�
c�5�$r��1�b�"��P ���q+����24^�V���yM��T/&d���,x���(�9���I�����l��u�b�l�p��A�&��T����J4�Q�5��[�� ��k��
����E�N�-F�I�2�oD����HT�Y��� 3�tO]��D�H(��
�|�|�R�*3��
���������E�_����r��~1����X�&�y�j����j��?�$V3
,�=�$:Q�b�����R9�3�8;�g�0~I���\m�YI�F��rF�O���w��/�u�����&�>��fM�G�G'5/�y���7^ur��0��R �8���1�b�kR=0�p��J30�o:�#�(8�f�=�8�6|Loz3
�^r������+0���� SO:�]�����E��j����a�����`�M�:f��K�d��'��I�,&��
�5Ap9�� �,2�Pl1C�v�'��h����2E�V�[#�"B�����d�6�wY�� j
���a8�zS9������e3���}e�GyY{�9��t��&#������/���xB���T#
�|�V'^A��}4eBV���01+�&M����N�ZB�Pz�:<�Te�����B�
����@����� ��N�X�D����ww%��U��4�����ym�O'����m�����"-k�z�f\���
�/{�9�����b��`D��:,L�]���dL�@�\��.�����������R�O����3��T-l_�Z��652� A\�|���P�� ��]��"GW�Af��h��0��EN�l�gC7����������Zg�2�
��@�l�����,I�E�Ry��K�����{������1[����0%�TS��V��!\UW����w�f"�vI�Z����3�1"�;��R�iJP�*GC��Y�G0�p+\���+�9h�"�D���T�K��0�m�����d���b"4���j�������:�omn )�|D@ �`���%
����
pi�>�� ���@�c ����q�`d[��%����!#�Q���N�U�P��������j�12X�_FFB$e����K�FJ����8���g��o��JMoEi�(b�M5~;x2����\���Ia��o~���d�%O�(�&a��?���lY���$��kG���E�NF}�H(�@"C } ��_A��y������m��3���*�����6��L�Z��tP�� �����P�w�zcdU�!i��!����������}��%�O������0�O5�\����#�o��(��e�^�:�X���6A�*m�f��#.F��"�*&����p#4��q�������X�FwX:���^i�Fhb����_:1�2����+���.�i��B|)B�[��9�:�#��>�1%�5d%A�5��c�������7g�#����I�K�X7m@��Qw�����_?|��9���Bk��8o��!�-����\V3�=`�u� ����r��5%��J�����W�b3CdNb��l�<���){s7�� q#\
#36{�mT\k*��31�Q��S"�5t�=�'��e�!���,6���������N���m�677w�{�C�s�,�x J�������Z�����Z ��{sW!d���y�!^!]��z���`B�5n��K���u��,s�����D�`]�q`q�?��,������E���9��l���������?��+�F�^T{�����ifX��Z��x�������j�O?��h
��j�g\����
�K*n[=�����3x���<�L�������' ���?L!�Z������"xA�K��q�.�]�4�/C���'�.��{te�qr�n*��Tn��d4�:���8� /�����V�6u�9I�l����V6�
��&�=Ec��">�����n4"�u3_K����3�a|�����pT�\9�����^0t���p�9K�����T�<$�cm��Hl��e�V>�a��A�Y4���_�2��M�8(;��7����,����}yGe�x7��T��_VD���V�7G�\�f7o
9/��{T5��^�k��kkc�+��� D�(VA�t�r��\$T�H��
rrQF� �`�D�w R �@�Rd�<�m��?������_��AR��
<�)}f~���07m_6��D�H?��D:j7��@Cq���G��F��������H�WO��44N�MA,� �;��2{�����3�@=�
J��^@d����3�%b�^�����/JH$RS;fO�3u&�Ye�e�-�|�b��qu(�N�-�0��ISW��;��:;��gB-{�N��/�� zAj�`��g2���N�����������(�d(��;��&�
Om2����&��e��*U��[���2�������K�Z9;���-{s-}k�y�j��?.~��+_�{�����_��Rf�ZN���j3������,��s�������zN�N��������W���Mi�Q4�����s��7P��@�T��6��N�+�5�*;�(R����4x����-��IIz�.�M��D^d����4FFt��s�+�2�D �T���?��k��#B�Tlx7�W#���5���g�@�����d,�N�"�g�R�}�%����^�FS�)�s����d��/$(MhB�Ry��o;������D��D�_�A ��8Q��g�M����G{4l���-xsi>���e��8��$����vMgM���i��h1Of0N��
>4��^�)�H'��M2�I�����z,���%�oq�E��$���<�x8�>�C"�J��y��G�-���:rx�S#��i���$�;g������k@�t4sh����7)�&���������z7�G������,����]8V�jF���Q�6�/nZ2��/��V��n�?����wOT� ���R-�"�s�s�.���@���Nua��x
����Z9�1$�)Q�"fu��xMYH�,A���!���l�D�,�T�l�%�F�X�8<���L��Yv�@
`�����IMj#q}������
����hC�����g��ubD[�f�����G��( !��6�K�����Z�zj)�������H����;���g)��[����'u���B��Zl�$�jti���5|Uye�C�
W7}C9s�V�g���L�D �R_����=�fX�?]���c�g�a�C�(�4Cq� �Z�|� d�M�e�e�g�,.�\������2 �"���
e���n�4�n�N= Yo�Z�<�_�P�<nYq�( S9�W�4��Xk��d���JS�����oX�|J�V��c$�f��rD��������+o�������{j ���I*�Qs&��C����������������^��Q�[�� ����D��LjGf�FF��9+�������7.N�v�f�&�6p<Qt��q�ek���03��|3�]������G���)�dB��������v�T)���i(=��
.2������$�R!��~l,��H�!wenu�?8����3�0���hE>����C��3 Q�X-T���h���< ���������mMT����_���p��@������2P��iP�Iso]�kz!m��i�Z_�k��"���c!��).�k�f'��L�����_h��^�m��5�.2?-R\�����/de��v�-�c���p��n���_n���s�=gy=�d�{ls�h�D��36k���X��6G����F��/�?�E�����u�q��/Z7�L����1�f�|UH���U�����6�8ks������B�"��k�����r����~�D�(X���u���+���������(��#��
��u���4���K�!�p!F<����!�� �p9l��3���jA���R���(��(z<��;�y��R��R���/
�TG5��y���� �����?^�*�D��:��e�!V���#������ �$�v�&!S�7�U�E@�w���8�����M,�.P��SPz�P��K�p�T���������N�I�@ ~0��F�VhJt��\ly���DJ'az��9A��P_�
'C���|z/i�Y�a�{_�����#�<cW��t�>2Y�#N-�.�O�Y�s\��>]�����k�L�*����g��{7!q�;
t��N�e��h�S��\�R�q���'N��������5��1��,�m�e'x\���IPu�"h��5��� .�;�9���x��?��=��`
��r����9�~��y���D��;�I���~ne���c�-�����0�������(&��J�.����H����L�B��.U��$��}=�X-]F���b����y�ei��Hn'�q3�����?T)����������>0���8L��RMs���.p�$���Nc�(������?g�`���s�&+����)�p����{��P�q��+����^�4%n��?X������42��+�p�QiT���~�N 'W���/�c�}�Z��z�o>0�,�9�o���)��G�
V hv��#�
Y�G+|uy\�C@Ql��(�����_�`����y���h����z�o��w��$����0�]�C���Fq�_���0j�Y�}�Z�r�������Q�|�o������M������&�t���A_
D����jqF~(�R�;���j#�����g�^p6��#������j*��=��G$��A���'M���;��!Sv�&�8���4,���2BMW�������6����]*�z#�p�����"���}Y[���g�XY8�9���E�#^C�9������Vkwws�rq����&���qT�w"[�.��
1p�=�!��}��uy?]��U |pAgbkD��,m��2U<K[Y���w��V�G~Mk��e9=�x ��(s��#r������{�9�P==��+�vr�������m�`��mV�T�@��p�R����8�������,�����?�y3���g�v���-l��x��U�����LJ�I����Y��7{Mwg��,n��l��rv�F��v�����7�;���*��S�qF��������sP .<�A�M�������P
k�����c���1 ���L ��!�T��\�1P/� ��T|`��,ku��\g�i%������A�����/�K�U��B�����r���4��j���D�)���~P��:3�����/��K$:���Ip5�=g����t��;+��lkp ��}M d/$��ri�Vh��>�bwI��4���3N+
���/�d��w%K)G���q�F�d~T�{I�����>j��@QX�?� ���3�0�%��Eh�L�GB��; � ����GnZ�����K��������`(��Ko�l{A�%��|�S� \���F��/�7����i%_"{�}W2�$�V�������:<�G>5I�r��( R�b� k��+���2�2��`���i:��
� -f���t�V����,y���L1��d�B!V �
�YL+������jWv����x�MF������!��pd_q��9����)l�|��W�>���
�������q�
��G���� 1��f�|�����'�?YA��!;*�_�Z���1�k�Z���_��%�@=�}����������.�2+5@�)7��� h�k��5v6���Fc{;gb]�Xx �qO����������[��$������(@iX�^~ ?OF�'��Es^;�8[0�z�w�IW ���u%w6�%���"<����7�l�y����'{�'���4�Jj
9�v.��L����U�*;�����*"����n���7�?�O��`���g���S0��'��?�N
�8�L0j;���!1�~�cw�Ut�1;����u��������A�pQ*���������b�
G<�����]�?,.E�)d���2��H<P���Q�B�=8YK�T�[o������������2ow�J����r7�|����-����@��n~���,�<[O�w$��WVJmVH�fQ-K7���������/�=�������}��f��-S�Bus�P��BuS\^O8���J���vcc�[�����R 6"��O��r�U�%)�B
�Wa��Z������(�.��������|~T���+�����A�cg^�$�%�X`�jV��
��n�k�[D�uI��$ D�r��m��t���9��[BAzN���!i���-��4m��8�X��]�� Y<�B���s]���i����M�+`��)j�/�GI���A �.f�4����������C� �r9Q9R�����7$�K,=�����a��F&�x_�rgc��veOi�4���B:p?�^1b�[S�rA$�&�A���Ns�)��p��Od��U?H)�0����PH��.�#e���yP����Kh��i\Qd����7��d�a���Q�u�>��B�c�~+'�H�~�~��>���,5p[h�� �[�P�!*�%�x^��W`62��4����'@y��H���p������F%w�/�0�&�����3v7@?�^���N��d���MJ
?'��^y����@A�SdX���������D#L4�w2u)���=Jx^(���:�/����1��EL��IS�ALe�^���Yv��j��s�Z4^�����I���-�O��
����
�����|"lWp1���D������zgf>�c.��{��R����Fh��,����9C�PLz����Q��O��b�S'l9�<
�C������0�a���F�M��� ��vF��D��lB�`}��P/5������A�-�ZU,m!]�>32�d4�T�����'#�D�{|�&��K����(F���������l��KQ�~�8���� ������h�����2������KF�`�zq68)!,8`R�v,�O?�+�y�+��CC�,X�c�Cj����A�0:���(^}���f��P�9=��s�������|���
[s���!���.�^�<W��4D��\����
3[2�k���q�?*$� /'0�4�<kZRkv�+��R�-%��x��b+G�T�#�&!��'����1��GRO^���B �,�@irV7�����oK�q����E���R.IE"%d\�F�%r�����=�9�������F�aiGK��nl���E�������� �~��;3m<d���Rje�V� {�flp����&���,���E�k�p�����&�a������f��9��p5ZB��o�X��
#�p��� 9��VZ}���)��� xN!E��D��=����" S�DJ�f��5��������S�2"������u�S��b�d��%D�s�A�n �J�i���B�J�.l1 5��-"�����$�����T�G��~����b)�)�la�q�sM(��Ac���%ys�ai��q���k��&\���z��-��{k������Va��j|nM��p��M�'���DM�7�NS����j!N�k�['2�����Hyr�����i������F�y Bw�j��aR$�s�.�)���\�z9����`m�����
��;*D�A\��=�iy��<m�}� ��K�q�MCsy���o�/wdk�$�Az5��V�B�������(�}`�Ui�t���>V����R�J�������a���-[C%���Z��$LRW��������(���
�sS�#zhJ�%����;��0�w:�XL��p�4V����,�2 ;�����J�������BUa�*y���3Y�2.��:�DbN�T�<T��,y:f��+(8WH1b8����5���S;�S�`AO��G,�n�������
d����(�*��UEE��2��g��5�O�2�
�O,���$�����~��Z��foss�h~���p~��/�u�s�ds��0{G��
5 a��$z|���%�J�_�hh�o��Jv�%�����S�%��/���S�����k�d�����nI6��P�������W�=*���_�M������1R,��h����?���L e"c����l#�?v�~q�c�H��z�S����L��o�o����~wx�����MJZ�����5Kf��0�������v����h-�
[sJ/%� f��������{a�_�<17]'`�P��V�}K2�BfF+��z��1����SZjFo�L�;�0���<��<���i����������7�a|Sr����P�]c(;��ei������#Zg&�G��$`q�����Y�G���y�mU��l0 �t�������R|��������,|�=����J�y�Y]4r���gc�GAj���3A�����~�V��o��Y�>E���eOS����������B��}[�#�W��8������n�B�����S������Wp,va���o�!����})D|����%����������*��=�
�Vm6K;d�����E���FuhB����o�"$�&*��:-([iP��-�^�&���b���=����x�#<�"C��U�����I+�p��=��W�68Z��V>3w�@RSi��0��&Q(z��������[�]z����q{�wn�,KM��7M�����|c{�/t�~�����E��y��jT��R����r������8�N��Y���wsI�������]�@�R��2f����uA�N�!i�C��9/S�J�6%ks%a���A5���L&����`b�/����^mU���������=x��u?���u?���u������^�����y�,�~cck�UV�G����sg�4C������H�T����a�z8���<$���[������
��I1�)�&AO�qH��^[kq��c�a� 4?����^\v��,��)��Vt6a �/2�}L{���2"8���2�
� �8��_�s�)~�3R������S�oZ#��0j�����z������y@�[���|+,��*8��>�hVWF�
}�I����/��� ���{+h]�1���J�/����on��93�g���vS�mM���a��3ZK��/�Z�6�
4 S�0`[��^�+�~��Pa.�L$��E�@b{$���6�����r��rOr�j!��_G����%�U'�����U���ajN�V�ll��9\|��e���|
�E�;<=9r�5� ���?����Pd��)!P�mY���x����l#�J,#{�[���Vko�`k����e����.b[2�� ��u�����&7![&�)��i�$"(�)��1�t�T��u3[���F�����_��d�����2/C�U��GR�P��0(�k�K�2x}:$`��"�0u���_�O���h��/#���t�0���/�7�������[?cPu�{�d��[���(�r+���!z.,Zm�}����8XF��)����S�&<���g�O��O�^.=�~T[%�N������N���d!�5"-���",��"�W��������I1���3���&�nHM~���Z$v�O�.�J]��
��Uo�@���n��?P��"��|�����}J
�
2���d8� �j�r���q $������e%��Py�S����$�L�� �-#��:A0
vB�$�h�E�^]rk�����M�@�n:5J�`��v|r�6��d�K��yj��U;�&OF"p{��5-�:�a~��7� �@AIE��gM##d �������,WC$q�JY>��T��^��1�O��I�y�y�?J�U���=*�Q�����-}����������p��Ko
�����C�uaC�6d��.=���~�[/~�%Y�h���c2����p=����w^I)E���6� qG�Jw��1*���"N^\��>}{|�^���FJ����` p'd��p�d�d�����>��?�VP�3?�ao����-�<Skt�p����(��$��b�u���4Cx�P��X�@ys<��&��D��:���
�)�C���H�W�\4�_���
��>��|e+T�L��%;\��������U��w8�SE����*V��9VW ?f�������t���kr(c�0�F�O�ke��� 9��
-1{�4�V�*(��"��#i����)�jw� � ��q�O8�$��Z*q�>"�"��51bm#��W���95/��.� ;��s{S���lR8��� �5e)����u?�k�4s��G�I'9����K<������lH�6��9_!+��/�������d�Ir)�j�(����eL��\ ~{oC .�^Jp��vBfn����@��Q����5]�����@�z�KO�9V~�a�v)�4�coH�9��l5���&���X�b7�2fl{j�r�86��|���HRc@�z"���g}K<�cT~
������n�� ��@��cDQ�8,7!
aq���l��W�%P��=|Xe ��*.���)��"�9[���aS$�P��E�����^�I��^�OK�1�����;��<[��0}��j�?�{������(=�&/�����j�e���R
Rt���Xkn�zZ~���d���xU�"�z�X8�����b`ZY���U%
�B\�y�0��y���~��8�����L!1?���&��ZS��pE�5q�H&�M'^V �|-����2�J� �`��>��$���M������|���l>��zJ�S�:��SZs���e��
�w�.�W�i�~�~�F���./��z��]MsC;��������!�%�3���� ���cT�����b���+ �+s�&�Ic!bE�6;yx�����:�s�tN�� ��;l���amd���\�D���2C��xB��%�*��jX3;���
�/r�����+��� ���n��\X�;��o�A�3<U����S��!�I^���`�^j��2���
��
��R�����P�HT���hOd��=�[�0��5�/��D)@\Z��?�������
������\c�\}�t~�~������"�=9��d�b��2����zI�Q��������(k�������'���������a���]<R�\U�i�����Sr��`�������5��,���G����������H[aF�T�Pb -�M-�������=X����C�u���iA�4�XX���xz���I!%���f<+X��trJ}+���w���&D��}�����p^kL�=�����Y�LZk!�g�"_{^�r�2@�)���O6k�$��������)Yk����d�����4������T��������X?�6��+��������7���7�{��L��0��>����qM�����a�R�����ga'����W��O�U�]�'�b`iVCz@� �
Pz���jH���s=b�d#ta�)k;����N��?��sA�`) S���c��P����Y��������V�8��L�V8�� ���p�u�s�����������R��M}��(!���[�Ijz.�\jLF��'����*0���x:���?Bd�0���V<�9�����4��aw-/��U�z���1�aiH��dD�{dU!�/'���+�u������nl�����������R��q���e/Zg���!(���u�8]�^%����K��6��D�n��aL�4�`��kNs�0��g���^#��Mg���6�����w)��V�PX-������9���
_7b��es�� P(.�":�t�v��^�0J�iJ�
�P?x��|d��L�[[f�l��������c����u���#����wS��>�[������t�ws�h�3�|T������/���Y3^.����Z�R�l"xvccW�g;�zx*�3��QA���q�`y�:\���Ra��4�P|��`���-k��(����)B�\,�Y���-��R/�\y��s1O:�61l{���������*��x�=��r�'��*c��2gv�k��"���S�a��B���_��Q�'* ������A����������;0+�N �-7�e)bWp�R0��`c3�.�D�
�lL�)�����������#k.t����C#���"�?dG��g����KW�~kX��x����"�-KKuT����m���bH1#�]����6����)(9eg��Y��������\T /��7���=�\@��[��9
�[���pk�����9c~�p/�����!�&��-��|d�[S����
Lra�"�H4� ��^"H�U�q8e�[����h�>������e��
�|�nW �����o�����&�P�A��l���*s\����������������5�A����)
E�W%�Q���/1����G��wg�����i�,
�8���M�)�l�MQ)��I�fM�m��6�]rZ���@�&n�������D���������k�oPpY�G����8�d�`�!�*A�e�8�y��J.����onl��� o��m��;
U�����M1���S�}=@I>����A��������_�A����0���]����`=�_s����K2� ������UB]@�u�}�� a����nmB�"�-}�����y=q5"� /h���H����u�Z�Y�n��
B���P���q���\���E�����N2��r�������a8�T2����'�M�;������y��t�=_��V�����PP|�bMQ�{Fs�D��Y�����$�=�����4y���/��=�(�d<MG#lug�8r�f
�}A����m���D����3>�0+����M y�u�<<��Q\� �N<T��=f�����C���#�&�C��Z%,[�*lsb)�����w�����]2-N"��;f�DL�����>�\/i�Gb�1�E�a!:8C��F����Q`-��
P���^�� :6��%��P��Z�
� ������H�d���T�r�b��+����Ad��y���E/M�Y>�o��$�K��������G��s��m^�~`X�M��K��tJp7!��t�$o� ��������9�Vc��xL���Z��g�[l1�hR�`>�PW$�d��e[)*��%�v]
_������Z�OrvHC@d�p����2���-����5� �m�t1 �1�!�,����=1�=�[�*���w�s�S�]^�"le�I{����2���Q����,�oF����[�5�R�r6�+n��+q����� �&D1�}��Sz�pR���C;��0�_<�.���z�w#!��?9|s�����������gd(�a'�h�U���P��2���D���Df&�EI�N4sN���
e�����@�������2�������Z����x��|���8_����~}��8�Idz�\�����}���
�����L����
��:��}@����
�l3�0��j��M�]\��;
���z��mqD.Ige�����J���:�G�s�T�8*yK� O��k=��g���Aq'fj���������L0z�S�Tz��tS9���m�`�o���P0�>��(�|:�x���=-���mVd�Z���T���_�e��z�7i��C����V�J�>I�6���\S����<j���)����2��~Vt_������k�.$��"^M�J[�-K���(0�6#�m�LQEP�V��sVS���$������f����/I��\�um#Y�U���5�_ �?���Zn������Z��0����g��5z������r����
��J �h����7;����E����������7.N�vL+q=H�')RE�����lAYC�PJ�d��ia�i0����E�:�&1UP�JCP��7�l��py�]��y@���&,�T��3H����Q�2����q��Xp`B���<A ��d�r8*�� ��J��y�-�e��3:�@4�1��'$bS�����L��e1�V!<�e�]P�4:O�>������P��uiEZ�9�������$oC��]k���`���]q��~=�f���8��5��n�{����p�'���=�'��m<.~e����y-������U/gi��P�E�r��������#}F������^�������1* .�d'��h���hAf�Q]���#����6���n�s.-1R_��U$cC$��?z���W
�H���'�jr������Y��R��7����� �(%XF1�;������H�����>�=a�[�@*S�.s2�G�����}�_jd^��
�����:�B���=C���a�T����|��o��f��}ot;�K�L���@�|����� ?��/���F��_~x���.���-rq]�J<w�2'�>�"�xV��T�ZB�^F0g�&���[��������?�!���p ��F����-�����K$n��^�����bw��������}_��^�������J��#m�G��X���%�z}sG�����a�d�iC#����i��,Qg�U��BQ�QFB<�����6�gv���`=�d��<f���\Dyx��x+�96�1X�.��������jc�a���?�6�B��~)����$��Z�����z<s:�x���k@�6T���&��V���T���jzB?���x��~P��D��\��2V������:���g����g���=�;���/��fD�_��}�����Ev����J��g��k��[�,k���i��������8�E�B��EMIr��w�QB�R#�g0Ln�����/���d�{2�U��N�����P�~���9����R)��t0Dogf�U�&&A�������x+)\��_�;������������?_M�������o� ���X���\�n�q:�l�1] �8�8�!��}b�c��` !�Gl[+j�*H8a-���_\�����Ngb1�C�2�R���h~��e��o?�O�_����)�z�.��3��7�������r�"L���y_f���;�:5��v�E����v�h�H��Y9�&s��j�_A2��B@���N�c���rV�=Q�+l ��)6�Z�b��y��QkW���g|���������bu?��.�����o3����L ������]�}L?CbG��gI�<�6m2b��66p>��(�W�rsS��6��l6�
�:�&��z���v�r�kD/?����9#�_�rx~,�0��d9�j��/L�Y/M���?���hxG���-z�S�a]Pi�L �Z`X��_�������q�������fvgh~���&7Y+�g����� ����;������oo��-f����yi�)�P�J'��>|~|��8����GR�
�l�~I�w�������.Z�������8l�S�����/s��.���]{��������go1t���^5�L�7���9E� �e�}�^>�r|~l:����`�O_����~����U��HJ��O
�l_��)�NW|��r�f�[��yy�h!?�n��cA�����[2�y_�DJ6�Q\!oq�v�%z��
���+|L�p���7���'��y��%�}(��l���>�-W���B���&l`j<�RPda1jD�or���>���������F5],%�y����6b�0���^,�Sl ���~�������n��m��N�.J�R���PF�0�4���hgT��=�d�V
Q���� %���wG��~���5�!��� W�D�(��2��
���x:�>��phn���h��~�!QR�"��#C��}��xA������Rv�*��R�o�R�o~�Ry#'���V��GC�j����>M8�p65�)WA�?��Nf)i�������r�Q�1�<�zb��@(�C�V�m�,w���\"�����#�\a!5� �6J���_��++�u�ujM���a���bT���������kx`L5��h L��f�5�S�I�W���k�����v���|��wU/vU!��!F��+��9���<����e�p9�Zx��fos}o���?H����ex�t0��I�����?�$�}K�D������5�\lb�cQ�$9:a��FgaUS����?Z���K��'