Exposing the Xact commit order to the user

Started by Jan Wieck · almost 16 years ago · 102 messages · hackers
#1 Jan Wieck
JanWieck@Yahoo.com

In some systems (data warehousing, replication), the order of commits is
important, since that is the order in which changes have become visible.
This information could theoretically be extracted from the WAL, but
scanning the entire WAL just to extract this tidbit of information would
be excruciatingly painful.

The following is based on ideas that emerged during last week's PGCon.
Consider it an implementation proposal, if you like.

We introduce a new set of files. The files represent segments of an
infinite array of structures. The present segments are the available
"window" of data. Similar to CLOG files, the individual file name will
represent the high bits of a "serial" number, the offset of the record
inside the file represents the low bits of the "serial".

The system will have postgresql.conf options for enabling/disabling the
whole shebang, how many shared buffers to allocate for managing access
to the data and to define the retention period of the data based on data
volume and/or age of the commit records.

Each record of the Transaction Commit Info consists of

txid xci_transaction_id
timestamptz xci_begin_timestamp
timestamptz xci_commit_timestamp
int64 xci_total_rowcount

32 bytes total.

CommitTransaction() inside of xact.c will call a function that inserts
a new record into this array. The operation will most of the time be
nothing more than taking a spinlock and adding the record to shared memory.
All the data for the record is readily available, does not require
further locking and can be collected locally before taking the spinlock.
The begin_timestamp is the transaction's idea of CURRENT_TIMESTAMP, the
commit_timestamp is what CommitTransaction() just decided to write into
the WAL commit record and the total_rowcount is the sum of inserted,
updated and deleted heap tuples during the transaction, which should be
easily available from the statistics collector, unless row stats are
disabled, in which case the datum would be zero.

The function will return the "sequence" number which CommitTransaction()
in turn will record in the WAL commit record together with the
begin_timestamp. While both the begin and the commit timestamps are
crucial to determine what data a particular transaction should have
seen, the row count is not and will not be recorded in WAL.

Checkpoint handling will call a function to flush the shared buffers.
Together with this, the information from WAL records will be sufficient
to recover this data (except for row counts) during crash recovery.

Exposing the data will be done via a set returning function. The SRF
takes two arguments. The maximum number of rows to return and the last
serial number processed by the reader. The advantage of such an SRF is that
the result can be used in a query that right away delivers audit or
replication log information in transaction commit order. The SRF can
return an empty set if no further transactions have committed since, or
an error if data segments needed to answer the request have already been
purged.

Purging of the data will be possible in several different ways.
Autovacuum will call a function that drops segments of the data that are
outside the postgresql.conf configuration with respect to maximum age
or data volume. There will also be a function reserved for superusers to
explicitly purge the data up to a certain serial number.

Comments, suggestions?

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

#2 Greg Sabino Mullane
greg@turnstep.com
In reply to: Jan Wieck (#1)
Re: Exposing the Xact commit order to the user


> Exposing the data will be done via a set returning function. The SRF
> takes two arguments. The maximum number of rows to return and the last
> serial number processed by the reader. The advantage of such SRF is that
> the result can be used in a query that right away delivers audit or
> replication log information in transaction commit order. The SRF can
> return an empty set if no further transactions have committed since, or
> an error if data segments needed to answer the request have already been
> purged.

In light of the proposed purging scheme, how would it be able to distinguish
between those two cases (nothing there yet vs. was there but purged)?

--
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201005231646
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Jan Wieck (#1)
Re: Exposing the Xact commit order to the user

On Sun, May 23, 2010 at 4:21 PM, Jan Wieck <JanWieck@yahoo.com> wrote:

> The system will have postgresql.conf options for enabling/disabling the
> whole shebang, how many shared buffers to allocate for managing access
> to the data and to define the retention period of the data based on data
> volume and/or age of the commit records.

It would be nice if this could just be managed out of shared_buffers
rather than needing to configure a separate pool just for this
feature. But, I'm not sure how much work that is, and if it turns out
to be too ugly then I'd say it's not a hard requirement. In general,
I think we talked during the meeting about the desirability of folding
specific pools into shared_buffers rather than managing them
separately, but I'm not aware that we have any cases where we do that
today so it might be hard (or not).

> Each record of the Transaction Commit Info consists of
>
>     txid          xci_transaction_id
>     timestamptz   xci_begin_timestamp
>     timestamptz   xci_commit_timestamp
>     int64         xci_total_rowcount
>
> 32 bytes total.

Are we sure it's worth including the row count? I wonder if we ought
to leave that out and let individual clients of the mechanism track
that if they're so inclined, especially since it won't be reliable
anyway.

> CommitTransaction() inside of xact.c will call a function, that inserts
> a new record into this array. The operation will for most of the time be
> nothing than taking a spinlock and adding the record to shared memory.
> All the data for the record is readily available, does not require
> further locking and can be collected locally before taking the spinlock.

What happens when you need to switch pages?

> The function will return the "sequence" number which CommitTransaction()
> in turn will record in the WAL commit record together with the
> begin_timestamp. While both, the begin as well as the commit timestamp
> are crucial to determine what data a particular transaction should have
> seen, the row count is not and will not be recorded in WAL.

It would certainly be better if we didn't have to bloat the commit
xlog records to do this. Is there any way to avoid that?

> Checkpoint handling will call a function to flush the shared buffers.
> Together with this, the information from WAL records will be sufficient
> to recover this data (except for row counts) during crash recovery.

Right.

> Exposing the data will be done via a set returning function. The SRF
> takes two arguments. The maximum number of rows to return and the last
> serial number processed by the reader. The advantage of such SRF is that
> the result can be used in a query that right away delivers audit or
> replication log information in transaction commit order. The SRF can
> return an empty set if no further transactions have committed since, or
> an error if data segments needed to answer the request have already been
> purged.
>
> Purging of the data will be possible in several different ways.
> Autovacuum will call a function that drops segments of the data that are
> outside the postgresql.conf configuration with respect to maximum age
> or data volume. There will also be a function reserved for superusers to
> explicitly purge the data up to a certain serial number.

Dunno if autovacuuming this is the right way to go. Seems like that
could lead to replication breaks, and it's also more work than not
doing that. I'd just say that if you turn this on you're responsible
for pruning it, full stop.

> Anyone who trades liberty for security deserves neither
> liberty nor security. -- Benjamin Franklin

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#4 Jan Wieck
JanWieck@Yahoo.com
In reply to: Greg Sabino Mullane (#2)
Re: Exposing the Xact commit order to the user

On 5/23/2010 4:48 PM, Greg Sabino Mullane wrote:

>> Exposing the data will be done via a set returning function. The SRF
>> takes two arguments. The maximum number of rows to return and the last
>> serial number processed by the reader. The advantage of such SRF is that
>> the result can be used in a query that right away delivers audit or
>> replication log information in transaction commit order. The SRF can
>> return an empty set if no further transactions have committed since, or
>> an error if data segments needed to answer the request have already been
>> purged.
>
> In light of the proposed purging scheme, how would it be able to distinguish
> between those two cases (nothing there yet vs. was there but purged)?

There is a difference between an empty result set and an exception.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

#5 Jan Wieck
JanWieck@Yahoo.com
In reply to: Robert Haas (#3)
Re: Exposing the Xact commit order to the user

On 5/23/2010 8:38 PM, Robert Haas wrote:

> On Sun, May 23, 2010 at 4:21 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
>
>> The system will have postgresql.conf options for enabling/disabling the
>> whole shebang, how many shared buffers to allocate for managing access
>> to the data and to define the retention period of the data based on data
>> volume and/or age of the commit records.
>
> It would be nice if this could just be managed out of shared_buffers
> rather than needing to configure a separate pool just for this
> feature. But, I'm not sure how much work that is, and if it turns out
> to be too ugly then I'd say it's not a hard requirement. In general,
> I think we talked during the meeting about the desirability of folding
> specific pools into shared_buffers rather than managing them
> separately, but I'm not aware that we have any cases where we do that
> today so it might be hard (or not).

I'm not sure the retention policies of the shared buffer cache, the WAL
buffers, CLOG buffers and every other thing we try to cache are that
easy to fold into one single set of logic. But I'm all ears.

>> Each record of the Transaction Commit Info consists of
>>
>> txid xci_transaction_id
>> timestamptz xci_begin_timestamp
>> timestamptz xci_commit_timestamp
>> int64 xci_total_rowcount
>>
>> 32 bytes total.
>
> Are we sure it's worth including the row count? I wonder if we ought
> to leave that out and let individual clients of the mechanism track
> that if they're so inclined, especially since it won't be reliable
> anyway.

Nope, we (my belly and I) are not sure about the absolute worth of the
row count. It would be a convenient number to have there, but I can live
without it.

>> CommitTransaction() inside of xact.c will call a function, that inserts
>> a new record into this array. The operation will for most of the time be
>> nothing than taking a spinlock and adding the record to shared memory.
>> All the data for the record is readily available, does not require
>> further locking and can be collected locally before taking the spinlock.
>
> What happens when you need to switch pages?

Then the code will have to grab another free buffer or evict one.

>> The function will return the "sequence" number which CommitTransaction()
>> in turn will record in the WAL commit record together with the
>> begin_timestamp. While both, the begin as well as the commit timestamp
>> are crucial to determine what data a particular transaction should have
>> seen, the row count is not and will not be recorded in WAL.
>
> It would certainly be better if we didn't to bloat the commit xlog
> records to do this. Is there any way to avoid that?

If you can tell me how a crash recovering system can figure out what the
exact "sequence" number of the WAL commit record at hand should be,
let's rip it.

>> Checkpoint handling will call a function to flush the shared buffers.
>> Together with this, the information from WAL records will be sufficient
>> to recover this data (except for row counts) during crash recovery.
>
> Right.

>> Exposing the data will be done via a set returning function. The SRF
>> takes two arguments. The maximum number of rows to return and the last
>> serial number processed by the reader. The advantage of such SRF is that
>> the result can be used in a query that right away delivers audit or
>> replication log information in transaction commit order. The SRF can
>> return an empty set if no further transactions have committed since, or
>> an error if data segments needed to answer the request have already been
>> purged.
>>
>> Purging of the data will be possible in several different ways.
>> Autovacuum will call a function that drops segments of the data that are
>> outside the postgresql.conf configuration with respect to maximum age
>> or data volume. There will also be a function reserved for superusers to
>> explicitly purge the data up to a certain serial number.
>
> Dunno if autovacuuming this is the right way to go. Seems like that
> could leave to replication breaks, and it's also more work than not
> doing that. I'd just say that if you turn this on you're responsible
> for pruning it, full stop.

It is an option. "Keep it until I tell you" is a perfectly valid
configuration option. One you probably don't want to forget about, but
valid nonetheless.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

#6 Greg Sabino Mullane
greg@turnstep.com
In reply to: Jan Wieck (#4)
Re: Exposing the Xact commit order to the user


>> In light of the proposed purging scheme, how would it be able to distinguish
>> between those two cases (nothing there yet vs. was there but purged)?
>
> There is a difference between an empty result set and an exception.

No, I meant how will the *function* know, if a superuser and/or some
background process can purge records at any time?

--
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201005240928
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

#7 Robert Haas
robertmhaas@gmail.com
In reply to: Jan Wieck (#5)
Re: Exposing the Xact commit order to the user

On Sun, May 23, 2010 at 9:44 PM, Jan Wieck <JanWieck@yahoo.com> wrote:

> I'm not sure the retention policies of the shared buffer cache, the WAL
> buffers, CLOG buffers and every other thing we try to cache are that easy to
> fold into one single set of logic. But I'm all ears.

I'm not sure either, although it seems like LRU ought to be good
enough for most things. I'm more worried about things like whether
the BufferDesc abstraction is going to get in the way.

>>> CommitTransaction() inside of xact.c will call a function, that inserts
>>> a new record into this array. The operation will for most of the time be
>>> nothing than taking a spinlock and adding the record to shared memory.
>>> All the data for the record is readily available, does not require
>>> further locking and can be collected locally before taking the spinlock.
>>
>> What happens when you need to switch pages?
>
> Then the code will have to grab another free buffer or evict one.

Hopefully not while holding a spin lock. :-)

>>> The function will return the "sequence" number which CommitTransaction()
>>> in turn will record in the WAL commit record together with the
>>> begin_timestamp. While both, the begin as well as the commit timestamp
>>> are crucial to determine what data a particular transaction should have
>>> seen, the row count is not and will not be recorded in WAL.
>>
>> It would certainly be better if we didn't to bloat the commit xlog
>> records to do this. Is there any way to avoid that?
>
> If you can tell me how a crash recovering system can figure out what the
> exact "sequence" number of the WAL commit record at hand should be, let's
> rip it.

Hmm... could we get away with WAL-logging the next sequence number
just once per checkpoint? When you replay the checkpoint record, you
update the control file with the sequence number. Then all the
commits up through the next checkpoint just use consecutive numbers
starting at that value.

> It is an option. "Keep it until I tell you" is a perfectly valid
> configuration option. One you probably don't want to forget about, but valid
> none the less.

As Tom is fond of saying, if it breaks, you get to keep both pieces.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#8 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#7)
Re: Exposing the Xact commit order to the user

Jan Wieck wrote:

> In some systems (data warehousing, replication), the order of
> commits is important, since that is the order in which changes
> have become visible.

This issue intersects with the serializable work I've been doing.
While in database transactions using S2PL the above is true, in
snapshot isolation and the SSI implementation of serializable
transactions, it's not. In particular, the snapshot anomalies which
can cause non-serializable behavior happen precisely because the
apparent order of execution doesn't match anything so linear as
order of commit.

I'll raise that receipting example again. You have transactions
which grab the current deposit data and insert it into receipts, as
payments are received. At some point in the afternoon, the deposit
date in a control table is changed to the next day, so that the
receipts up to that point can be deposited during banking hours with
the current date as their deposit date. A report is printed (and
likely a transfer transaction recorded to move "cash in drawer" to
"cash in checking", but I'll ignore that aspect for this example).
Some receipts may not be committed when the update to the date in
the control table is committed.

This is "eventually consistent" -- once all the receipts with the
old date commit or roll back the database is OK, but until then you
might be able to select the new date in the control table and the
set of receipts matching the old date without the database telling
you that you're missing data. The new serializable implementation
fixes this, but there are open R&D items (due to the need to discuss
the issues) on the related Wiki page related to hot standby and
other replication. Will we be able to support transactional
integrity on slave machines?

What if the update to the control table and the insert of receipts
all happen on the master, but someone decides to move the (now
happily working correctly with serializable transactions) reporting
to a slave machine? (And by the way, don't get too hung up on this
particular example, I could generate dozens more on demand -- the
point is that order of commit doesn't always correspond to apparent
order of execution; in this case the receipts *appear* to have
executed first, because they are using a value "later" updated to
something else by a different transaction, even though that other
transaction *committed* first.)

Replicating or recreating the whole predicate locking and conflict
detection on slaves is not feasible for performance reasons. (I
won't elaborate unless someone feels that's not intuitively
obvious.) The only sane way I can see to have a slave database allow
serializable behavior is to WAL-log the acquisition of a snapshot by
a serializable transaction, and the rollback or commit, on the
master, and to have the serializable snapshot build on a slave
exclude any serializable transactions for which there are still
concurrent serializable transactions. Yes, that does mean
WAL-logging the snapshot acquisition even if the transaction doesn't
yet have an xid, and WAL-logging the commit or rollback even if it
never acquires an xid.

I think this solves the issue Jan raises as long as serializable
transactions are used; if they aren't there are no guarantees of
transactional integrity no matter how you track commit sequence,
unless it can be based on S2PL-type blocking locks. I'll have to
leave that to someone else to sort out.

-Kevin

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#8)
Re: Exposing the Xact commit order to the user

On Mon, May 24, 2010 at 11:24 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

> Jan Wieck wrote:
>
>> In some systems (data warehousing, replication), the order of
>> commits is important, since that is the order in which changes
>> have become visible.
>
> This issue intersects with the serializable work I've been doing.
> While in database transactions using S2PL the above is true, in
> snapshot isolation and the SSI implementation of serializable
> transactions, it's not.

I think you're confusing two subtly different things. The way to
prove that a set of transactions running under some implementation of
serializability is actually serializable is to construct a serial
order of execution consistent with the view of the database that each
transaction saw. This may or may not match the commit order, as you
say. But the commit order is still the order the effects of those
transactions have become visible - if we inserted a new read-only
transaction into the stream at some arbitrary point in time, it would
see all the transactions which committed before it and none of those
that committed afterward. So I think Jan's statement is correct.

Having said that, I think your concerns about how things will look
from a slave's point of view are possibly valid. A transaction
running on a slave is essentially a read-only transaction that the
master doesn't know about. It's not clear to me whether adding such a
transaction to the timeline could result in either (a) that
transaction being rolled back or (b) some impact on which other
transactions got rolled back. If it did, that would obviously be a
problem for serializability on slaves, though your proposed fix sounds
like it would be prohibitively expensive for many users. But can this
actually happen?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#10 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#9)
Re: Exposing the Xact commit order to the user

Robert Haas wrote:

> I think you're confusing two subtly different things.

The only thing I'm confused about is what benefit anyone expects to
get from looking at data between commits in some way other than our
current snapshot mechanism. Can someone explain a use case where
what Jan is proposing is better than snapshot isolation? It doesn't
provide any additional integrity guarantees that I can see.

> But the commit order is still the order the effects of those
> transactions have become visible - if we inserted a new read-only
> transaction into the stream at some arbitrary point in time, it
> would see all the transactions which committed before it and none
> of those that committed afterward.

Isn't that what a snapshot does already?

> your proposed fix sounds like it would be prohibitively expensive
> for many users. But can this actually happen?

How so? The transaction start/end logging, or looking at that data
when building a snapshot?

-Kevin

#11 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kevin Grittner (#10)
Re: Exposing the Xact commit order to the user

On 24/05/10 19:51, Kevin Grittner wrote:

> The only thing I'm confused about is what benefit anyone expects to
> get from looking at data between commits in some way other than our
> current snapshot mechanism. Can someone explain a use case where
> what Jan is proposing is better than snapshot isolation? It doesn't
> provide any additional integrity guarantees that I can see.

Right, it doesn't. What it provides is a way to reconstruct a snapshot
at any point in time, after the fact. For example, after transactions A,
C, D and B have committed in that order, it allows you to reconstruct a
snapshot just like you would've gotten immediately after the commit of
A, C, D and B respectively. That's useful for replication tools like
Slony that need to commit the changes of those transactions on the
slave in the same order as they were committed on the master.

I don't know enough of Slony et al. to understand why that'd be better
than the current heartbeat mechanism they use, taking a snapshot every
few seconds, batching commits.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#12 Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#10)
Re: Exposing the Xact commit order to the user

On Mon, May 24, 2010 at 12:51 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

> Robert Haas wrote:
>
>> I think you're confusing two subtly different things.
>
> The only thing I'm confused about is what benefit anyone expects to
> get from looking at data between commits in some way other than our
> current snapshot mechanism. Can someone explain a use case where
> what Jan is proposing is better than snapshot isolation? It doesn't
> provide any additional integrity guarantees that I can see.

It's a tool for replication solutions to use.

>> But the commit order is still the order the effects of those
>> transactions have become visible - if we inserted a new read-only
>> transaction into the stream at some arbitrary point in time, it
>> would see all the transactions which committed before it and none
>> of those that committed afterward.
>
> Isn't that what a snapshot does already?

Yes, for a particular transaction. But this is to allow transactions
to be replayed (in order) on another node.

>> your proposed fix sounds like it would be prohibitively expensive
>> for many users. But can this actually happen?
>
> How so? The transaction start/end logging, or looking at that data
> when building a snapshot?

I guess what I'm asking is - if the reconstructed transaction order
inferred by SSI doesn't match the actual commit order, can we get a
serialization anomaly on the standby by replaying transactions there
in commit order? Can you give an example and explain how your
proposal would solve it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#13 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#3)
Re: Exposing the Xact commit order to the user

Excerpts from Robert Haas's message of Sun May 23 20:38:14 -0400 2010:

> On Sun, May 23, 2010 at 4:21 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
>
>> The system will have postgresql.conf options for enabling/disabling the
>> whole shebang, how many shared buffers to allocate for managing access
>> to the data and to define the retention period of the data based on data
>> volume and/or age of the commit records.
>
> It would be nice if this could just be managed out of shared_buffers
> rather than needing to configure a separate pool just for this
> feature.

FWIW we've talked about this for years -- see old discussions about how
pg_subtrans becomes a bottleneck in certain cases and you want to
enlarge the number of buffers allocated to it (probably easy to find by
searching posts from Jignesh). I'm guessing the new notify code would
benefit from this as well.

It'd be nice to have as a side effect, but if not, IMHO this proposal
could simply use a fixed buffer pool like all other slru.c callers until
someone gets around to fixing that. Adding more GUC switches for this
strikes me as overkill.

--
Álvaro Herrera <alvherre@alvh.no-ip.org>

#14 Dan Ports
drkp@csail.mit.edu
In reply to: Jan Wieck (#1)
Re: Exposing the Xact commit order to the user

On Sun, May 23, 2010 at 04:21:58PM -0400, Jan Wieck wrote:

> In some systems (data warehousing, replication), the order of commits is
> important, since that is the order in which changes have become visible.
> This information could theoretically be extracted from the WAL, but
> scanning the entire WAL just to extract this tidbit of information would
> be excruciatingly painful.

This is very interesting to me as I've been doing some (research --
nowhere near production-level) work on building a transactional
application-level (i.e. memcached-like) cache atop Postgres. One of the
features I needed to support it was basically what you describe.

Without getting too far into the details of what I'm doing, I needed to
make it clear to a higher layer which commits were visible to a given
query. That is, I wanted to know both the order of commits and where
particular snapshots fit into this ordering. (A SnapshotData struct
obviously contains the visibility information, but a representation in
terms of the commit ordering is both more succinct and allows for easy
ordering comparisons).

Something you might want to consider, then, is adding an interface to
find out the timestamp of the current transaction's snapshot, i.e. the
timestamp of the most recent committed transaction visible to it. I
wouldn't expect this to be difficult to implement as transaction
completion/visibility is already synchronized via ProcArrayLock.

> Each record of the Transaction Commit Info consists of
>
> txid xci_transaction_id
> timestamptz xci_begin_timestamp
> timestamptz xci_commit_timestamp
> int64 xci_total_rowcount

Another piece of information that seems useful to provide here would be
the logical timestamp of the transaction, i.e. a counter that's
incremented by one for each transaction. But maybe that's implicit in
the log ordering?

I'm not clear on why the total rowcount is useful, but perhaps I'm
missing something obvious.

I've actually implemented some semblance of this on Postgres 8.2, but it
sounds like what you're interested in is more sophisticated. In
particular, I wasn't at all concerned with durability or WAL stuff, and
I had some specific requirements about when it was OK to purge the
data. Because of this (and very limited development time), I just
threw something together with a simple shared buffer.

I don't think I have any useful code to offer, but let me know if
there's some way I can help out.

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/

#15 Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#13)
Re: Exposing the Xact commit order to the user

On Mon, May 24, 2010 at 3:07 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> It'd be nice to have as a side effect, but if not, IMHO this proposal
> could simply use a fixed buffer pool like all other slru.c callers until
> someone gets around to fixing that. Adding more GUC switches for this
> strikes me as overkill.

I agree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#16 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#15)
Re: Exposing the Xact commit order to the user

Robert Haas wrote:

> It's a tool for replication solutions to use.

I was thrown by the original post referencing "data warehousing".
For replication I definitely see that it would be good to provide
some facility to grab a coherent snapshot out of the transaction
stream, but I'm still not clear on a use case where other solutions
aren't better. If you want a *particular* past snapshot, something
akin to the transactional caching that Dan Ports mentioned seems
best. If you just want a coherent snapshot like snapshot isolation,
the current mechanisms seem to work (unless I'm missing something?).
If you want solid data integrity querying the most recent replicated
data, the proposal I posted earlier in the thread is the best I can
see, so far.

> if the reconstructed transaction order inferred by SSI doesn't
> match the actual commit order, can we get a serialization anomaly
> on the standby by replaying transactions there in commit order?

Yes. If we don't do *something* to address it, the replicas
(slaves) will operate as read-only snapshot isolation, not true
serializable.

Can you give an example and explain how your proposal would solve
it?

I gave an example (without rigorous proof accompanying it, granted)
earlier in the thread. In that example, if you allow a selection
against a snapshot which includes the earlier commit (the update of
the control table) and before the later commits (the receipts which
used the old deposit date) you have exactly the kind of
serialization anomaly which the work in progress prevents on the
source (master) database -- the receipts *appear* to run in earlier
transactions because they see the pre-update deposit date, but they
show up out of order. As far as I'm concerned this is only a
problem if the user *requested* serializable behavior for all
transactions involved.

If we send the information I suggested in the WAL stream, then any
slave using the WAL stream could build a snapshot for a serializable
transaction which excluded serializable transactions from the source
which overlap with still-pending serializable transactions on the
source. In this example, the update of the control table would not
be visible to a serializable transaction on the slave until any
overlapping serializable transactions (which would include any
receipts using the old date) had also committed, so you could never
see the writes out of order.

I don't think that passing detailed predicate locking information
would be feasible from a performance perspective, but since the
slaves are read-only, I think it is fine to pass just the minimal
transaction-level information I described.
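
The waiting rule described above can be sketched in a few lines. This is a toy model, not PostgreSQL code; all class and method names are invented for illustration. The invariant it encodes: a committed serializable transaction becomes visible on the slave only once every serializable transaction that overlapped it has also committed or rolled back.

```python
class SlaveVisibility:
    """Toy model of the slave-side visibility rule (names invented)."""

    def __init__(self):
        self.active = set()    # serializable xacts with a snapshot, not yet finished
        self.pending = {}      # committed xid -> set of overlapping xids it waits for
        self.visible = set()   # xids safe to include in slave snapshots

    def snapshot_acquired(self, xid):
        self.active.add(xid)

    def rollback(self, xid):
        self.active.discard(xid)
        self._release(xid)

    def commit(self, xid):
        self.active.discard(xid)
        # xid stays invisible until every transaction still active
        # (i.e. concurrent with it) has finished.
        waiting_for = set(self.active)
        if waiting_for:
            self.pending[xid] = waiting_for
        else:
            self.visible.add(xid)
        self._release(xid)

    def _release(self, finished_xid):
        # finished_xid completed; unblock committed xacts that waited on it
        for xid in list(self.pending):
            self.pending[xid].discard(finished_xid)
            if not self.pending[xid]:
                del self.pending[xid]
                self.visible.add(xid)
```

In the deposit-date example, the control-table update and the overlapping receipts would all become visible together, so the slave can never observe them out of order.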

-Kevin

#17 Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#16)
Re: Exposing the Xact commit order to the user

On Mon, May 24, 2010 at 4:03 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Robert Haas  wrote:

It's a tool for replication solutions to use.

I was thrown by the original post referencing "data warehousing".
For replication I definitely see that it would be good to provide
some facility to grab a coherent snapshot out of the transaction
stream, but I'm still not clear on a use case where other solutions
aren't better.  If you want a *particular* past snapshot, something
akin to the transactional caching that Dan Ports mentioned seems
best.  If you just want a coherent snapshot like snapshot isolation,
the current mechanisms seem to work (unless I'm missing something?).
If you want solid data integrity querying the most recent replicated
data, the proposal I posted earlier in the thread is the best I can
see, so far.

Well, AIUI, what you're really trying to do is derive the delta
between an old snapshot and a newer snapshot.

Can you give an example and explain how your proposal would solve
it?

I gave an example (without rigorous proof accompanying it, granted)
earlier in the thread.  In that example, if you allow a selection
against a snapshot which includes the earlier commit (the update of
the control table) and before the later commits (the receipts which
used the old deposit date) you have exactly the kind of
serialization anomaly which the work in progress prevents on the
source (master) database -- the receipts *appear* to run in earlier
transactions because they see the pre-update deposit date, but they
show up out of order.

Yep, I see it now.

 As far as I'm concerned this is only a
problem if the user *requested* serializable behavior for all
transactions involved.

Agreed.

If we send the information I suggested in the WAL stream, then any
slave using the WAL stream could build a snapshot for a serializable
transaction which excluded serializable transactions from the source
which overlap with still-pending serializable transactions on the
source.  In this example, the update of the control table would not
be visible to a serializable transaction on the slave until any
overlapping serializable transactions (which would include any
receipts using the old date) had also committed, so you could never
see the writes out of order.

I don't think that passing detailed predicate locking information
would be feasible from a performance perspective, but since the
slaves are read-only, I think it is fine to pass just the minimal
transaction-level information I described.

I suspect that's still going to be sort of hard on performance, but
maybe we should get serializable working and committed on one node
first and then worry about how to distribute it. I think there might
be other approaches to this problem (global transaction coordinator?
standby requests snapshot from primary?).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#18 Jan Wieck
JanWieck@Yahoo.com
In reply to: Kevin Grittner (#10)
Re: Exposing the Xact commit order to the user

On 5/24/2010 12:51 PM, Kevin Grittner wrote:

Robert Haas wrote:

I think you're confusing two subtly different things.

The only thing I'm confused about is what benefit anyone expects to
get from looking at data between commits in some way other than our
current snapshot mechanism. Can someone explain a use case where
what Jan is proposing is better than snapshot isolation? It doesn't
provide any additional integrity guarantees that I can see.

But the commit order is still the order the effects of those
transactions have become visible - if we inserted a new read-only
transaction into the stream at some arbitrary point in time, it
would see all the transactions which committed before it and none
of those that committed afterward.

Isn't that what a snapshot does already?

It does, and the proposal is merely an alternative serving the same purpose.

Have you ever looked at one of those queries, that Londiste or Slony
issue against the provider DB in order to get all the log data that has
been committed between two snapshots? Is that really the best you can
think of?
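
To make the contrast concrete, here is a toy sketch (all names invented, not Slony or Londiste code) of what the consumer side looks like once every log row carries a commit-order serial: "everything committed between two sync points" collapses to a simple ordered range scan, instead of a query reconstructing visibility from two snapshots.

```python
# Toy commit-ordered log: (commit_serial, xid, change).
# Note the commit order need not match xid order.
commit_log = [
    (101, 900, "INSERT a"),
    (102, 898, "UPDATE b"),
    (103, 903, "DELETE c"),
]

def changes_between(log, last_consumed, upto):
    """Return changes with last_consumed < serial <= upto, in commit order."""
    return [chg for serial, _, chg in log if last_consumed < serial <= upto]

print(changes_between(commit_log, 101, 103))  # ['UPDATE b', 'DELETE c']
```

In SQL terms this would be a single range predicate on an indexed serial column, fetched in commit order, rather than the snapshot-difference queries Jan refers to.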

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

#19 Jan Wieck
JanWieck@Yahoo.com
In reply to: Dan Ports (#14)
Re: Exposing the Xact commit order to the user

On 5/24/2010 3:10 PM, Dan Ports wrote:

I'm not clear on why the total rowcount is useful, but perhaps I'm
missing something obvious.

It is a glimpse into the future. Several years of pain doing replication
work has taught me that knowing approximately how much work the next
chunk will be "before you select it all" is a really useful thing.
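
One plausible use (a sketch; the field name follows the proposal, the batching policy is invented) is letting a consumer pick a batch of transactions that stays under a work budget before issuing the expensive log-fetching query:

```python
def next_batch(commit_records, max_rows):
    """commit_records: (xid, xci_total_rowcount) pairs in commit order.
    Return the longest prefix whose summed rowcount fits the budget,
    always taking at least one transaction so progress is made."""
    batch, total = [], 0
    for xid, rowcount in commit_records:
        if batch and total + rowcount > max_rows:
            break
        batch.append(xid)
        total += rowcount
    return batch

print(next_batch([(10, 500), (11, 300), (12, 9000), (13, 40)], 1000))
# → [10, 11]
```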

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

#20 Dan Ports
drkp@csail.mit.edu
In reply to: Kevin Grittner (#8)
Re: Exposing the Xact commit order to the user

On Mon, May 24, 2010 at 10:24:07AM -0500, Kevin Grittner wrote:

Jan Wieck wrote:

In some systems (data warehousing, replication), the order of
commits is important, since that is the order in which changes
have become visible.

This issue intersects with the serializable work I've been doing.
While in database transactions using S2PL the above is true, in
snapshot isolation and the SSI implementation of serializable
transactions, it's not. In particular, the snapshot anomalies which
can cause non-serializable behavior happen precisely because the
apparent order of execution doesn't match anything so linear as
order of commit.

All true, but this doesn't pose a problem in snapshot isolation. Maybe
this is obvious to everyone else, but just to be clear: a transaction's
snapshot is determined entirely by which transactions committed before
it snapshotted (and hence are visible to it). Thus, replaying update
transactions in the same order on a slave makes the same sequence of
states visible to it.
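
That point can be demonstrated with a tiny simulation (a toy model, not PostgreSQL internals): if the replica applies the same update transactions in the same commit order, it passes through exactly the same sequence of visible states, since each snapshot is determined solely by the set of transactions committed before it.

```python
def apply_in_commit_order(initial, committed_txns):
    """committed_txns: list of dicts of key -> value writes, in commit order.
    Yields the visible state after each commit."""
    state = dict(initial)
    for writes in committed_txns:
        state.update(writes)
        yield dict(state)

txns = [{"x": 1}, {"y": 2}]
master_states = list(apply_in_commit_order({"x": 0}, txns))
slave_states = list(apply_in_commit_order({"x": 0}, txns))
assert master_states == slave_states  # identical sequence of states
```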

Of course (as in your example) some of these states could expose
snapshot isolation anomalies. But that's true on a single-replica
system too.

Now, stepping into the SSI world...

Replicating or recreating the whole predicate locking and conflict
detection on slaves is not feasible for performance reasons. (I
won't elaborate unless someone feels that's not intuitively
obvious.) The only sane way I can see to have a slave database allow
serializable behavior is to WAL-log the acquisition of a snapshot by
a serializable transaction, and the rollback or commit, on the
master, and to have the serializable snapshot build on a slave
exclude any serializable transactions for which there are still
concurrent serializable transactions. Yes, that does mean WAL-
logging the snapshot acquisition even if the transaction doesn't yet
have an xid, and WAL-logging the commit or rollback even if it never
acquires an xid.

One important observation is that any anomaly that occurs on the slave
can be resolved by aborting a local read-only transaction. This is a
good thing, because the alternatives are too horrible to consider.

You could possibly cut the costs of predicate locking by having the
master ship with each transaction the list of predicate locks it
acquired. But you'd still have to track locks for read-only
transactions, so maybe that's not a significant cost improvement. On
the other hand, if you're willing to pay the price of serializability
on the master, why not the slaves too?

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/

#21 Florian Pflug <fgp@phlo.org>, in reply to: Dan Ports (#20)
#22 Tom Lane <tgl@sss.pgh.pa.us>, in reply to: Florian Pflug (#21)
#23 Florian Pflug <fgp@phlo.org>, in reply to: Tom Lane (#22)
#24 Nicolas Barbier <nicolas.barbier@gmail.com>, in reply to: Dan Ports (#20)
#25 Simon Riggs <simon@2ndQuadrant.com>, in reply to: Jan Wieck (#1)
#26 Dan Ports <drkp@csail.mit.edu>, in reply to: Nicolas Barbier (#24)
#27 Florian Pflug <fgp@phlo.org>, in reply to: Dan Ports (#26)
#28 Kevin Grittner <Kevin.Grittner@wicourts.gov>, in reply to: Florian Pflug (#27)
#29 Dan Ports <drkp@csail.mit.edu>, in reply to: Florian Pflug (#27)
#30 Kevin Grittner <Kevin.Grittner@wicourts.gov>, in reply to: Jan Wieck (#18)
#31 Nicolas Barbier <nicolas.barbier@gmail.com>, in reply to: Dan Ports (#26)
#32 Florian Pflug <fgp@phlo.org>, in reply to: Dan Ports (#29)
#33 Nicolas Barbier <nicolas.barbier@gmail.com>, in reply to: Florian Pflug (#27)
#34 Nicolas Barbier <nicolas.barbier@gmail.com>, in reply to: Florian Pflug (#32)
#35 Kevin Grittner <Kevin.Grittner@wicourts.gov>, in reply to: Florian Pflug (#32)
#36 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Greg Sabino Mullane (#6)
#37 Tom Lane <tgl@sss.pgh.pa.us>, in reply to: Jan Wieck (#36)
#38 Kevin Grittner <Kevin.Grittner@wicourts.gov>, in reply to: Robert Haas (#17)
#39 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Simon Riggs (#25)
#40 Simon Riggs <simon@2ndQuadrant.com>, in reply to: Jan Wieck (#39)
#41 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Simon Riggs (#40)
#42 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Tom Lane (#37)
#43 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Kevin Grittner (#30)
#44 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Jan Wieck (#43)
#45 Bruce Momjian <bruce@momjian.us>, in reply to: Jan Wieck (#1)
#46 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Bruce Momjian (#45)
#47 Steve Singer <steve@ssinger.info>, in reply to: Jan Wieck (#46)
#48 Robert Haas <robertmhaas@gmail.com>, in reply to: Steve Singer (#47)
#49 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Bruce Momjian (#45)
#50 Bruce Momjian <bruce@momjian.us>, in reply to: Jan Wieck (#49)
#51 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>, in reply to: Jan Wieck (#46)
#52 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Heikki Linnakangas (#51)
#53 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>, in reply to: Jan Wieck (#52)
#54 Dimitri Fontaine <dimitri@2ndQuadrant.fr>, in reply to: Heikki Linnakangas (#53)
#55 Robert Haas <robertmhaas@gmail.com>, in reply to: Dimitri Fontaine (#54)
#56 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Heikki Linnakangas (#53)
#57 Kevin Grittner <Kevin.Grittner@wicourts.gov>, in reply to: Jan Wieck (#43)
#58 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Dimitri Fontaine (#54)
#59 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>, in reply to: Jan Wieck (#56)
#60 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Bruce Momjian (#50)
#61 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>, in reply to: Heikki Linnakangas (#59)
#62 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>, in reply to: Jan Wieck (#60)
#63 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Heikki Linnakangas (#61)
#64 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Heikki Linnakangas (#62)
#65 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Kevin Grittner (#57)
#66 Kevin Grittner <Kevin.Grittner@wicourts.gov>, in reply to: Jan Wieck (#65)
#67 Bruce Momjian <bruce@momjian.us>, in reply to: Bruce Momjian (#50)
#68 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Bruce Momjian (#67)
#69 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Jan Wieck (#68)
#70 Bruce Momjian <bruce@momjian.us>, in reply to: Jan Wieck (#41)
#71 Robert Haas <robertmhaas@gmail.com>, in reply to: Bruce Momjian (#70)
#72 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Bruce Momjian (#70)
#73 Bruce Momjian <bruce@momjian.us>, in reply to: Jan Wieck (#72)
#74 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Bruce Momjian (#73)
#75 Chris Browne <cbbrowne@acm.org>, in reply to: Jan Wieck (#1)
#76 Chris Browne <cbbrowne@acm.org>, in reply to: Kevin Grittner (#10)
#77 Bruce Momjian <bruce@momjian.us>, in reply to: Chris Browne (#75)
#78 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Bruce Momjian (#77)
#79 Bruce Momjian <bruce@momjian.us>, in reply to: Jan Wieck (#78)
#80 Kevin Grittner <Kevin.Grittner@wicourts.gov>, in reply to: Jan Wieck (#78)
#81 Chris Browne <cbbrowne@acm.org>, in reply to: Jan Wieck (#1)
#82 Chris Browne <cbbrowne@acm.org>, in reply to: Jan Wieck (#78)
#83 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Bruce Momjian (#79)
#84 Bruce Momjian <bruce@momjian.us>, in reply to: Jan Wieck (#78)
#85 Kevin Grittner <Kevin.Grittner@wicourts.gov>, in reply to: Bruce Momjian (#84)
#86 Kevin Grittner <Kevin.Grittner@wicourts.gov>, in reply to: Kevin Grittner (#85)
#87 Bruce Momjian <bruce@momjian.us>, in reply to: Jan Wieck (#83)
#88 Bruce Momjian <bruce@momjian.us>, in reply to: Bruce Momjian (#87)
#89 Alvaro Herrera <alvherre@2ndquadrant.com>, in reply to: Chris Browne (#81)
#90 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Bruce Momjian (#84)
#91 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Kevin Grittner (#86)
#92 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Alvaro Herrera (#89)
#93 Robert Haas <robertmhaas@gmail.com>, in reply to: Bruce Momjian (#87)
#94 Bruce Momjian <bruce@momjian.us>, in reply to: Bruce Momjian (#87)
#95 Bruce Momjian <bruce@momjian.us>, in reply to: Robert Haas (#93)
#96 Robert Haas <robertmhaas@gmail.com>, in reply to: Bruce Momjian (#95)
#97 Alvaro Herrera <alvherre@2ndquadrant.com>, in reply to: Jan Wieck (#92)
#98 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Bruce Momjian (#95)
#99 Jan Wieck <JanWieck@Yahoo.com>, in reply to: Alvaro Herrera (#97)
#100 Marko Kreen <markokr@gmail.com>, in reply to: Robert Haas (#96)
#101 Alvaro Herrera <alvherre@2ndquadrant.com>, in reply to: Marko Kreen (#100)
#102 Simon Riggs <simon@2ndQuadrant.com>, in reply to: Jan Wieck (#1)