Snapshot synchronization, again...
The snapshot synchronization discussion from the parallel pg_dump
patch somehow ended without a clear way forward.
Let me sum up what has been brought up and propose a short- and
long-term solution.
Summary:
Passing snapshot sync information can be done either:
a) by returning complete snapshot information from the backend to the
client so that the client can pass it along to a different backend
b) or by returning only a unique identifier to the client and storing
the actual snapshot data somewhere on the server side
Advantage of a: no memory is used in the backend and no memory needs
to be cleaned up. It is also theoretically possible that we could
forward that data to a hot standby server and do e.g. a dump partially
on the master server and partially on the hot standby server, or spread
it among several hot standby servers.
Disadvantage of a: The snapshot must be validated to make sure that
its information is still current, and it might be difficult to cover all
cases of this validation. A client can access not only exactly a
published snapshot, but just about any snapshot that fits and passes
the validation checks (this is more a disadvantage than an advantage,
because it allows seeing a database state that never existed in
reality).
Advantage of b: No validation necessary; as soon as the transaction
that publishes the snapshot loses that snapshot, it will also revoke
the snapshot information (either by removing a temp file or deleting
it from shared memory).
Disadvantage of b: It doesn't allow a snapshot to be installed on a
different server. It requires a serializable open transaction to hold
the snapshot.
What I am proposing now is the following:
We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.
This only costs us a few bytes for the checksum * max_connections in
shared memory, and apart from resetting the checksum it has no
cleanup and verification issues. Note that we are also free to change
the internal format of the chunk of data we return whenever we like,
so we are free to enhance this feature in the future, transparently to
the client.
Thoughts?
Joachim
Excerpts from Joachim Wieland's message of Thu Dec 30 09:31:47 -0300 2010:
Advantage of b: No validation necessary, as soon as the transaction
that publishes the snapshot loses that snapshot, it will also revoke
the snapshot information (either by removing a temp file or deleting
it from shared memory)
Disadvantage of b: It doesn't allow a snapshot to be installed on a
different server. It requires a serializable open transaction to hold
the snapshot.
Why does it require a serializable transaction? You could simply
register the snapshot in any transaction. (Of course, the net effect
would be pretty similar to a serializable transaction).
We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.
This only costs us a few bytes for the checksum * max_connections in
shared memory and apart from resetting the checksum it does not have
cleanup and verification issues.
So one registered snapshot per transaction? Sounds a reasonable
limitation (I doubt there's a use case for more than that, anyway).
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Dec 30, 2010, at 13:31, Joachim Wieland wrote:
We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.
We'd still have to stream these checksums to the standbys though,
or would they be exempt from the checksum checks?
I still wonder whether these checks are worth the complexity. I
believe we'd only allow snapshot modifications for read-only queries
anyway, so what point is there in preventing clients from setting
broken snapshots?
best regards,
Florian Pflug
On 30.12.2010 16:49, Florian Pflug wrote:
On Dec 30, 2010, at 13:31, Joachim Wieland wrote:
We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.
We'd still have to stream these checksums to the standbys though,
or would they be exempt from the checksum checks?
I still wonder whether these checks are worth the complexity. I
believe we'd only allow snapshot modifications for read-only queries
anyway, so what point is there in preventing clients from setting
broken snapshots?
Hmm, our definition of "read-only" is a bit fuzzy. While a transaction
doesn't modify the database itself, it could still send NOTIFYs or call
a PL function to do all sorts of things outside the database. Imagine
that you're paranoid about data integrity, and have a security definer
function that runs cross checks on the data. If it finds any
anomalies, it wakes up the operator, forces a shutdown, or similar.
Now a malicious user could set a snapshot that passes the basic validity
checks, i.e. xmin >= GlobalXmin, but contains a combination of still
in-progress transactions that never existed in reality. If he then calls the
paranoia-function, it would see an inconsistent state of committed
tuples and get upset.
Maybe that's a bit far-fetched, but it's not entirely clear that
running with an inconsistent snapshot is harmless.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, Dec 30, 2010 at 9:40 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Disadvantage of b: It doesn't allow a snapshot to be installed on a
different server. It requires a serializable open transaction to hold
the snapshot.
Why does it require a serializable transaction? You could simply
register the snapshot in any transaction. (Of course, the net effect
would be pretty similar to a serializable transaction).
I am not assuming that the publishing transaction blocks until its
snapshot is being picked up. A read committed transaction would get a
new snapshot for every other query, so the published snapshot is no
longer represented by an actual backend until it is picked up by
one. Since nobody is holding off xmin/GlobalXmin, eventually vacuum
would remove tuples that the published-but-not-yet-picked-up snapshot
should still be able to see, no?
Joachim
On Thu, Dec 30, 2010 at 9:49 AM, Florian Pflug <fgp@phlo.org> wrote:
On Dec 30, 2010, at 13:31, Joachim Wieland wrote:
We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.
We'd still have to stream these checksums to the standbys though,
or would they be exempt from the checksum checks?
I am not talking about having synchronized snapshots among standby
servers at all.
I am only proposing a client API that will work for this future idea as well.
I still wonder whether these checks are worth the complexity. I
believe we'd only allow snapshot modifications for read-only queries
anyway, so what point is there in preventing clients from setting
broken snapshots?
What's the use case for it? As long as nobody comes up with a
reasonable use case, let's aim for the robust version.
Joachim
On 12/30/2010 10:45 PM, Heikki Linnakangas wrote:
On 30.12.2010 16:49, Florian Pflug wrote:
On Dec 30, 2010, at 13:31, Joachim Wieland wrote:
We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.
We'd still have to stream these checksums to the standbys though,
or would they be exempt from the checksum checks?
I still wonder whether these checks are worth the complexity. I
believe we'd only allow snapshot modifications for read-only queries
anyway, so what point is there in preventing clients from setting
broken snapshots?
Hmm, our definition of "read-only" is a bit fuzzy. While a transaction
doesn't modify the database itself, it could still send NOTIFYs or call
a PL function to do all sorts of things outside the database. Imagine
that you're paranoid about data integrity, and have a security definer
function that runs cross checks on the data. If it finds any
anomalies, it wakes up the operator, forces a shutdown, or similar.
are people actually doing that in reality? I'm also having a hard time
picturing a realistic example of what that "data integrity check"
function would actually be able to check under the default isolation mode
with concurrent activity...
Now a malicious user could set a snapshot that passes the basic validity
checks, i.e. xmin >= GlobalXmin, but contains a combination of still
in-progress transactions that never existed in reality. If he then calls the
paranoia-function, it would see an inconsistent state of committed
tuples and get upset.
sure, but I would expect setting a snapshot to require either
superuser or some sort of "WITH SNAPSHOT" grant thingy - and in general
there are much more trivial and less obscure scenarios in which a "normal"
user can cause the admin to get paged :)
Maybe that's a bit far-fetched, but it's not entirely clear that
running with an inconsistent snapshot is harmless.
well, there has been some discussion around the SSI stuff that we might
want to reconsider our definition of "read-only"; maybe that would be the
right way to approach the problem?
Stefan
Excerpts from Joachim Wieland's message of Fri Dec 31 00:15:57 -0300 2010:
On Thu, Dec 30, 2010 at 9:40 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Disadvantage of b: It doesn't allow a snapshot to be installed on a
different server. It requires a serializable open transaction to hold
the snapshot.
Why does it require a serializable transaction? You could simply
register the snapshot in any transaction. (Of course, the net effect
would be pretty similar to a serializable transaction).
I am not assuming that the publishing transaction blocks until its
snapshot is being picked up. A read committed transaction would get a
new snapshot for every other query, so the published snapshot is no
longer represented by an actual backend until it is being picked up by
one. Since nobody is holding off xmin/GlobalXmin, eventually vacuum
would remove tuples that the published-but-not-yet-picked-up snapshot
should still be able to see, no?
A backend can have any number of snapshots registered, and those don't
allow GlobalXmin to advance. Consider an open cursor, for example.
Even if the rest of the transaction is read committed, the snapshot
registered by the open cursor still holds back GlobalXmin. My
(handwavy) idea is that whenever the transaction calls
pg_publish_snapshot(), said snapshot is registered, which makes it safe
to use even if the transaction continues to operate and obtain newer
snapshots.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, Dec 31, 2010 at 8:28 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
A backend can have any number of snapshots registered, and those don't
allow GlobalXmin to advance. Consider an open cursor, for example.
Even if the rest of the transaction is read committed, the snapshot
registered by the open cursor still holds back GlobalXmin. My
(handwavy) idea is that whenever the transaction calls
pg_publish_snapshot(), said snapshot is registered, which makes it safe
to use even if the transaction continues to operate and obtain newer
snapshots.
Cool, even better that this is taken care of already :-)
Thanks,
Joachim
On Thu, Dec 30, 2010 at 7:31 AM, Joachim Wieland <joe@mcknight.de> wrote:
What I am proposing now is the following:
We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.
So here's the patch implementing this idea.
I named the user visible functions pg_export_snapshot() and
pg_import_snapshot().
A user starts a transaction and calls pg_export_snapshot() to get a
chunk of bytea data. In a different connection he opens a transaction in
isolation level serializable and passes that chunk of data into
pg_import_snapshot(). Subsequent queries of the second connection then see the
snapshot that was current when pg_export_snapshot() was executed. If
both transactions are at isolation level serializable, both see the same
data from that moment on (this is the case for pg_dump).
There are most probably a few loose ends, and someone more familiar with
this area than me needs to look into it, but at least everybody should be
happy now with the overall approach.
These are the implementation details and restrictions of the patch:
The exporting transaction
- should be read-only (to discourage people from expecting that writes of
the exporting transaction can be seen by the importing transaction)
- must not be a subtransaction (we don't add subxips of our own transaction
to the snapshot, so importing the snapshot later would result in missing
subxips)
- adds its own xid (if any) to the xip-array
The importing transaction
- will not import a snapshot of the same backend (even though it would
probably work)
- will not import a snapshot of a different database in the cluster
- should be isolation level serializable
- must not be a subtransaction (can we completely rollback on
subxact-rollback?)
- leaves curcid as is, otherwise effects of previous commands would get lost
and magically appear later when curcid increases
- applies xmin, xmax, xip, subxip values of the exported snapshot to
GetActiveSnapshot() and GetTransactionSnapshot()
- takes itself out of the xip array
- updates MyProc->xmin, but only moves it backwards, never forwards
The snapshot is invalidated on transaction commit/rollback as well as when doing
"prepare transaction".
Comments welcome.
Joachim
Attachments:
pg_import_export_snapshot.diff (text/x-patch; charset=US-ASCII; +308 -0)
On Jan 7, 2011, at 12:41, Joachim Wieland wrote:
On Thu, Dec 30, 2010 at 7:31 AM, Joachim Wieland <joe@mcknight.de> wrote:
These are the implementation details and restrictions of the patch:
The exporting transaction
- should be read-only (to discourage people from expecting that writes of
the exporting transaction can be seen by the importing transaction)
That seems overly strict. Wouldn't a warning about this in the docs suffice?
- must not be a subtransaction (we don't add subxips of our own transaction
to the snapshot, so importing the snapshot later would result in missing
subxips)
If that is necessary, wouldn't a sub-transaction that was sub-committed before
exporting the snapshot also pose a problem? The snapshot wouldn't include
the sub-transaction's xid, so once the exporting transaction has committed any
importing transaction would suddenly see the sub-transaction's changes.
- adds its own xid (if any) to the xip-array
Per the previous paragraph, I believe it should also add the xid's of all
non-aborted sub-transactions to the subxip-array, overflowing the array if
necessary.
The importing transaction
- will not import a snapshot of the same backend (even though it would
probably work)
- will not import a snapshot of a different database in the cluster
- should be isolation level serializable
- must not be a subtransaction (can we completely rollback on
subxact-rollback?)
Sounds fine so far.
- leaves curcid as is, otherwise effects of previous commands would get lost
and magically appear later when curcid increases
- applies xmin, xmax, xip, subxip values of the exported snapshot to
GetActiveSnapshot() and GetTransactionSnapshot()
- takes itself out of the xip array
- updates MyProc->xmin but sets it only backwards but not forwards
Changing a snapshot mid-transaction like that seems risky. I'd suggest
simply decreeing that an importing transaction must be read-only and must
not yet have set its snapshot. So effectively, the only allowed calling
sequence would be
begin;
set transaction isolation level serializable read only;
pg_import_snapshot(...);
best regards,
Florian Pflug
Hello Joachim,
I'm reviewing this patch for CommitFest 2011-01.
On Fri, Jan 07, 2011 at 06:41:38AM -0500, Joachim Wieland wrote:
On Thu, Dec 30, 2010 at 7:31 AM, Joachim Wieland <joe@mcknight.de> wrote:
What I am proposing now is the following:
We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.
I like this design. It makes the backend's use of resources to track exported
snapshots very simple. It also avoids pessimizing the chances that we can, at
some point, synchronize snapshots across various standby clusters without
changing the user-facing API.
So here's the patch implementing this idea.
I named the user visible functions pg_export_snapshot() and
pg_import_snapshot().
A user starts a transaction and calls pg_export_snapshot() to get a
chunk of bytea data. In a different connection he opens a transaction in
isolation level serializable and passes that chunk of data into
pg_import_snapshot(). Now subsequent queries of the second connection see the
snapshot that was current when pg_export_snapshot() has been executed. In case
both transactions are in isolation level serializable then both see the same
data from this moment on (this is the case of pg_dump).
There are most probably a few loose ends, and someone more familiar with
this area than me needs to look into it, but at least everybody should be
happy now with the overall approach.
These are the implementation details and restrictions of the patch:
The exporting transaction
- should be read-only (to discourage people from expecting that writes of
the exporting transaction can be seen by the importing transaction)
- must not be a subtransaction (we don't add subxips of our own transaction
to the snapshot, so importing the snapshot later would result in missing
subxips)
- adds its own xid (if any) to the xip-array
The importing transaction
- will not import a snapshot of the same backend (even though it would
probably work)
- will not import a snapshot of a different database in the cluster
- should be isolation level serializable
- must not be a subtransaction (can we completely rollback on
subxact-rollback?)
I don't know, but this restriction does not seem onerous.
- leaves curcid as is, otherwise effects of previous commands would get lost
and magically appear later when curcid increases
- applies xmin, xmax, xip, subxip values of the exported snapshot to
GetActiveSnapshot() and GetTransactionSnapshot()
- takes itself out of the xip array
- updates MyProc->xmin, but only moves it backwards, never forwards
The snapshot is invalidated on transaction commit/rollback as well as when doing
"prepare transaction".
Comments welcome.
General points:
Basics: patch has the correct format and applies cleanly (offset in pg_proc.h).
It includes no tests, but perhaps it should not. Really testing this requires
concurrency, and our test framework is not set up for that. We could add some
cheap tests to ensure the functions fail cleanly in some basic situations, but
that would be about it. The patch does not, but should, document the new
functions. The new functions also need comments preceding their definitions,
similar to those found on all other functions in procarray.c.
You add four elog() calls, all of which should be ereport() to facilitate
translation. Also consider suitable error codes. The error messages do not
follow style; in particular, they begin with capitals. See the style guide,
http://www.postgresql.org/docs/current/interactive/error-style-guide.html.
ImportSnapshot() has no paths that justify returning false. It should always
return true or throw a suitable error.
See below for various specific comments.
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 95beba8..c24150f 100644
*** a/src/backend/storage/ipc/ipci.c
--- b/src/backend/storage/ipc/ipci.c
*************** CreateSharedMemoryAndSemaphores(bool mak
*** 124,129 ****
--- 124,130 ----
          size = add_size(size, BTreeShmemSize());
          size = add_size(size, SyncScanShmemSize());
          size = add_size(size, AsyncShmemSize());
+         size = add_size(size, SyncSnapshotShmemSize());
  #ifdef EXEC_BACKEND
          size = add_size(size, ShmemBackendArraySize());
  #endif
*************** CreateSharedMemoryAndSemaphores(bool mak
*** 228,233 ****
--- 229,235 ----
      BTreeShmemInit();
      SyncScanShmemInit();
      AsyncShmemInit();
+     SyncSnapshotInit();

  #ifdef EXEC_BACKEND
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 980996e..e851bcd 100644
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 50,60 ****
--- 50,62 ----
  #include "access/transam.h"
  #include "access/xact.h"
  #include "access/twophase.h"
+ #include "libpq/md5.h"
  #include "miscadmin.h"
  #include "storage/procarray.h"
  #include "storage/spin.h"
  #include "storage/standby.h"
  #include "utils/builtins.h"
+ #include "utils/bytea.h"
  #include "utils/snapmgr.h"

*************** static int KnownAssignedXidsGetAndSetXmi
*** 159,164 ****
--- 161,170 ----
  static TransactionId KnownAssignedXidsGetOldestXmin(void);
  static void KnownAssignedXidsDisplay(int trace_level);

+ typedef char snapshotChksum[16];
+ static snapshotChksum *syncSnapshotChksums;
+ static Snapshot exportedSnapshot;
+
  /*
   * Report shared-memory space needed by CreateSharedProcArray.
   */
*************** KnownAssignedXidsDisplay(int trace_level
*** 3065,3067 ****
--- 3071,3335 ----

      pfree(buf.data);
  }
+
+
+ /*
+  * Report space needed for our shared memory area, which is basically an
+  * md5 checksum per connection.
+  */
+ Size
+ SyncSnapshotShmemSize(void)
+ {
+     return PROCARRAY_MAXPROCS * sizeof(snapshotChksum);
+ }
+
+ void
+ SyncSnapshotInit(void)
+ {
+     Size    size;
+     bool    found;
+     int     i;
+
+     size = SyncSnapshotShmemSize();
+
+     syncSnapshotChksums = (snapshotChksum *) ShmemInitStruct("SyncSnapshotChksums",
+                                                              size, &found);
+     if (!found)
+         for (i = 0; i < PROCARRAY_MAXPROCS; i++)
+             memset(syncSnapshotChksums[i], 0, sizeof(snapshotChksum));
+ }
No need for a loop:
memset(syncSnapshotChksums, 0, PROCARRAY_MAXPROCS * sizeof(snapshotChksum));
"00000000000000000000" is a valid md5 message digest. To avoid always accepting
a snapshot with that digest, we would need to separately store a flag indicating
whether a given syncSnapshotChksums member is currently valid. Maybe that hole
is acceptable, though.
+
+ void
+ InvalidateExportedSnapshot(void)
+ {
+     if (syncSnapshotChksums[MyBackendId][0] == '\0')
+         return;
The first byte of a valid message digest will regularly be zero.
+
+     memset(syncSnapshotChksums[MyBackendId], 0, sizeof(snapshotChksum));
+
+     UnregisterSnapshotFromOwner(exportedSnapshot, TopTransactionResourceOwner);
+     exportedSnapshot = NULL;
+ }
+
+ static void
+ snapshot_appendf(char **buffer, int *bufsize_filled, int *bufsize_left, char *fmt, ...)
+ {
+     va_list ap;
+     int     ret;
+
+     if (*bufsize_left < 64)
+     {
+         /* enlarge buffer by 1024 bytes */
+         *buffer = repalloc(*buffer, *bufsize_filled + 1024);
+         *bufsize_left = *bufsize_filled + 1024;
+     }
+
+     va_start(ap, fmt);
+     ret = vsnprintf(*buffer + *bufsize_filled, *bufsize_left, fmt, ap);
+     va_end(ap);
+
+     /* There shouldn't be any error, we leave enough room for the data that we
+      * are writing (only numbers basically). */
+     Assert(ret < *bufsize_left && ret > 0);
+
+     *bufsize_left -= ret;
+     *bufsize_filled += ret;
+
+     Assert(strlen(*buffer) == *bufsize_filled);
+ }
+
+ bytea *
+ ExportSnapshot(Snapshot snapshot)
+ {
+     int     bufsize = 1024;
+     int     bufsize_filled = 0;     /* doesn't include NUL byte */
+     int     bufsize_left = bufsize - bufsize_filled;
+     char   *buf = (char *) palloc(bufsize);
+     snapshotChksum cksum;
+     int     i;
+
+     /* In a subtransaction we don't see our subxip values in the snapshot so
+      * they would be missing in the backend applying it.
+      */
+     if (GetCurrentTransactionNestLevel() != 1)
+         elog(ERROR, "Can only export a snapshot from a top level transaction.");
+
+     /* Write up all the data that we return */
+     snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+                      "did:%d ", MyDatabaseId);
+     snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+                      "bid:%d ", MyBackendId);
+     snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+                      "xmi:%d ", snapshot->xmin);
+     snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+                      "xma:%d ", snapshot->xmax);
+     for (i = 0; i < snapshot->xcnt; i++)
+         snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+                          "xip:%d ", snapshot->xip[i]);
+     /*
+      * Finally add our own XID if we have one, since by definition we will
+      * still be running when the other transaction takes over the snapshot.
+      */
+     if (TransactionIdIsValid(GetTopTransactionIdIfAny()))
+         snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+                          "xip:%d ", GetTopTransactionIdIfAny());
+     if (snapshot->suboverflowed)
+         snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+                          "sof:1 ");
+     else
+         for (i = 0; i < snapshot->subxcnt; i++)
+             snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+                              "sxp:%d ", snapshot->subxip[i]);
+
+     /* Register the snapshot */
+     snapshot = RegisterSnapshotOnOwner(snapshot, TopTransactionResourceOwner);
+     exportedSnapshot = snapshot;
+
+     pg_md5_binary(buf, bufsize_filled, (void *) &cksum);
+
+     memcpy(syncSnapshotChksums[MyBackendId], &cksum, sizeof(snapshotChksum));
No need for an intermediate variable:
pg_md5_binary(buf, bufsize_filled, &syncSnapshotChksums[MyBackendId]);
+
+     if (!XactReadOnly)
+         elog(WARNING, "A snapshot exporting function should be readonly.");
There are legitimate use cases for copying the snapshot of a read-write
transaction. Consider a task like calculating some summaries for insert into a
table. To speed this up, you might notionally partition the source data and
have each of several workers calculate each partition. I'd vote for removing
this warning entirely.
If we do keep the warning, it's the transaction that should be read-only, not a
"function". My first reaction to the text was "why not?". If the documentation
for pg_export_snapshot explains, that might be enough. Primary wording like
"importing transactions will not see changes from this transaction" with wording
like yours in a HINT might help.
+
+     return DatumGetByteaP(DirectFunctionCall1(byteain, CStringGetDatum(buf)));
+ }
+
+ static Oid
+ parseOidFromText(char **s, const char *prefix)
+ {
+     char   *n, *p = strstr(*s, prefix);
+     Oid     oid;
+
+     if (!p)
+         return InvalidOid;
+     p += strlen(prefix);
+     n = strchr(p, ' ');
+     if (!n)
+         return InvalidOid;
+     *n = '\0';
+     oid = DatumGetObjectId(DirectFunctionCall1(oidin, CStringGetDatum(p)));
+     *s = n + 1;
+     return oid;
+ }
+
+ static bool
+ parseBoolFromText(char **s, const char *prefix)
+ {
+     /* It's safe to overload parseOid to parse 0/1. This returns false if the
+      * entry could not be found, at least as long as InvalidOid is defined to be 0. */
+     Assert(InvalidOid == (Oid) 0);
+     return (bool) parseOidFromText(s, prefix);
+ }
+
+ static TransactionId
+ parseXactFromText(char **s, const char *prefix)
+ {
+     char   *n, *p = strstr(*s, prefix);
+     TransactionId xid;
+
+     if (!p)
+         return InvalidTransactionId;
+     p += strlen(prefix);
+     n = strchr(p, ' ');
+     if (!n)
+         return InvalidTransactionId;
+     *n = '\0';
+     xid = DatumGetTransactionId(DirectFunctionCall1(xidin, CStringGetDatum(p)));
+     *s = n + 1;
+     return xid;
+ }
+
+ bool
+ ImportSnapshot(bytea *snapshotData, Snapshot snapshot)
+ {
+     snapshotChksum cksum;
+     Oid     databaseId;
+     Oid     backendId;
+     int     i;
+     TransactionId xid;
+     int     len = VARSIZE_ANY_EXHDR(snapshotData);
+     char   *s = palloc(len + 1);
+
+     if (GetCurrentTransactionNestLevel() != 1)
+         elog(ERROR, "Can only import a snapshot to a top level transaction.");
+
+     Assert(len > 0);
This is user-facing; "SELECT pg_import_snapshot('')" must not trigger an
assertion failure.
+     strncpy(s, VARDATA_ANY(snapshotData), len);
+     s[len] = '\0';
+
+     pg_md5_binary(s, strlen(s), (void *) &cksum);
You know strlen(s) == len; no need to recalculate.
+
+     if ((databaseId = parseOidFromText(&s, "did:")) == InvalidOid)
+         return false;
+     if (databaseId != MyDatabaseId)
+         return false;
+     if ((backendId = parseOidFromText(&s, "bid:")) == InvalidOid)
+         return false;
+     if (backendId == MyBackendId)
+         return false;
InvalidBackendId is -1. Is InvalidOid (0) in fact also invalid as a backend ID?
My backend IDs seem to start at 1 in practice. However, by inserting a
too-large bid, an attacker could have us overrun syncSnapshotChksums. For that,
we need to test 0 <= bid < PROCARRAY_MAXPROCS. Example crasher:
select pg_import_snapshot('did:16384 bid:4000000000 xmi:795 xma:795');
I'm thinking the order of validation should be:
1. Extract bid.
2. Validate bid. Throw error on out-of-bounds.
3. Confirm checksum. Throw error on mismatch.
4. Extract did. Assert databaseId != InvalidOid.
5. Error if databaseId != MyDatabaseId.
+
+     /*
+      * Lock considerations:
+      *
+      * syncSnapshotChksums[backendId] is only changed by the backend with ID
+      * backendId and read by another backend that is asked to import a
+      * snapshot.
+      * syncSnapshotChksums[backendId] contains either NUL bytes or the checksum
+      * of a valid snapshot.
+      * Every comparison to the checksum while it is being written or
+      * deleted is okay to fail, so we don't need to take a lock at all.
+      */
+
+     if (memcmp(&cksum, syncSnapshotChksums[backendId], sizeof(snapshotChksum)))
+         return false;
+
+     snapshot->xmin = parseXactFromText(&s, "xmi:");
+     Assert(snapshot->xmin != InvalidTransactionId);
+     snapshot->xmax = parseXactFromText(&s, "xma:");
+     Assert(snapshot->xmax != InvalidTransactionId);
+
+     i = 0;
+     while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
+     {
+         if (xid == GetTopTransactionIdIfAny())
+             continue;
+         snapshot->xip[i++] = xid;
This works on a virgin GetSnapshotData() snapshot, which always has enough space
(sized according to max_connections). A snapshot from CopySnapshot(), however,
has just enough space for the original usage. I get a SIGSEGV at this line with
the attached test script. If I change things around so xip starts non-NULL, I
get messages like these as the code scribbles past the end of the allocation:
WARNING: problem in alloc set TopTransactionContext: detected write past chunk end in block 0xc92f10, chunk 0xc93280
WARNING: problem in alloc set TopTransactionContext: detected write past chunk end in block 0xc92f10, chunk 0xc93350
The last CopySnapshot before the crash had this stack trace:
#0 CopySnapshot (snapshot=0xc27980) at snapmgr.c:203
#1 0x00000000008109e5 in PushActiveSnapshot (snap=0xc27980) at snapmgr.c:283
#2 0x00000000006e1c13 in PortalStart (portal=0xcaf7b8, params=0x0, snapshot=0x0) at pquery.c:508
#3 0x00000000006dc411 in exec_simple_query (query_string=0xd3ee38 "SELECT pg_import_snapshot('did:16384 bid:2 xmi:756 xma:756 xip:756 ')") at postgres.c:1020
#4 0x00000000006e0633 in PostgresMain (argc=2, argv=0xcb9760, username=0xc90658 "nm") at postgres.c:3939
#5 0x0000000000695520 in BackendRun (port=0xcbb7d0) at postmaster.c:3578
#6 0x0000000000694bec in BackendStartup (port=0xcbb7d0) at postmaster.c:3263
#7 0x0000000000691f85 in ServerLoop () at postmaster.c:1449
#8 0x000000000069174f in PostmasterMain (argc=1, argv=0xc8af30) at postmaster.c:1110
#9 0x000000000060e4e2 in main (argc=1, argv=0xc8af30) at main.c:199
Do you have a recipe for a getting ImportSnapshot to see a non-CopySnapshot
snapshot that I could use to continue testing? All the things I tried ended
this way.
+    }
+    snapshot->xcnt = i;
+
+    /*
+     * We only write "sof:1" if the snapshot overflowed. If not, then there
+     * is no "sof:x" entry at all and parseBoolFromText() will just return
+     * false.
+     */
+    snapshot->suboverflowed = parseBoolFromText(&s, "sof:");
+
+    i = 0;
+    while ((xid = parseXactFromText(&s, "sxp:")) != InvalidTransactionId)
+        snapshot->subxip[i++] = xid;
+    snapshot->subxcnt = i;
+
+    /* Leave snapshot->curcid as is. */
+
+    /*
+     * No ProcArrayLock held here, we assume that a write is atomic. Also
+     * note that MyProc->xmin can go backwards here. However this is safe
+     * because the xmin we set here is the same as in the backend's
+     * proc->xmin whose snapshot we are copying. At this very moment,
+     * anybody computing a minimum will calculate at least this xmin as the
+     * overall xmin with or without us setting MyProc->xmin to this value.
Though unlikely, the other backend may not even exist by now. Indeed, that
could have happened and RecentGlobalXmin advanced since we compared the
checksum. Not sure what the right locking is here, but something is needed.
+     * Instead we must check to not go forward (we might have opened a
+     * cursor in this transaction and still have its snapshot registered)
+     */
I'm thinking this case should yield an ERROR. Otherwise, our net result would
be to silently adopt a snapshot that is neither our old self nor the target.
Maybe there's a use case for that, but none comes to my mind.
+    if (TransactionIdPrecedes(snapshot->xmin, MyProc->xmin))
+        MyProc->xmin = snapshot->xmin;
+
+    /*
+     * If we are in read committed mode then the next query will execute
+     * with a new snapshot making this function call quite useless.
+     */
+    if (!IsolationUsesXactSnapshot())
+        elog(WARNING, "A snapshot importing transaction should be ISOLATION LEVEL SERIALIZABLE");
The message should convey that the test (correctly) allows either REPEATABLE
READ or SERIALIZABLE.
I guess the use case for pg_import_snapshot from READ COMMITTED would be
something like "DO $$BEGIN PERFORM pg_import_snapshot('...'); stuff; END$$;".
First off, would that work ("stuff" use the imported snapshot)? If it does
work, we should take the precedent of SET LOCAL and issue no warning. If it
doesn't work, then we should document that the snapshot only takes effect until
the next command and probably keep this warning or something similar.
This warning can appear twice due to the two calls to ImportSnapshot() in
pg_import_snapshot(); that's a bit ugly.
+
+    return true;
+ }
+

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 273d8bd..75315bb 100644
*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
***************
*** 29,34 ****
--- 29,35 ----
  #include "access/xact.h"
  #include "storage/proc.h"
  #include "storage/procarray.h"
+ #include "utils/builtins.h"
  #include "utils/memutils.h"
  #include "utils/resowner.h"
*************** AtEarlyCommit_Snapshot(void)
*** 523,528 ****
--- 524,530 ----
                TopTransactionResourceOwner);
      registered_xact_snapshot = false;
+     InvalidateExportedSnapshot();
  }

*************** AtEOXact_Snapshot(bool isCommit)
*** 559,561 ****
--- 561,585 ----
      FirstSnapshotSet = false;
      registered_xact_snapshot = false;
  }
+
+ Datum
+ pg_export_snapshot(PG_FUNCTION_ARGS)
+ {
+     bytea  *snapshotData = ExportSnapshot(GetTransactionSnapshot());
+
+     PG_RETURN_BYTEA_P(snapshotData);
+ }
+
+ Datum
+ pg_import_snapshot(PG_FUNCTION_ARGS)
+ {
+     bytea  *snapshotData = PG_GETARG_BYTEA_P(0);
+     bool    ret = true;
+
+     if (ActiveSnapshotSet())
+         ret = ImportSnapshot(snapshotData, GetActiveSnapshot());
+
+     ret &= ImportSnapshot(snapshotData, GetTransactionSnapshot());
Is it valid to scribble directly on snapshots like this?
+
+     PG_RETURN_BOOL(ret);
+ }
+

diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index c624243..fa06f07 100644
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DATA(insert OID = 2171 ( pg_cancel_backe
*** 3375,3380 ****
--- 3375,3384 ----
  DESCR("cancel a server process' current query");
  DATA(insert OID = 2096 ( pg_terminate_backend PGNSP PGUID 12 1 0 0 f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ pg_terminate_backend _null_ _null_ _null_ ));
  DESCR("terminate a server process");
+ DATA(insert OID = 3115 ( pg_export_snapshot PGNSP PGUID 12 1 0 0 f f f t f v 0 0 17 "" _null_ _null_ _null_ _null_ pg_export_snapshot _null_ _null_ _null_ ));
+ DESCR("export a snapshot");
+ DATA(insert OID = 3116 ( pg_import_snapshot PGNSP PGUID 12 1 0 0 f f f t f v 1 0 16 "17" _null_ _null_ _null_ _null_ pg_import_snapshot _null_ _null_ _null_ ));
+ DESCR("import a snapshot");
  DATA(insert OID = 2172 ( pg_start_backup PGNSP PGUID 12 1 0 0 f f f t f v 2 0 25 "25 16" _null_ _null_ _null_ _null_ pg_start_backup _null_ _null_ _null_ ));
  DESCR("prepare for taking an online backup");
  DATA(insert OID = 2173 ( pg_stop_backup PGNSP PGUID 12 1 0 0 f f f t f v 0 0 25 "" _null_ _null_ _null_ _null_ pg_stop_backup _null_ _null_ _null_ ));

diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index ea030d6..a1ac8c6 100644
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
*************** extern void XidCacheRemoveRunningXids(Tr
*** 71,74 ****
--- 71,81 ----
                  int nxids, const TransactionId *xids,
                  TransactionId latestXid);
+ extern Size SyncSnapshotShmemSize(void);
+ extern void SyncSnapshotInit(void);
+
+ extern bytea *ExportSnapshot(Snapshot snapshot);
+ extern bool ImportSnapshot(bytea *snapshotData, Snapshot snapshot);
+ extern void InvalidateExportedSnapshot(void);
+
  #endif   /* PROCARRAY_H */

diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index f03647b..d0e3f9a 100644
*** a/src/include/utils/snapmgr.h
--- b/src/include/utils/snapmgr.h
*************** extern void AtSubAbort_Snapshot(int leve
*** 43,46 ****
--- 43,49 ----
  extern void AtEarlyCommit_Snapshot(void);
  extern void AtEOXact_Snapshot(bool isCommit);
+
+ extern Datum pg_export_snapshot(PG_FUNCTION_ARGS);
+ extern Datum pg_import_snapshot(PG_FUNCTION_ARGS);
+
  #endif   /* SNAPMGR_H */
Attachments:
crash-sync-snap.pl (application/x-perl)
Noah, thank you for this excellent review. I have adopted most
(almost all) of your suggestions (except for the documentation, which
is still missing). Please check the updated & attached patch for
details.
On Fri, Jan 14, 2011 at 10:13 PM, Noah Misch <noah@leadboat.com> wrote:
"00000000000000000000" is a valid md5 message digest. To avoid always accepting
a snapshot with that digest, we would need to separately store a flag indicating
whether a given syncSnapshotChksums member is currently valid. Maybe that hole
is acceptable, though.
In the end I decided to store md5 checksums as printable characters in
shmem. We now need a few more bytes for each checksum but as soon as a
byte is NUL we know that it is not a valid checksum.
+
+    if (!XactReadOnly)
+        elog(WARNING, "A snapshot exporting function should be readonly.");

There are legitimate use cases for copying the snapshot of a read-write
transaction. Consider a task like calculating some summaries for insert into a
table. To speed this up, you might notionally partition the source data and
have each of several workers calculate each partition. I'd vote for removing
this warning entirely.
Warning removed, adding this fact to the documentation only is fine with me.
InvalidBackendId is -1. Is InvalidOid (0) in fact also invalid as a backend ID?
Yes, there is a check in utils/init/postinit.c that makes sure that 0
is never a valid backendId. But anyway, I have now made this more
explicit by adding a separate parse routine for ints.
+    while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
+    {
+        if (xid == GetTopTransactionIdIfAny())
+            continue;
+        snapshot->xip[i++] = xid;

This works on a virgin GetSnapshotData() snapshot, which always has enough space
(sized according to max_connections). A snapshot from CopySnapshot(), however,
has just enough space for the original usage. I get a SIGSEGV at this line with
the attached test script. If I change things around so xip starts non-NULL, I
get messages like these as the code scribbles past the end of the allocation:
This has been fixed. xip/subxip are now allocated separately if necessary.
Though unlikely, the other backend may not even exist by now. Indeed, that
could have happened and RecentGlobalXmin advanced since we compared the
checksum. Not sure what the right locking is here, but something is needed.
Good catch. What I have done now is a second check at the end of
ImportSnapshot(). At that point we have set up the new snapshot and
adjusted MyProc->xmin. If we succeed, then we are fine. If not we
throw an error and invalidate the whole transaction.
+     * Instead we must check to not go forward (we might have opened a cursor
+     * in this transaction and still have its snapshot registered)
+     */

I'm thinking this case should yield an ERROR. Otherwise, our net result would
be to silently adopt a snapshot that is neither our old self nor the target.
Maybe there's a use case for that, but none comes to my mind.
This can happen when you do:
DECLARE foo CURSOR FOR SELECT * FROM bar;
import snapshot...
FETCH 1 FROM foo;
I guess the use case for pg_import_snapshot from READ COMMITTED would be
something like "DO $$BEGIN PERFORM pg_import_snapshot('...'); stuff; END$$;".
First off, would that work ("stuff" use the imported snapshot)? If it does
work, we should take the precedent of SET LOCAL and issue no warning. If it
doesn't work, then we should document that the snapshot only takes effect until
the next command and probably keep this warning or something similar.
No, this will also give you a new snapshot for every query, no matter
if it is executed within or outside of a DO-Block.
+ Datum
+ pg_import_snapshot(PG_FUNCTION_ARGS)
+ {
+     bytea  *snapshotData = PG_GETARG_BYTEA_P(0);
+     bool    ret = true;
+
+     if (ActiveSnapshotSet())
+         ret = ImportSnapshot(snapshotData, GetActiveSnapshot());
+
+     ret &= ImportSnapshot(snapshotData, GetTransactionSnapshot());

Is it valid to scribble directly on snapshots like this?
I figured that previously executed code still has references pointing
to the snapshots and so we cannot just push a new snapshot on top but
really need to change the memory where they are pointing to.
I am also adding a test script that shows the difference of READ
COMMITTED and SERIALIZABLE in an importing transaction in a DO-Block.
It's based on the script you sent.
So thanks again and I'm looking forward to your next review... :-)
Joachim
On Wed, Jan 19, 2011 at 12:12:39AM -0500, Joachim Wieland wrote:
Noah, thank you for this excellent review. I have adopted most
(almost all) of your suggestions (except for the documentation, which
is still missing). Please check the updated & attached patch for
details.
Great. I do get an assertion failure with this sequence:
BEGIN; SELECT pg_export_snapshot(); ROLLBACK;
SELECT 1;
TRAP: FailedAssertion("!(RegisteredSnapshots > 0)", File: "snapmgr.c", Line: 430)
Perhaps some cleanup is missing from a ROLLBACK path?
On Fri, Jan 14, 2011 at 10:13 PM, Noah Misch <noah@leadboat.com> wrote:
"00000000000000000000" is a valid md5 message digest. To avoid always accepting
a snapshot with that digest, we would need to separately store a flag indicating
whether a given syncSnapshotChksums member is currently valid. Maybe that hole
is acceptable, though.

In the end I decided to store md5 checksums as printable characters in
shmem. We now need a few more bytes for each checksum but as soon as a
byte is NUL we know that it is not a valid checksum.
Just to clarify, I was envisioning something like:
typedef struct { bool valid; char value[16]; } snapshotChksum;
I don't object to the way you've done it, but someone else might not like the
extra marshalling between text and binary. Your call.
+     * Instead we must check to not go forward (we might have opened a cursor
+     * in this transaction and still have its snapshot registered)
+     */

I'm thinking this case should yield an ERROR. Otherwise, our net result would
be to silently adopt a snapshot that is neither our old self nor the target.
Maybe there's a use case for that, but none comes to my mind.

This can happen when you do:
DECLARE foo CURSOR FOR SELECT * FROM bar;
import snapshot...
FETCH 1 FROM foo;
I see now. You set snapshot->xmin unconditionally, but never move MyProc->xmin
forward, only backward. Yes; that seems just right.
I guess the use case for pg_import_snapshot from READ COMMITTED would be
something like "DO $$BEGIN PERFORM pg_import_snapshot('...'); stuff; END$$;".
First off, would that work ("stuff" use the imported snapshot)? If it does
work, we should take the precedent of SET LOCAL and issue no warning. If it
doesn't work, then we should document that the snapshot only takes effect until
the next command and probably keep this warning or something similar.

No, this will also give you a new snapshot for every query, no matter
if it is executed within or outside of a DO-Block.
You're right. Then consider "VALUES (pg_import_snapshot('...'), (SELECT
count(*) from t))" at READ COMMITTED. It works roughly as I'd guess; the
subquery uses the imported snapshot. If I flip the two expressions and do
"VALUES ((SELECT count(*) from t), pg_import_snapshot('...'))", the subquery
uses the normal snapshot. That makes sense, but I can't really see a use case
for it. A warning, like your code has today, seems appropriate.
Is it valid to scribble directly on snapshots like this?
I figured that previously executed code still has references pointing
to the snapshots and so we cannot just push a new snapshot on top but
really need to change the memory where they are pointing to.
Okay. Had no special reason to believe otherwise, just didn't know.
I am also adding a test script that shows the difference of READ
COMMITTED and SERIALIZABLE in an importing transaction in a DO-Block.
It's based on the script you sent.
Thanks; that was handy. One thing I noticed is that the second "SELECT * FROM
kidseen" yields zero rows instead of yielding "ERROR: relation "kidseen" does
not exist". I changed things around per the attached "test-drop.pl", such that
the table is already gone in the ordinary snapshot, but still visible to the
imported snapshot. Note how the pg_class row is visible, but an actual attempt
to query the table fails. Must some kind of syscache invalidation follow the
snapshot alteration to make this behave normally?
General tests involving only DML seem to do the right thing. Subtransaction
handling looks sound. Patch runs cleanly according to Valgrind.
So thanks again and I'm looking forward to your next review... :-)
Hope this one helps, too.
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index 8b36df4..29c9426 100644
+ bytea *
+ ExportSnapshot(Snapshot snapshot)
+ {
+     int     bufsize = 1024;
+     int     bufsize_filled = 0;    /* doesn't include NUL byte */
+     int     bufsize_left = bufsize - bufsize_filled;
+     char   *buf = (char *) palloc(bufsize);
+     int     i;
+     TransactionId *children;
+     int     nchildren;
+
+     /* In a subtransaction we don't see our open subxip values in the snapshot
+      * so they would be missing in the backend applying it. */
+     /* XXX is ERRCODE_INVALID_TRANSACTION_STATE more appropriate or less?
+      * It's a valid transaction state but invalid for the requested operation. */
+     if (GetCurrentTransactionNestLevel() != 1)
+         ereport(ERROR,
+                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                  errmsg("can only export a snapshot from a top level transaction")));
When I'm not sure about questions like this (as in, 99.8% of the time), I
usually grep one of the .po files for messages that seem similar, then see what
they used. In this case, the most-similar is "SET TRANSACTION ISOLATION LEVEL
must not be called in a subtransaction" in assign_XactIsoLevel(), which uses
ERRCODE_ACTIVE_SQL_TRANSACTION. I'd also consider making the message similar to
that one, like "cannot export a snapshot from a subtransaction".
+ static int
+ parseIntFromText(char **s, const char *prefix, int notfound)
+ {
+     char   *n,
+            *p = strstr(*s, prefix);
+     int     i;
+
+     if (!p)
+         return notfound;
+     p += strlen(prefix);
+     n = strchr(p, ' ');
+     if (!n)
+         return notfound;
+     *n = '\0';
+     i = DatumGetObjectId(DirectFunctionCall1(int4in, CStringGetDatum(p)));
DatumGetInt32
+ bool
+ ImportSnapshot(bytea *snapshotData, Snapshot snapshot)
+ {
+     snapshotChksum cksum;
+     Oid     databaseId;
+     BackendId backendId;
+     int     i;
+     TransactionId xid;
+     int     len = VARSIZE_ANY_EXHDR(snapshotData);
+     char   *s = palloc(len + 1);
+     MemoryContext oldctx;
+
+     /* XXX is ERRCODE_INVALID_TRANSACTION_STATE more appropriate or less?
+      * It's a valid transaction state but invalid for the requested operation. */
+     if (GetCurrentTransactionNestLevel() != 1)
+         ereport(ERROR,
+                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                  errmsg("can only import a snapshot to a top level transaction")));
See discussion of similar code above.
+
+     if (len == 0)
+         ereport(ERROR,
+                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                  errmsg("invalid data in snapshot information")));
+
+     strncpy(s, VARDATA_ANY(snapshotData), len);
+     s[len] = '\0';
+
+     pg_md5_hash(s, len, (void *) &cksum);
+
+     databaseId = parseOidFromText(&s, "did:");
+     backendId = (BackendId) parseIntFromText(&s, "bid:", (int) InvalidBackendId);
+
+     if (databaseId != MyDatabaseId
+         || backendId == InvalidBackendId
+         /*
+          * Make sure backendId is in a reasonable range before using it as
+          * an array subscript.
+          */
+         || backendId >= PROCARRAY_MAXPROCS
+         || backendId < 0
+         || backendId == MyBackendId
+         /*
+          * Lock considerations:
+          *
+          * syncSnapshotChksums[backendId] is only changed by the backend
+          * with ID backendId and read by another backend that is asked to
+          * import a snapshot.
+          *
+          * syncSnapshotChksums[backendId] contains either NUL bytes or the
+          * checksum of a valid snapshot.
+          *
+          * Every comparison to the checksum while it is being written or
+          * deleted is okay to fail, so we don't need to take a lock at all.
+          */
+         || strcmp(cksum, syncSnapshotChksums[backendId]) != 0)
+     {
+         ereport(ERROR,
+                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                  errmsg("invalid data in snapshot information")));
+     }
That error message is good for the second, third, and fourth conditions: those
only happen when we're given something not generated by pg_export_snapshot().
The first and fifth tests deserve different error messages (perhaps "snapshot is
from a different database"; "cannot export and import a snapshot in the same
connection"). The sixth test is also a bit different; failure arises there when
the input is bogus or when the exporting transaction has ended. Failing to hold
that transaction open until all child transactions have imported might be a
common early mistake when using this feature. So, what about "snapshot not
found exported by any running transaction" for the sixth test?
+
+     Assert(databaseId != InvalidOid);
+
+     oldctx = MemoryContextSwitchTo(TopTransactionContext);
How about "MemoryContextAlloc(TopTransactionContext, ..." in place of the two
palloc calls and dropping this? Not sure.
+
+     snapshot->xmin = parseXactFromText(&s, "xmi:");
+     Assert(snapshot->xmin != InvalidTransactionId);
+     snapshot->xmax = parseXactFromText(&s, "xma:");
+     Assert(snapshot->xmax != InvalidTransactionId);
+
+     if (snapshot->copied)
+     {
+         int     xcnt = parseIntFromText(&s, "xcnt:", 0);
+
+         /*
+          * Unfortunately CopySnapshot allocates just one large chunk of
+          * memory, and makes snapshot->xip then point past the end of the
+          * fixed-size snapshot data. That way we cannot just
+          * repalloc(snapshot->xip, ...). Neither can we just change the
+          * base address of the snapshot because this address might still be
+          * saved somewhere.
+          */
+         if (snapshot->xcnt < xcnt)
+             snapshot->xip = palloc(xcnt * sizeof(TransactionId));
+     }
+
+     i = 0;
+     while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
+     {
+         if (xid == GetTopTransactionIdIfAny())
+             continue;
+         snapshot->xip[i++] = xid;
+     }
+     snapshot->xcnt = i;
+
+     /*
+      * We only write "sof:1" if the snapshot overflowed. If not, then there
+      * is no "sof:x" entry at all and parseBoolFromText() will just return
+      * false.
+      */
+     snapshot->suboverflowed = parseBoolFromText(&s, "sof:");
+
+     if (!snapshot->suboverflowed)
+     {
+         if (snapshot->copied)
+         {
+             int     sxcnt = parseIntFromText(&s, "sxcnt:", 0);
+
+             if (snapshot->subxcnt < sxcnt)
+                 snapshot->subxip = palloc(sxcnt * sizeof(TransactionId));
+         }
+
+         i = 0;
+         while ((xid = parseXactFromText(&s, "sxp:")) != InvalidTransactionId)
+             snapshot->subxip[i++] = xid;
+         snapshot->subxcnt = i;
+     }
+
+     MemoryContextSwitchTo(oldctx);
+
+     /* Leave snapshot->curcid as is. */
+
+     /*
+      * No ProcArrayLock held here, we assume that a write is atomic. Also
+      * note that MyProc->xmin can go backwards here. However this is safe
+      * because the xmin we set here is the same as in the backend's
+      * proc->xmin whose snapshot we are copying. At this very moment,
+      * anybody computing a minimum will calculate at least this xmin as the
+      * overall xmin with or without us setting MyProc->xmin to this value.
+      *
+      * Instead we must check to not go forward (we might have opened a
+      * cursor in this transaction and still have its snapshot registered)
+      */
+     if (TransactionIdPrecedes(snapshot->xmin, MyProc->xmin))
+         MyProc->xmin = snapshot->xmin;
The write is atomic, in the sense that this process cannot be interrupted with
the update in progress. You'd still need a memory barrier to make other
processes see the change immediately on all architectures we support. I'd
suggest taking ProcArrayLock and adding a comment to the effect that we could
make it lockless, should we ever adopt such methods. I don't advise testing
those waters as part of this patch.
+
+     /*
+      * Check the checksum again to prevent a race condition. If the
+      * exporting backend invalidated its snapshot since we last checked or
+      * before we finish with this second check, then we fail here and error
+      * out, thus invalidating the snapshot we've built up.
+      *
+      * We could succeed here even though the exporting transaction is at
+      * the same time invalidating the checksum while we are checking it
+      * here for the second time. But even then we are good, because we have
+      * set up our MyProc record already.
+      */
+     if (strcmp(cksum, syncSnapshotChksums[backendId]) != 0)
+         ereport(ERROR,
+                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                  errmsg("invalid data in snapshot information")));
I was trying to decide whether this is vulnerable to the ABA problem, and I
believe it is not. It is possible that the original exporting transaction
completed and a new transaction exported the same snapshot, all between our
first checksum comparison and now. However, this would also mean that no new
permanent transactions IDs got allocated in the gap. So I think we're good.
This message should continue to mirror the one from the earlier test, of course.
+
+     return true;
+ }
+

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 45b92a0..f74ea76 100644
+ Datum
+ pg_import_snapshot(PG_FUNCTION_ARGS)
+ {
+     bytea  *snapshotData = PG_GETARG_BYTEA_P(0);
+     bool    ret = true;
+
+     if (ActiveSnapshotSet())
+         ret = ImportSnapshot(snapshotData, GetActiveSnapshot());
+
+     ret &= ImportSnapshot(snapshotData, GetTransactionSnapshot());
+
+     /*
+      * If we are in read committed mode then the next query will execute
+      * with a new snapshot making this function call quite useless.
+      */
+     if (!IsolationUsesXactSnapshot())
+         ereport(WARNING,
+                 (errcode(ERRCODE_INAPPROPRIATE_ISOLATION_LEVEL_FOR_BRANCH_TRANSACTION),
+                  errmsg("a snapshot importing transaction should be ISOLATION LEVEL SERIALIZABLE"),
+                  errhint("A transaction of isolation level READ COMMITTED gives you a new snapshot for each query.")));
Again, REPEATABLE READ is a perfectly good isolation level for this use. The
test allows it, but the error message does not.
nm
Attachments:
test-drop.pl (application/x-perl)
Here is a new version of the patch incorporating most of Noah's
suggestions. The patch now also adds documentation. Since I couldn't
really find a suitable section to document the two new functions, I
added a new one for now. Any better ideas where it should go?
On Thu, Jan 20, 2011 at 1:37 AM, Noah Misch <noah@leadboat.com> wrote:
Just to clarify, I was envisioning something like:
typedef struct { bool valid; char value[16]; } snapshotChksum;
I don't object to the way you've done it, but someone else might not like the
extra marshalling between text and binary. Your call.
I didn't like it in the beginning but later on I adopted your
proposal. I have also changed the locking to be more natural. Even
though we don't really need it, I am now grabbing shared ProcArrayLock
for any reads of shared memory and exclusive lock for writes. Of
course no additional lock is taken if the feature is not used.
You're right. Then consider "VALUES (pg_import_snapshot('...'), (SELECT
count(*) from t))" at READ COMMITTED. It works roughly as I'd guess; the
subquery uses the imported snapshot. If I flip the two expressions and do
"VALUES ((SELECT count(*) from t), pg_import_snapshot('...'))", the subquery
uses the normal snapshot. That makes sense, but I can't really see a use case
for it. A warning, like your code has today, seems appropriate.
Yeah, that would do what you wanted to illustrate, but it truly cannot
be considered the standard use case :-)
Is it valid to scribble directly on snapshots like this?
I figured that previously executed code still has references pointing
to the snapshots and so we cannot just push a new snapshot on top but
really need to change the memory where they are pointing to.

Okay. Had no special reason to believe otherwise, just didn't know.
This is one part where I'd like someone more experienced than me look into it.
Thanks; that was handy. One thing I noticed is that the second "SELECT * FROM
kidseen" yields zero rows instead of yielding "ERROR: relation "kidseen" does
not exist". I changed things around per the attached "test-drop.pl", such that
the table is already gone in the ordinary snapshot, but still visible to the
imported snapshot. Note how the pg_class row is visible, but an actual attempt
to query the table fails. Must some kind of syscache invalidation follow the
snapshot alteration to make this behave normally?
As discussed with Noah off-list, this is just not an MVCC-safe
operation. You could hit on this in a regular SERIALIZABLE transaction
as well: Somebody else can delete a table and you won't be able to
access it anymore. This is also the reason why pg_dump in the
beginning gets a shared lock on every table that it will dump later
on.
Thanks for the review Noah,
Joachim
Attachments:
syncsnapshots.3.diff (text/x-patch, +545 −0)
On Sun, Jan 30, 2011 at 12:36:12PM -0500, Joachim Wieland wrote:
Here is a new version of the patch incorporating most of Noah's
suggestions. The patch now also adds documentation. Since I couldn't
really find a suitable section to document the two new functions, I
added a new one for now. Any better ideas where it should go?
No major opinion here. The second best fit after a new section would be System
Administration Functions, inasmuch as that's the bucket for miscellaneous
functions. I have an idea for improving structure here, but that would go
beyond the purview of this patch.
This version looks sound; I've marked it Ready for Committer.
Thanks,
nm
Looking into this patch.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
I have two questions:
1. why are you using the expansible char array stuff instead of using
the StringInfo facility?
2. is md5 the most appropriate digest for this? If you need a
cryptographically secure hash, do we need something stronger? If not,
why not just use hash_any?
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, Feb 18, 2011 at 8:57 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
1. why are you using the expansible char array stuff instead of using
the StringInfo facility?

2. is md5 the most appropriate digest for this? If you need a
cryptographically secure hash, do we need something stronger? If not,
why not just use hash_any?
We don't need a cryptographically secure hash.
There is no special reason for why it is like it is, I just didn't
think of the better alternatives that you are proposing.
Should I send an updated patch? Anything else?
Thanks for the review,
Joachim
Excerpts from Joachim Wieland's message of Fri Feb 18 21:41:51 -0300 2011:
On Fri, Feb 18, 2011 at 8:57 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

1. why are you using the expansible char array stuff instead of using
the StringInfo facility?

2. is md5 the most appropriate digest for this? If you need a
cryptographically secure hash, do we need something stronger? If not,
why not just use hash_any?

We don't need a cryptographically secure hash.
Ok.
Should I send an updated patch? Anything else?
No need, I'm halfway through already.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support