Snapshot synchronization, again...
The snapshot synchronization discussion from the parallel pg_dump
patch somehow ended without a clear way to go forward.
Let me sum up what has been brought up and propose a short- and
long-term solution.
Summary:
Passing snapshot sync information can be done either:
a) by returning complete snapshot information from the backend to the
client so that the client can pass it along to a different backend
b) or by returning only a unique identifier to the client and storing
the actual snapshot data somewhere on the server side
Advantage of a: no memory is used in the backend and no memory needs
to be cleaned up. It is also theoretically possible that we could
forward that data to a hot standby server and do e.g. a dump partially
on the master server and partially on the hot standby server, or spread
it among several hot standby servers.
Disadvantage of a: The snapshot must be validated to make sure that
its information is still current, and it might be difficult to cover
all cases in this validation. Also, a client can access not only
exactly a published snapshot but just about any snapshot that fits and
passes the validation checks (this is more a disadvantage than an
advantage because it allows a client to see a database state that never
existed in reality).
Advantage of b: No validation necessary; as soon as the transaction
that publishes the snapshot loses that snapshot, it will also revoke
the snapshot information (either by removing a temp file or by deleting
it from shared memory).
Disadvantage of b: It doesn't allow a snapshot to be installed on a
different server. It requires a serializable open transaction to hold
the snapshot.
What I am proposing now is the following:
We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.
This only costs us a few bytes for the checksum times max_connections in
shared memory, and apart from resetting the checksum it has no
cleanup or verification issues. Note that we are also free to change
the internal format of the chunk of data we return whenever we like,
so we can enhance this feature in the future, transparently to
the client.
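To make that concrete, here is a minimal sketch of the bookkeeping (names are
illustrative; the patch later in this thread is the real implementation):

    /* shared memory: one digest slot per backend */
    typedef char snapshotChksum[16];
    static snapshotChksum *syncSnapshotChksums;  /* PROCARRAY_MAXPROCS slots */

    /* publisher: buf/buflen hold the serialized snapshot that is returned
     * to the client; remember its digest in our slot */
    pg_md5_binary(buf, buflen, syncSnapshotChksums[MyBackendId]);

    /* importer: recompute the digest over the data the client handed back
     * and compare it with the publisher's slot */
    snapshotChksum cksum;

    pg_md5_binary(data, datalen, &cksum);
    if (memcmp(&cksum, syncSnapshotChksums[backendId], sizeof(cksum)) != 0)
        ereport(ERROR,
                (errmsg("snapshot has been revoked or modified")));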
Thoughts?
Joachim
Excerpts from Joachim Wieland's message of Thu Dec 30 09:31:47 -0300 2010:
Advantage of b: No validation necessary; as soon as the transaction
that publishes the snapshot loses that snapshot, it will also revoke
the snapshot information (either by removing a temp file or by deleting
it from shared memory).
Disadvantage of b: It doesn't allow a snapshot to be installed on a
different server. It requires a serializable open transaction to hold
the snapshot.
Why does it require a serializable transaction? You could simply
register the snapshot in any transaction. (Of course, the net effect
would be pretty similar to a serializable transaction).
We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.
This only costs us a few bytes for the checksum times max_connections in
shared memory, and apart from resetting the checksum it has no
cleanup or verification issues.
So one registered snapshot per transaction? Sounds a reasonable
limitation (I doubt there's a use case for more than that, anyway).
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Dec30, 2010, at 13:31 , Joachim Wieland wrote:
We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.
We'd still have to stream these checksums to the standbys though,
or would they be exempt from the checksum checks?
I still wonder whether these checks are worth the complexity. I
believe we'd only allow snapshot modifications for read-only queries
anyway, so what point is there in preventing clients from setting
broken snapshots?
best regards,
Florian Pflug
On 30.12.2010 16:49, Florian Pflug wrote:
On Dec30, 2010, at 13:31 , Joachim Wieland wrote:
We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.
We'd still have to stream these checksums to the standbys though,
or would they be exempt from the checksum checks?
I still wonder whether these checks are worth the complexity. I
believe we'd only allow snapshot modifications for read-only queries
anyway, so what point is there in preventing clients from setting
broken snapshots?
Hmm, our definition of "read-only" is a bit fuzzy. While a transaction
doesn't modify the database itself, it could still send NOTIFYs or call
a PL function to do all sorts of things outside the database. Imagine
that you're paranoid about data integrity, and have a security definer
function that runs cross-checks on the data. If it finds any
anomalies, it wakes up the operator or forces a shutdown or similar.
Now a malicious user could set a snapshot that passes the basic validity
checks, i.e. xmin >= GlobalXmin, but contains a combination of still
in-progress transactions that never existed in reality. If he then calls the
paranoia function, it would see an inconsistent state of committed
tuples and get upset.
Maybe that's a bit far-fetched, but it's not entirely clear that
running with an inconsistent snapshot is harmless.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, Dec 30, 2010 at 9:40 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Disadvantage of b: It doesn't allow a snapshot to be installed on a
different server. It requires a serializable open transaction to hold
the snapshot.
Why does it require a serializable transaction? You could simply
register the snapshot in any transaction. (Of course, the net effect
would be pretty similar to a serializable transaction).
I am not assuming that the publishing transaction blocks until its
snapshot is picked up. A read committed transaction would get a
new snapshot for every other query, so the published snapshot would no
longer be represented by an actual backend until it is picked up by
one. Since nobody is holding back xmin/GlobalXmin, eventually vacuum
would remove tuples that the published-but-not-yet-picked-up snapshot
should still be able to see, no?
Joachim
On Thu, Dec 30, 2010 at 9:49 AM, Florian Pflug <fgp@phlo.org> wrote:
On Dec30, 2010, at 13:31 , Joachim Wieland wrote:
We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.
We'd still have to stream these checksums to the standbys though,
or would they be exempt from the checksum checks?
I am not talking about having synchronized snapshots among standby
servers at all.
I am only proposing a client API that will work for this future idea as well.
I still wonder whether these checks are worth the complexity. I
believe we'd only allow snapshot modifications for read-only queries
anyway, so what point is there in preventing clients from setting
broken snapshots?
What's the use case for it? As long as nobody comes up with a
reasonable use case, let's aim for the robust version.
Joachim
On 12/30/2010 10:45 PM, Heikki Linnakangas wrote:
On 30.12.2010 16:49, Florian Pflug wrote:
On Dec30, 2010, at 13:31 , Joachim Wieland wrote:
We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.
We'd still have to stream these checksums to the standbys though,
or would they be exempt from the checksum checks?
I still wonder whether these checks are worth the complexity. I
believe we'd only allow snapshot modifications for read-only queries
anyway, so what point is there in preventing clients from setting
broken snapshots?
Hmm, our definition of "read-only" is a bit fuzzy. While a transaction
doesn't modify the database itself, it could still send NOTIFYs or call
a PL function to do all sorts of things outside the database. Imagine
that you're paranoid about data integrity, and have a security definer
function that runs cross-checks on the data. If it finds any
anomalies, it wakes up the operator or forces a shutdown or similar.
are people actually doing that in reality? I'm also having a hard time
picturing a realistic example of what that "data integrity check"
function would actually be able to check with default isolation mode
and concurrent activity...
Now a malicious user could set a snapshot that passes the basic validity
checks, i.e. xmin >= GlobalXmin, but contains a combination of still
in-progress transactions that never existed in reality. If he then calls the
paranoia function, it would see an inconsistent state of committed
tuples and get upset.
sure, but I would expect setting a snapshot to require either
superuser or some sort of "WITH SNAPSHOT" grant thingy - and in general
there are much more trivial and less obscure ways for a "normal"
user to cause the admin to get paged :)
Maybe that's a bit far-fetched, but it's not entirely clear that
running with an inconsistent snapshot is harmless.
well, there has been some discussion around the SSI stuff suggesting we
might want to reconsider our definition of "read-only"; maybe that would
be the right way to approach the problem?
Stefan
Excerpts from Joachim Wieland's message of Fri Dec 31 00:15:57 -0300 2010:
On Thu, Dec 30, 2010 at 9:40 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Disadvantage of b: It doesn't allow a snapshot to be installed on a
different server. It requires a serializable open transaction to hold
the snapshot.
Why does it require a serializable transaction? You could simply
register the snapshot in any transaction. (Of course, the net effect
would be pretty similar to a serializable transaction).
I am not assuming that the publishing transaction blocks until its
snapshot is picked up. A read committed transaction would get a
new snapshot for every other query, so the published snapshot would no
longer be represented by an actual backend until it is picked up by
one. Since nobody is holding back xmin/GlobalXmin, eventually vacuum
would remove tuples that the published-but-not-yet-picked-up snapshot
should still be able to see, no?
A backend can have any number of snapshots registered, and those don't
allow GlobalXmin to advance. Consider an open cursor, for example.
Even if the rest of the transaction is read committed, the snapshot
registered by the open cursor still holds back GlobalXmin. My
(handwavy) idea is that whenever the transaction calls
pg_publish_snapshot(), said snapshot is registered, which makes it safe
to use even if the transaction continues to operate and obtain newer
snapshots.
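For reference, a minimal sketch of Alvaro's idea: a (hypothetical, at this
point) pg_publish_snapshot() would lean on the existing registration
machinery, which is also what pins the snapshot of an open cursor:

    /* Registering with the top-level resource owner pins the snapshot's
     * xmin, even in READ COMMITTED, until it is explicitly unregistered. */
    Snapshot snap = RegisterSnapshotOnOwner(GetTransactionSnapshot(),
                                            TopTransactionResourceOwner);

    /* ... GlobalXmin cannot advance past snap->xmin here ... */

    UnregisterSnapshotFromOwner(snap, TopTransactionResourceOwner);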
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, Dec 31, 2010 at 8:28 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
A backend can have any number of snapshots registered, and those don't
allow GlobalXmin to advance. Consider an open cursor, for example.
Even if the rest of the transaction is read committed, the snapshot
registered by the open cursor still holds back GlobalXmin. My
(handwavy) idea is that whenever the transaction calls
pg_publish_snapshot(), said snapshot is registered, which makes it safe
to use even if the transaction continues to operate and obtain newer
snapshots.
Cool, even better that this is taken care of already :-)
Thanks,
Joachim
On Thu, Dec 30, 2010 at 7:31 AM, Joachim Wieland <joe@mcknight.de> wrote:
What I am proposing now is the following:
We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.
So here's the patch implementing this idea.
I named the user visible functions pg_export_snapshot() and
pg_import_snapshot().
A user starts a transaction and calls pg_export_snapshot() to get a
chunk of bytea data. In a different connection he opens a transaction in
isolation level serializable and passes that chunk of data into
pg_import_snapshot(). Subsequent queries of the second connection then see the
snapshot that was current when pg_export_snapshot() was executed. If
both transactions are in isolation level serializable, then both see the same
data from this moment on (this is the pg_dump case).
There are most probably a few loose ends, and someone more familiar with
this area than me needs to look into it, but at least everybody should be happy
now with the overall approach.
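For illustration, a minimal libpq sketch of the intended two-connection flow
(connection strings and error handling are placeholders, not part of the
patch):

    #include <stdio.h>
    #include <string.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn     *master = PQconnectdb("dbname=test");
        PGconn     *worker = PQconnectdb("dbname=test");
        PGresult   *exported;
        PGresult   *imported;
        const char *snapData;

        /* Exporting side: open a serializable transaction and keep it open. */
        PQclear(PQexec(master,
            "BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY"));
        exported = PQexec(master, "SELECT pg_export_snapshot()");
        snapData = PQgetvalue(exported, 0, 0);

        /* Importing side: must also be serializable for a stable view. */
        PQclear(PQexec(worker,
            "BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE"));
        imported = PQexecParams(worker, "SELECT pg_import_snapshot($1)",
                                1, NULL, &snapData, NULL, NULL, 0);
        if (strcmp(PQgetvalue(imported, 0, 0), "t") != 0)
            fprintf(stderr, "snapshot import failed\n");

        /* Both transactions now see the database as of pg_export_snapshot(). */
        PQclear(imported);
        PQclear(exported);
        PQfinish(master);
        PQfinish(worker);
        return 0;
    }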
These are the implementation details and restrictions of the patch:
The exporting transaction
- should be read-only (to discourage people from expecting that writes of
the exporting transaction can be seen by the importing transaction)
- must not be a subtransaction (we don't add subxips of our own transaction
to the snapshot, so importing the snapshot later would result in missing
subxips)
- adds its own xid (if any) to the xip-array
The importing transaction
- will not import a snapshot of the same backend (even though it would
probably work)
- will not import a snapshot of a different database in the cluster
- should be isolation level serializable
- must not be a subtransaction (can we completely rollback on
subxact-rollback?)
- leaves curcid as is, otherwise effects of previous commands would get lost
and magically appear later when curcid increases
- applies xmin, xmax, xip, subxip values of the exported snapshot to
GetActiveSnapshot() and GetTransactionSnapshot()
- takes itself out of the xip array
- updates MyProc->xmin, but only ever moves it backwards, not forwards
The snapshot is invalidated on transaction commit/rollback as well as when doing
"prepare transaction".
Comments welcome.
Joachim
Attachments:
pg_import_export_snapshot.diff (text/x-patch)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 95beba8..c24150f 100644
*** a/src/backend/storage/ipc/ipci.c
--- b/src/backend/storage/ipc/ipci.c
*************** CreateSharedMemoryAndSemaphores(bool mak
*** 124,129 ****
--- 124,130 ----
size = add_size(size, BTreeShmemSize());
size = add_size(size, SyncScanShmemSize());
size = add_size(size, AsyncShmemSize());
+ size = add_size(size, SyncSnapshotShmemSize());
#ifdef EXEC_BACKEND
size = add_size(size, ShmemBackendArraySize());
#endif
*************** CreateSharedMemoryAndSemaphores(bool mak
*** 228,233 ****
--- 229,235 ----
BTreeShmemInit();
SyncScanShmemInit();
AsyncShmemInit();
+ SyncSnapshotInit();
#ifdef EXEC_BACKEND
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 980996e..e851bcd 100644
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 50,60 ****
--- 50,62 ----
#include "access/transam.h"
#include "access/xact.h"
#include "access/twophase.h"
+ #include "libpq/md5.h"
#include "miscadmin.h"
#include "storage/procarray.h"
#include "storage/spin.h"
#include "storage/standby.h"
#include "utils/builtins.h"
+ #include "utils/bytea.h"
#include "utils/snapmgr.h"
*************** static int KnownAssignedXidsGetAndSetXmi
*** 159,164 ****
--- 161,170 ----
static TransactionId KnownAssignedXidsGetOldestXmin(void);
static void KnownAssignedXidsDisplay(int trace_level);
+ typedef char snapshotChksum[16];
+ static snapshotChksum *syncSnapshotChksums;
+ static Snapshot exportedSnapshot;
+
/*
* Report shared-memory space needed by CreateSharedProcArray.
*/
*************** KnownAssignedXidsDisplay(int trace_level
*** 3065,3067 ****
--- 3071,3335 ----
pfree(buf.data);
}
+
+
+ /*
+ * Report space needed for our shared memory area, which is basically an
+ * md5 checksum per connection.
+ */
+ Size
+ SyncSnapshotShmemSize(void)
+ {
+ return PROCARRAY_MAXPROCS * sizeof(snapshotChksum);
+ }
+
+ void
+ SyncSnapshotInit(void)
+ {
+ Size size;
+ bool found;
+ int i;
+
+ size = SyncSnapshotShmemSize();
+
+ syncSnapshotChksums = (snapshotChksum*) ShmemInitStruct("SyncSnapshotChksums",
+ size, &found);
+ if (!found)
+ for (i = 0; i < PROCARRAY_MAXPROCS; i++)
+ memset(syncSnapshotChksums[i], 0, sizeof(snapshotChksum));
+ }
+
+ void
+ InvalidateExportedSnapshot(void)
+ {
+ if (syncSnapshotChksums[MyBackendId][0] == '\0')
+ return;
+
+ memset(syncSnapshotChksums[MyBackendId], 0, sizeof(snapshotChksum));
+
+ UnregisterSnapshotFromOwner(exportedSnapshot, TopTransactionResourceOwner);
+ exportedSnapshot = NULL;
+ }
+
+ static void
+ snapshot_appendf(char ** buffer, int *bufsize_filled, int *bufsize_left, char *fmt, ...)
+ {
+ va_list ap;
+ int ret;
+
+ if (*bufsize_left < 64)
+ {
+ /* enlarge buffer by 1024 bytes */
+ *buffer = repalloc(*buffer, *bufsize_filled + 1024);
+ *bufsize_left = *bufsize_filled + 1024;
+ }
+
+ va_start(ap, fmt);
+ ret = vsnprintf(*buffer + *bufsize_filled, *bufsize_left, fmt, ap);
+ va_end(ap);
+
+ /* There shouldn't be any error, we leave enough room for the data that we
+ * are writing (only numbers basically). */
+ Assert(ret < *bufsize_left && ret > 0);
+
+ *bufsize_left -= ret;
+ *bufsize_filled += ret;
+
+ Assert(strlen(*buffer) == *bufsize_filled);
+ }
+
+ bytea *
+ ExportSnapshot(Snapshot snapshot)
+ {
+ int bufsize = 1024;
+ int bufsize_filled = 0; /* doesn't include NUL byte */
+ int bufsize_left = bufsize - bufsize_filled;
+ char *buf = (char *) palloc(bufsize);
+ snapshotChksum cksum;
+ int i;
+
+ /* In a subtransaction we don't see our subxip values in the snapshot so
+ * they would be missing in the backend applying it. */
+ if (GetCurrentTransactionNestLevel() != 1)
+ elog(ERROR, "Can only export a snapshot from a top level transaction.");
+
+ /* Write up all the data that we return */
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "did:%d ", MyDatabaseId);
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "bid:%d ", MyBackendId);
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "xmi:%d ", snapshot->xmin);
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "xma:%d ", snapshot->xmax);
+ for (i = 0; i < snapshot->xcnt; i++)
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "xip:%d ", snapshot->xip[i]);
+ /*
+ * Finally add our own XID if we have one, since by definition we will
+ * still be running when the other transaction takes over the snapshot.
+ */
+ if (TransactionIdIsValid(GetTopTransactionIdIfAny()))
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "xip:%d ", GetTopTransactionIdIfAny());
+ if (snapshot->suboverflowed)
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "sof:1 ");
+ else
+ for (i = 0; i < snapshot->subxcnt; i++)
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "sxp:%d ", snapshot->subxip[i]);
+
+ /* Register the snapshot */
+ snapshot = RegisterSnapshotOnOwner(snapshot, TopTransactionResourceOwner);
+ exportedSnapshot = snapshot;
+
+ pg_md5_binary(buf, bufsize_filled, (void *) &cksum);
+
+ memcpy(syncSnapshotChksums[MyBackendId], &cksum, sizeof(snapshotChksum));
+
+ if (!XactReadOnly)
+ elog(WARNING, "A snapshot exporting function should be readonly.");
+
+ return DatumGetByteaP(DirectFunctionCall1(byteain, CStringGetDatum(buf)));
+ }
+
+ static Oid
+ parseOidFromText(char **s, const char *prefix)
+ {
+ char *n, *p = strstr(*s, prefix);
+ Oid oid;
+
+ if (!p)
+ return InvalidOid;
+ p += strlen(prefix);
+ n = strchr(p, ' ');
+ if (!n)
+ return InvalidOid;
+ *n = '\0';
+ oid = DatumGetObjectId(DirectFunctionCall1(oidin, CStringGetDatum(p)));
+ *s = n + 1;
+ return oid;
+ }
+
+ static bool
+ parseBoolFromText(char **s, const char *prefix)
+ {
+ /* It's safe to overload parseOid to parse 0/1. This returns false if the
+ * entry could not be found, at least as long as InvalidOid is defined to be 0. */
+ Assert(InvalidOid == (Oid) 0);
+ return (bool) parseOidFromText(s, prefix);
+ }
+
+ static TransactionId
+ parseXactFromText(char **s, const char *prefix)
+ {
+ char *n, *p = strstr(*s, prefix);
+ TransactionId xid;
+
+ if (!p)
+ return InvalidTransactionId;
+ p += strlen(prefix);
+ n = strchr(p, ' ');
+ if (!n)
+ return InvalidTransactionId;
+ *n = '\0';
+ xid = DatumGetTransactionId(DirectFunctionCall1(xidin, CStringGetDatum(p)));
+ *s = n + 1;
+ return xid;
+ }
+
+ bool
+ ImportSnapshot(bytea *snapshotData, Snapshot snapshot)
+ {
+ snapshotChksum cksum;
+ Oid databaseId;
+ Oid backendId;
+ int i;
+ TransactionId xid;
+ int len = VARSIZE_ANY_EXHDR(snapshotData);
+ char *s = palloc(len + 1);
+
+ if (GetCurrentTransactionNestLevel() != 1)
+ elog(ERROR, "Can only import a snapshot to a top level transaction.");
+
+ Assert(len > 0);
+ strncpy(s, VARDATA_ANY(snapshotData), len);
+ s[len] = '\0';
+
+ pg_md5_binary(s, strlen(s), (void *) &cksum);
+
+ if ((databaseId = parseOidFromText(&s, "did:")) == InvalidOid)
+ return false;
+ if (databaseId != MyDatabaseId)
+ return false;
+ if ((backendId = parseOidFromText(&s, "bid:")) == InvalidOid)
+ return false;
+ if (backendId == MyBackendId)
+ return false;
+
+ /*
+ * Lock considerations:
+ *
+ * syncSnapshotChksums[backendId] is only changed by the backend with ID
+ * backendID and read by another backend that is asked to import a
+ * snapshot.
+ * syncSnapshotChksums[backendId] contains either NUL bytes or the checksum
+ * of a valid snapshot.
+ * Every comparision to the checksum while it is being written or
+ * deleted is okay to fail, so we don't need to take a lock at all.
+ */
+
+ if (memcmp(&cksum, syncSnapshotChksums[backendId], sizeof(snapshotChksum)))
+ return false;
+
+ snapshot->xmin = parseXactFromText(&s, "xmi:");
+ Assert(snapshot->xmin != InvalidTransactionId);
+ snapshot->xmax = parseXactFromText(&s, "xma:");
+ Assert(snapshot->xmax != InvalidTransactionId);
+
+ i = 0;
+ while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
+ {
+ if (xid == GetTopTransactionIdIfAny())
+ continue;
+ snapshot->xip[i++] = xid;
+ }
+ snapshot->xcnt = i;
+
+ /*
+ * We only write "sof:1" if the snapshot overflowed. If not, then there is
+ * no "sof:x" entry at all and parseBoolFromText() will just return false.
+ */
+ snapshot->suboverflowed = parseBoolFromText(&s, "sof:");
+
+ i = 0;
+ while ((xid = parseXactFromText(&s, "sxp:")) != InvalidTransactionId)
+ snapshot->subxip[i++] = xid;
+ snapshot->subxcnt = i;
+
+ /* Leave snapshot->curcid as is. */
+
+ /*
+ * No ProcArrayLock held here, we assume that a write is atomic. Also note
+ * that MyProc->xmin can go backwards here. However this is safe because
+ * the xmin we set here is the same as in the backend's proc->xmin whose
+ * snapshot we are copying. At this very moment, anybody computing a
+ * minimum will calculate at least this xmin as the overall xmin with or
+ * without us setting MyProc->xmin to this value.
+ * Instead we must check to not go forward (we might have opened a cursor
+ * in this transaction and still have its snapshot registered)
+ */
+ if (TransactionIdPrecedes(snapshot->xmin, MyProc->xmin))
+ MyProc->xmin = snapshot->xmin;
+
+ /*
+ * If we are in read committed mode then the next query will execute with a
+ * new snapshot making this function call quite useless.
+ */
+ if (!IsolationUsesXactSnapshot())
+ elog(WARNING, "A snapshot importing transaction should be ISOLATION LEVEL SERIALIZABLE");
+
+ return true;
+ }
+
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 273d8bd..75315bb 100644
*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
***************
*** 29,34 ****
--- 29,35 ----
#include "access/xact.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+ #include "utils/builtins.h"
#include "utils/memutils.h"
#include "utils/memutils.h"
#include "utils/resowner.h"
*************** AtEarlyCommit_Snapshot(void)
*** 523,528 ****
--- 524,530 ----
TopTransactionResourceOwner);
registered_xact_snapshot = false;
+ InvalidateExportedSnapshot();
}
/*
*************** AtEOXact_Snapshot(bool isCommit)
*** 559,561 ****
--- 561,585 ----
FirstSnapshotSet = false;
registered_xact_snapshot = false;
}
+
+ Datum
+ pg_export_snapshot(PG_FUNCTION_ARGS)
+ {
+ bytea *snapshotData = ExportSnapshot(GetTransactionSnapshot());
+ PG_RETURN_BYTEA_P(snapshotData);
+ }
+
+ Datum
+ pg_import_snapshot(PG_FUNCTION_ARGS)
+ {
+ bytea *snapshotData = PG_GETARG_BYTEA_P(0);
+ bool ret = true;
+
+ if (ActiveSnapshotSet())
+ ret = ImportSnapshot(snapshotData, GetActiveSnapshot());
+
+ ret &= ImportSnapshot(snapshotData, GetTransactionSnapshot());
+
+ PG_RETURN_BOOL(ret);
+ }
+
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index c624243..fa06f07 100644
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DATA(insert OID = 2171 ( pg_cancel_backe
*** 3375,3380 ****
--- 3375,3384 ----
DESCR("cancel a server process' current query");
DATA(insert OID = 2096 ( pg_terminate_backend PGNSP PGUID 12 1 0 0 f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ pg_terminate_backend _null_ _null_ _null_ ));
DESCR("terminate a server process");
+ DATA(insert OID = 3115 ( pg_export_snapshot PGNSP PGUID 12 1 0 0 f f f t f v 0 0 17 "" _null_ _null_ _null_ _null_ pg_export_snapshot _null_ _null_ _null_ ));
+ DESCR("export a snapshot");
+ DATA(insert OID = 3116 ( pg_import_snapshot PGNSP PGUID 12 1 0 0 f f f t f v 1 0 16 "17" _null_ _null_ _null_ _null_ pg_import_snapshot _null_ _null_ _null_ ));
+ DESCR("import a snapshot");
DATA(insert OID = 2172 ( pg_start_backup PGNSP PGUID 12 1 0 0 f f f t f v 2 0 25 "25 16" _null_ _null_ _null_ _null_ pg_start_backup _null_ _null_ _null_ ));
DESCR("prepare for taking an online backup");
DATA(insert OID = 2173 ( pg_stop_backup PGNSP PGUID 12 1 0 0 f f f t f v 0 0 25 "" _null_ _null_ _null_ _null_ pg_stop_backup _null_ _null_ _null_ ));
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index ea030d6..a1ac8c6 100644
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
*************** extern void XidCacheRemoveRunningXids(Tr
*** 71,74 ****
--- 71,81 ----
int nxids, const TransactionId *xids,
TransactionId latestXid);
+ extern Size SyncSnapshotShmemSize(void);
+ extern void SyncSnapshotInit(void);
+
+ extern bytea *ExportSnapshot(Snapshot snapshot);
+ extern bool ImportSnapshot(bytea *snapshotData, Snapshot snapshot);
+ extern void InvalidateExportedSnapshot(void);
+
#endif /* PROCARRAY_H */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index f03647b..d0e3f9a 100644
*** a/src/include/utils/snapmgr.h
--- b/src/include/utils/snapmgr.h
*************** extern void AtSubAbort_Snapshot(int leve
*** 43,46 ****
--- 43,49 ----
extern void AtEarlyCommit_Snapshot(void);
extern void AtEOXact_Snapshot(bool isCommit);
+ extern Datum pg_export_snapshot(PG_FUNCTION_ARGS);
+ extern Datum pg_import_snapshot(PG_FUNCTION_ARGS);
+
#endif /* SNAPMGR_H */
On Jan7, 2011, at 12:41 , Joachim Wieland wrote:
On Thu, Dec 30, 2010 at 7:31 AM, Joachim Wieland <joe@mcknight.de> wrote:
These are the implementation details and restrictions of the patch:
The exporting transaction
- should be read-only (to discourage people from expecting that writes of
the exporting transaction can be seen by the importing transaction)
That seems overly strict. Wouldn't a warning about this in the docs suffice?
- must not be a subtransaction (we don't add subxips of our own transaction
to the snapshot, so importing the snapshot later would result in missing
subxips)
If that is necessary, wouldn't a sub-transaction that was sub-committed before
exporting the snapshot also pose a problem? The snapshot wouldn't include
the sub-transaction's xid, so once the exporting transaction has committed, any
importing transaction would suddenly see the sub-transaction's changes.
- adds its own xid (if any) to the xip-array
Per the previous paragraph, I believe it should also add the xids of all
non-aborted sub-transactions to the subxip-array, overflowing the array if
necessary.
The importing transaction
- will not import a snapshot of the same backend (even though it would
probably work)
- will not import a snapshot of a different database in the cluster
- should be isolation level serializable
- must not be a subtransaction (can we completely rollback on
subxact-rollback?)
Sounds fine so far.
- leaves curcid as is, otherwise effects of previous commands would get lost
and magically appear later when curcid increases
- applies xmin, xmax, xip, subxip values of the exported snapshot to
GetActiveSnapshot() and GetTransactionSnapshot()
- takes itself out of the xip array
- updates MyProc->xmin, but only ever moves it backwards, not forwards
Changing a snapshot mid-transaction like that seems risky. I'd suggest simply
decreeing that an importing transaction must be read-only and must not yet have
set its snapshot. So effectively, the only allowed calling sequence would be
begin;
set transaction isolation level serializable read only;
select pg_import_snapshot(...);
best regards,
Florian Pflug
Hello Joachim,
I'm reviewing this patch for CommitFest 2011-01.
On Fri, Jan 07, 2011 at 06:41:38AM -0500, Joachim Wieland wrote:
On Thu, Dec 30, 2010 at 7:31 AM, Joachim Wieland <joe@mcknight.de> wrote:
What I am proposing now is the following:
We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.
I like this design. It makes the backend's use of resources to track exported
snapshots very simple. It also avoids pessimizing the chances that we can, at
some point, synchronize snapshots across various standby clusters without
changing the user-facing API.
So here's the patch implementing this idea.
I named the user visible functions pg_export_snapshot() and
pg_import_snapshot().
A user starts a transaction and calls pg_export_snapshot() to get a
chunk of bytea data. In a different connection he opens a transaction in
isolation level serializable and passes that chunk of data into
pg_import_snapshot(). Subsequent queries of the second connection then see the
snapshot that was current when pg_export_snapshot() was executed. If
both transactions are in isolation level serializable, then both see the same
data from this moment on (this is the pg_dump case).
There are most probably a few loose ends, and someone more familiar with
this area than me needs to look into it, but at least everybody should be happy
now with the overall approach.
These are the implementation details and restrictions of the patch:
The exporting transaction
- should be read-only (to discourage people from expecting that writes of
the exporting transaction can be seen by the importing transaction)
- must not be a subtransaction (we don't add subxips of our own transaction
to the snapshot, so importing the snapshot later would result in missing
subxips)
- adds its own xid (if any) to the xip-array
The importing transaction
- will not import a snapshot of the same backend (even though it would
probably work)
- will not import a snapshot of a different database in the cluster
- should be isolation level serializable
- must not be a subtransaction (can we completely rollback on
subxact-rollback?)
I don't know, but this restriction does not seem onerous.
- leaves curcid as is, otherwise effects of previous commands would get lost
and magically appear later when curcid increases
- applies xmin, xmax, xip, subxip values of the exported snapshot to
GetActiveSnapshot() and GetTransactionSnapshot()
- takes itself out of the xip array
- updates MyProc->xmin, but only ever moves it backwards, not forwards
The snapshot is invalidated on transaction commit/rollback as well as when doing
"prepare transaction".
Comments welcome.
General points:
Basics: patch has the correct format and applies cleanly (offset in pg_proc.h).
It includes no tests, but perhaps it should not. Really testing this requires
concurrency, and our test framework is not set up for that. We could add some
cheap tests to ensure the functions fail cleanly in some basic situations, but
that would be about it. The patch does not, but should, document the new
functions. The new functions also need comments preceding their definitions,
similar to those found on all other functions in procarray.c.
You add four elog() calls, all of which should be ereport() to facilitate
translation. Also consider suitable error codes. The error messages do not
follow style; in particular, they begin with capitals. See the style guide,
http://www.postgresql.org/docs/current/interactive/error-style-guide.html.
ImportSnapshot() has no paths that justify returning false. It should always
return true or throw a suitable error.
See below for various specific comments.
[...]
+ 	syncSnapshotChksums = (snapshotChksum*) ShmemInitStruct("SyncSnapshotChksums",
+ 					size, &found);
+ 	if (!found)
+ 		for (i = 0; i < PROCARRAY_MAXPROCS; i++)
+ 			memset(syncSnapshotChksums[i], 0, sizeof(snapshotChksum));
+ }
No need for a loop:
memset(syncSnapshotChksums, 0, PROCARRAY_MAXPROCS * sizeof(snapshotChksum));
"00000000000000000000" is a valid md5 message digest. To avoid always accepting
a snapshot with that digest, we would need to separately store a flag indicating
whether a given syncSnapshotChksums member is currently valid. Maybe that hole
is acceptable, though.
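One way to close that hole, sketched here as an (untested) alternative layout:

    typedef struct
    {
        bool        valid;      /* set when a digest is published,
                                 * cleared when it is revoked */
        char        digest[16];
    } snapshotChksum;

    /* import side: check the flag before trusting the digest */
    if (!syncSnapshotChksums[backendId].valid ||
        memcmp(cksum, syncSnapshotChksums[backendId].digest, 16) != 0)
        return false;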
+ void
+ InvalidateExportedSnapshot(void)
+ {
+ 	if (syncSnapshotChksums[MyBackendId][0] == '\0')
+ 		return;
The first byte of a valid message digest will regularly be zero.
+ 	memset(syncSnapshotChksums[MyBackendId], 0, sizeof(snapshotChksum));
[...]
+ 	pg_md5_binary(buf, bufsize_filled, (void *) &cksum);
+ 
+ 	memcpy(syncSnapshotChksums[MyBackendId], &cksum, sizeof(snapshotChksum));
No need for an intermediate variable:
pg_md5_binary(buf, bufsize_filled, &syncSnapshotChksums[MyBackendId]);
+ 	if (!XactReadOnly)
+ 		elog(WARNING, "A snapshot exporting function should be readonly.");
There are legitimate use cases for copying the snapshot of a read-write
transaction. Consider a task like calculating some summaries for insert into a
table. To speed this up, you might notionally partition the source data and
have each of several workers calculate each partition. I'd vote for removing
this warning entirely.
If we do keep the warning, it's the transaction that should be read-only, not a
"function". My first reaction to the text was "why not?". If the documentation
for pg_export_snapshot explains the reason, that might be enough. Primary wording
like "importing transactions will not see changes from this transaction", with
wording like yours in a HINT, might help.
[...]
+ 	if (GetCurrentTransactionNestLevel() != 1)
+ 		elog(ERROR, "Can only import a snapshot to a top level transaction.");
+ 
+ 	Assert(len > 0);
This is user-facing; "SELECT pg_import_snapshot('')" must not trigger an
assertion failure.
+ 	strncpy(s, VARDATA_ANY(snapshotData), len);
+ 	s[len] = '\0';
+ 
+ 	pg_md5_binary(s, strlen(s), (void *) &cksum);
You know strlen(s) == len; no need to recalculate.
+ 	if ((databaseId = parseOidFromText(&s, "did:")) == InvalidOid)
+ 		return false;
+ 	if (databaseId != MyDatabaseId)
+ 		return false;
+ 	if ((backendId = parseOidFromText(&s, "bid:")) == InvalidOid)
+ 		return false;
+ 	if (backendId == MyBackendId)
+ 		return false;
InvalidBackendId is -1. Is InvalidOid (0) in fact also invalid as a backend ID?
My backend IDs seem to start at 1 in practice. However, by inserting a
too-large bid, an attacker could have us overrun syncSnapshotChksums. For that,
we need to test 0 <= bid < PROCARRAY_MAXPROCS. Example crasher:
select pg_import_snapshot('did:16384 bid:4000000000 xmi:795 xma:795');
I'm thinking the order of validation should be:
1. Extract bid.
2. Validate bid. Throw error on out-of-bounds.
3. Confirm checksum. Throw error on mismatch.
4. Extract did. Assert databaseId != InvalidOid.
5. Error if databaseId != MyDatabaseId (sketched below).
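A minimal sketch of that ordering, reusing the patch's names (parseIntFromText
is a hypothetical helper; cksum is assumed to have been computed over the full
input before any parsing, as the patch already does):

    int backendId = parseIntFromText(&s, "bid:");

    if (backendId < 0 || backendId >= PROCARRAY_MAXPROCS)
        ereport(ERROR,
                (errmsg("invalid snapshot identifier")));

    if (memcmp(&cksum, syncSnapshotChksums[backendId],
               sizeof(snapshotChksum)) != 0)
        ereport(ERROR,
                (errmsg("snapshot does not exist or is no longer current")));

    databaseId = parseOidFromText(&s, "did:");
    Assert(OidIsValid(databaseId));
    if (databaseId != MyDatabaseId)
        ereport(ERROR,
                (errmsg("snapshot was exported in a different database")));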
[...]
+ 	if (memcmp(&cksum, syncSnapshotChksums[backendId], sizeof(snapshotChksum)))
+ 		return false;
[...]
+ 	i = 0;
+ 	while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
+ 	{
+ 		if (xid == GetTopTransactionIdIfAny())
+ 			continue;
+ 		snapshot->xip[i++] = xid;
This works on a virgin GetSnapshotData() snapshot, which always has enough space
(sized according to max_connections). A snapshot from CopySnapshot(), however,
has just enough space for the original usage. I get a SIGSEGV at this line with
the attached test script. If I change things around so xip starts non-NULL, I
get messages like these as the code scribbles past the end of the allocation:
WARNING: problem in alloc set TopTransactionContext: detected write past chunk end in block 0xc92f10, chunk 0xc93280
WARNING: problem in alloc set TopTransactionContext: detected write past chunk end in block 0xc92f10, chunk 0xc93350
The last CopySnapshot before the crash had this stack trace:
#0 CopySnapshot (snapshot=0xc27980) at snapmgr.c:203
#1 0x00000000008109e5 in PushActiveSnapshot (snap=0xc27980) at snapmgr.c:283
#2 0x00000000006e1c13 in PortalStart (portal=0xcaf7b8, params=0x0, snapshot=0x0) at pquery.c:508
#3 0x00000000006dc411 in exec_simple_query (query_string=0xd3ee38 "SELECT pg_import_snapshot('did:16384 bid:2 xmi:756 xma:756 xip:756 ')") at postgres.c:1020
#4 0x00000000006e0633 in PostgresMain (argc=2, argv=0xcb9760, username=0xc90658 "nm") at postgres.c:3939
#5 0x0000000000695520 in BackendRun (port=0xcbb7d0) at postmaster.c:3578
#6 0x0000000000694bec in BackendStartup (port=0xcbb7d0) at postmaster.c:3263
#7 0x0000000000691f85 in ServerLoop () at postmaster.c:1449
#8 0x000000000069174f in PostmasterMain (argc=1, argv=0xc8af30) at postmaster.c:1110
#9 0x000000000060e4e2 in main (argc=1, argv=0xc8af30) at main.c:199
Do you have a recipe for getting ImportSnapshot to see a non-CopySnapshot
snapshot that I could use to continue testing? All the things I tried ended
this way.
[...]
+ 	/*
+ 	 * No ProcArrayLock held here, we assume that a write is atomic. Also note
+ 	 * that MyProc->xmin can go backwards here. However this is safe because
+ 	 * the xmin we set here is the same as in the backend's proc->xmin whose
+ 	 * snapshot we are copying. At this very moment, anybody computing a
+ 	 * minimum will calculate at least this xmin as the overall xmin with or
+ 	 * without us setting MyProc->xmin to this value.
Though unlikely, the other backend may not even exist by now. Indeed, that
could have happened and RecentGlobalXmin advanced since we compared the
checksum. Not sure what the right locking is here, but something is needed.
+ 	 * Instead we must check to not go forward (we might have opened a cursor
+ 	 * in this transaction and still have its snapshot registered)
+ 	 */
I'm thinking this case should yield an ERROR. Otherwise, our net result would
be to silently adopt a snapshot that is neither our old self nor the target.
Maybe there's a use case for that, but none comes to my mind.
+ 	if (TransactionIdPrecedes(snapshot->xmin, MyProc->xmin))
+ 		MyProc->xmin = snapshot->xmin;
[...]
+ 	if (!IsolationUsesXactSnapshot())
+ 		elog(WARNING, "A snapshot importing transaction should be ISOLATION LEVEL SERIALIZABLE");
The message should convey that the test (correctly) allows either REPEATABLE
READ or SERIALIZABLE.
I guess the use case for pg_import_snapshot from READ COMMITTED would be
something like "DO $$BEGIN PERFORM pg_import_snapshot('...'); stuff; END$$;".
First off, would that work ("stuff" using the imported snapshot)? If it does
work, we should take the precedent of SET LOCAL and issue no warning. If it
doesn't work, then we should document that the snapshot only takes effect until
the next command, and probably keep this warning or something similar.
This warning can appear twice due to the two calls to ImportSnapshot() in
pg_import_snapshot(); that's a bit ugly.
+ 	return true;
+ }
[...]
+ Datum
+ pg_import_snapshot(PG_FUNCTION_ARGS)
+ {
+ 	bytea *snapshotData = PG_GETARG_BYTEA_P(0);
+ 	bool ret = true;
+ 
+ 	if (ActiveSnapshotSet())
+ 		ret = ImportSnapshot(snapshotData, GetActiveSnapshot());
+ 
+ 	ret &= ImportSnapshot(snapshotData, GetTransactionSnapshot());
Is it valid to scribble directly on snapshots like this?
+ 
+ 	PG_RETURN_BOOL(ret);
+ }
[...]
Attachments:
Noah, thank you for this excellent review. I have adopted almost all
of your suggestions (except for the documentation, which is still
missing). Please check the updated & attached patch for details.
On Fri, Jan 14, 2011 at 10:13 PM, Noah Misch <noah@leadboat.com> wrote:
"00000000000000000000" is a valid md5 message digest. To avoid always accepting
a snapshot with that digest, we would need to separately store a flag indicating
whether a given syncSnapshotChksums member is currently valid. Maybe that hole
is acceptable, though.
In the end I decided to store the md5 checksums as printable hex characters
in shmem. We now need a few more bytes for each checksum, but since a valid
hex digest never contains a NUL byte, a leading NUL reliably marks a slot
that holds no valid checksum.
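A sketch of that layout, assuming pg_md5_hash() (which writes the digest as 32
hex characters plus a terminating NUL) is used instead of pg_md5_binary():

    typedef char snapshotChksum[33];    /* 32 hex chars + NUL */

    /* export: a valid slot now always begins with a hex character */
    pg_md5_hash(buf, bufsize_filled, syncSnapshotChksums[MyBackendId]);

    /* a slot with a leading NUL byte simply holds no valid checksum */
    if (syncSnapshotChksums[backendId][0] == '\0')
        return false;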
+ 	if (!XactReadOnly)
+ 		elog(WARNING, "A snapshot exporting function should be readonly.");
There are legitimate use cases for copying the snapshot of a read-write
transaction. Consider a task like calculating some summaries for insert into a
table. To speed this up, you might notionally partition the source data and
have each of several workers calculate each partition. I'd vote for removing
this warning entirely.
Warning removed, adding this fact to the documentation only is fine with me.
InvalidBackendId is -1. Is InvalidOid (0) in fact also invalid as a backend ID?
Yes, there is a check in utils/init/postinit.c that makes sure that 0
is never a valid backendId. But anyway, I have now made this more
explicit by adding a separate parse routine for ints.
+ 	while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
+ 	{
+ 		if (xid == GetTopTransactionIdIfAny())
+ 			continue;
+ 		snapshot->xip[i++] = xid;
This works on a virgin GetSnapshotData() snapshot, which always has enough space
(sized according to max_connections). A snapshot from CopySnapshot(), however,
has just enough space for the original usage. I get a SIGSEGV at this line with
the attached test script. If I change things around so xip starts non-NULL, I
get messages like these as the code scribbles past the end of the allocation:
This has been fixed. xip/subxip are now allocated separately if necessary.
Though unlikely, the other backend may not even exist by now. Indeed, that
could have happened and RecentGlobalXmin advanced since we compared the
checksum. Not sure what the right locking is here, but something is needed.
Good catch. What I have done now is add a second check at the end of
ImportSnapshot(). At that point we have set up the new snapshot and
adjusted MyProc->xmin. If we succeed, then we are fine. If not, we
throw an error and invalidate the whole transaction.
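Presumably the new check has roughly this shape at the end of ImportSnapshot()
(a hedged sketch; the attached patch is authoritative):

    /*
     * Re-check the checksum now that the snapshot is installed and
     * MyProc->xmin has been adjusted. If the exporting backend went away
     * or revoked the snapshot in the meantime, abort the import.
     */
    if (memcmp(&cksum, syncSnapshotChksums[backendId],
               sizeof(snapshotChksum)) != 0)
        ereport(ERROR,
                (errmsg("exported snapshot went away while it was being imported")));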
+ 	 * Instead we must check to not go forward (we might have opened a cursor
+ 	 * in this transaction and still have its snapshot registered)
+ 	 */
I'm thinking this case should yield an ERROR. Otherwise, our net result would
be to silently adopt a snapshot that is neither our old self nor the target.
Maybe there's a use case for that, but none comes to my mind.
This can happen when you do:
DECLARE foo CURSOR FOR SELECT * FROM bar;
import snapshot...
FETCH 1 FROM foo;
I guess the use case for pg_import_snapshot from READ COMMITTED would be
something like "DO $$BEGIN PERFORM pg_import_snapshot('...'); stuff; END$$;".
First off, would that work ("stuff" using the imported snapshot)? If it does
work, we should take the precedent of SET LOCAL and issue no warning. If it
doesn't work, then we should document that the snapshot only takes effect until
the next command, and probably keep this warning or something similar.
No, this will also give you a new snapshot for every query, no matter
whether it is executed inside or outside a DO block.
+ Datum
+ pg_import_snapshot(PG_FUNCTION_ARGS)
+ {
+ 	bytea *snapshotData = PG_GETARG_BYTEA_P(0);
+ 	bool ret = true;
+ 
+ 	if (ActiveSnapshotSet())
+ 		ret = ImportSnapshot(snapshotData, GetActiveSnapshot());
+ 
+ 	ret &= ImportSnapshot(snapshotData, GetTransactionSnapshot());
Is it valid to scribble directly on snapshots like this?
I figured that previously executed code still has references pointing
to the snapshots, so we cannot just push a new snapshot on top but
really need to change the memory they point to.
I am also adding a test script that shows the difference between READ
COMMITTED and SERIALIZABLE in an importing transaction in a DO block.
It's based on the script you sent.
So thanks again and I'm looking forward to your next review... :-)
Joachim
Attachments:
syncsnapshots.2.diff (text/x-patch)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2dbac56..c96942a 100644
*** a/src/backend/storage/ipc/ipci.c
--- b/src/backend/storage/ipc/ipci.c
*************** CreateSharedMemoryAndSemaphores(bool mak
*** 124,129 ****
--- 124,130 ----
size = add_size(size, BTreeShmemSize());
size = add_size(size, SyncScanShmemSize());
size = add_size(size, AsyncShmemSize());
+ size = add_size(size, SyncSnapshotShmemSize());
#ifdef EXEC_BACKEND
size = add_size(size, ShmemBackendArraySize());
#endif
*************** CreateSharedMemoryAndSemaphores(bool mak
*** 228,233 ****
--- 229,235 ----
BTreeShmemInit();
SyncScanShmemInit();
AsyncShmemInit();
+ SyncSnapshotInit();
#ifdef EXEC_BACKEND
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 8b36df4..29c9426 100644
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 50,60 ****
--- 50,63 ----
#include "access/transam.h"
#include "access/xact.h"
#include "access/twophase.h"
+ #include "libpq/md5.h"
#include "miscadmin.h"
#include "storage/procarray.h"
#include "storage/spin.h"
#include "storage/standby.h"
#include "utils/builtins.h"
+ #include "utils/bytea.h"
+ #include "utils/memutils.h"
#include "utils/snapmgr.h"
*************** static int KnownAssignedXidsGetAndSetXmi
*** 159,164 ****
--- 162,171 ----
static TransactionId KnownAssignedXidsGetOldestXmin(void);
static void KnownAssignedXidsDisplay(int trace_level);
+ typedef char snapshotChksum[33];
+ static snapshotChksum *syncSnapshotChksums;
+ static Snapshot exportedSnapshot;
+
/*
* Report shared-memory space needed by CreateSharedProcArray.
*/
*************** KnownAssignedXidsDisplay(int trace_level
*** 3065,3067 ****
--- 3072,3424 ----
pfree(buf.data);
}
+
+
+ /*
+ * Report space needed for our shared memory area, which is basically an
+ * md5 checksum per connection.
+ */
+ Size
+ SyncSnapshotShmemSize(void)
+ {
+ return PROCARRAY_MAXPROCS * sizeof(snapshotChksum);
+ }
+
+ void
+ SyncSnapshotInit(void)
+ {
+ Size size;
+ bool found;
+
+ size = SyncSnapshotShmemSize();
+
+ syncSnapshotChksums = (snapshotChksum*) ShmemInitStruct("SyncSnapshotChksums",
+ size, &found);
+ if (!found)
+ memset(syncSnapshotChksums[0], 0, sizeof(snapshotChksum) * PROCARRAY_MAXPROCS);
+ }
+
+ void
+ InvalidateExportedSnapshot(void)
+ {
+ if (syncSnapshotChksums[MyBackendId][0] == '\0')
+ return;
+
+ memset(syncSnapshotChksums[MyBackendId], 0, sizeof(snapshotChksum));
+
+ UnregisterSnapshotFromOwner(exportedSnapshot, TopTransactionResourceOwner);
+ exportedSnapshot = NULL;
+ }
+
+ static void
+ snapshot_appendf(char ** buffer, int *bufsize_filled, int *bufsize_left, char *fmt, ...)
+ {
+ va_list ap;
+ int ret;
+
+ if (*bufsize_left < 64)
+ {
+ /* enlarge buffer by 1024 bytes */
+ *buffer = repalloc(*buffer, *bufsize_filled + 1024);
+ *bufsize_left = *bufsize_filled + 1024;
+ }
+
+ va_start(ap, fmt);
+ ret = vsnprintf(*buffer + *bufsize_filled, *bufsize_left, fmt, ap);
+ va_end(ap);
+
+ /* There shouldn't be any error, we leave enough room for the data that we
+ * are writing (only numbers basically). */
+ Assert(ret < *bufsize_left && ret > 0);
+
+ *bufsize_left -= ret;
+ *bufsize_filled += ret;
+
+ Assert(strlen(*buffer) == *bufsize_filled);
+ }
+
+ bytea *
+ ExportSnapshot(Snapshot snapshot)
+ {
+ int bufsize = 1024;
+ int bufsize_filled = 0; /* doesn't include NUL byte */
+ int bufsize_left = bufsize - bufsize_filled;
+ char *buf = (char *) palloc(bufsize);
+ int i;
+ TransactionId *children;
+ int nchildren;
+
+ /* In a subtransaction we don't see our open subxip values in the snapshot
+ * so they would be missing in the backend applying it. */
+ /* XXX is ERRCODE_INVALID_TRANSACTION_STATE more appropriate or less?
+ * It's a valid transaction state but invalid for the requested operation. */
+ if (GetCurrentTransactionNestLevel() != 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("can only export a snapshot from a top level transaction")));
+
+ /* We do however see our already committed subxip values and add them to
+ * the subxip array. */
+ nchildren = xactGetCommittedChildren(&children);
+
+ /* Write up all the data that we return */
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "did:%d ", MyDatabaseId);
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "bid:%d ", MyBackendId);
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "xmi:%d ", snapshot->xmin);
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "xma:%d ", snapshot->xmax);
+ /* Include our own transaction ID into the count if any. */
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "xcnt:%d ", TransactionIdIsValid(GetTopTransactionIdIfAny()) ?
+ snapshot->xcnt + 1 : snapshot->xcnt);
+ for (i = 0; i < snapshot->xcnt; i++)
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "xip:%d ", snapshot->xip[i]);
+ /*
+ * Finally add our own XID if we have one, since by definition we will
+ * still be running when the other transaction takes over the snapshot.
+ */
+ if (TransactionIdIsValid(GetTopTransactionIdIfAny()))
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "xip:%d ", GetTopTransactionIdIfAny());
+ if (snapshot->suboverflowed || snapshot->subxcnt + nchildren > TOTAL_MAX_CACHED_SUBXIDS)
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "sof:1 ");
+ else
+ {
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "sxcnt:%d ", snapshot->subxcnt + nchildren);
+ for (i = 0; i < snapshot->subxcnt; i++)
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "sxp:%d ", snapshot->subxip[i]);
+ /* Add already committed subtransactions. */
+ for (i = 0; i < nchildren; i++)
+ snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
+ "sxp:%d ", children[i]);
+ }
+
+ /*
+ * buf ends with a trailing space but we leave it in for simplicity. The
+ * parsing routines also depend on it.
+ */
+
+ /* Register the snapshot */
+ snapshot = RegisterSnapshotOnOwner(snapshot, TopTransactionResourceOwner);
+ exportedSnapshot = snapshot;
+
+ pg_md5_hash(buf, bufsize_filled, syncSnapshotChksums[MyBackendId]);
+
+ return DatumGetByteaP(DirectFunctionCall1(byteain, CStringGetDatum(buf)));
+ }
+
+ static Oid
+ parseOidFromText(char **s, const char *prefix)
+ {
+ char *n, *p = strstr(*s, prefix);
+ Oid oid;
+
+ if (!p)
+ return InvalidOid;
+ p += strlen(prefix);
+ n = strchr(p, ' ');
+ if (!n)
+ return InvalidOid;
+ *n = '\0';
+ oid = DatumGetObjectId(DirectFunctionCall1(oidin, CStringGetDatum(p)));
+ *s = n + 1;
+ return oid;
+ }
+
+ static int
+ parseIntFromText(char **s, const char *prefix, int notfound)
+ {
+ char *n, *p = strstr(*s, prefix);
+ int i;
+
+ if (!p)
+ return notfound;
+ p += strlen(prefix);
+ n = strchr(p, ' ');
+ if (!n)
+ return notfound;
+ *n = '\0';
+ i = DatumGetObjectId(DirectFunctionCall1(int4in, CStringGetDatum(p)));
+ *s = n + 1;
+ return i;
+ }
+
+ static bool
+ parseBoolFromText(char **s, const char *prefix)
+ {
+ return (bool) parseIntFromText(s, prefix, 0);
+ }
+
+ static TransactionId
+ parseXactFromText(char **s, const char *prefix)
+ {
+ char *n, *p = strstr(*s, prefix);
+ TransactionId xid;
+
+ if (!p)
+ return InvalidTransactionId;
+ p += strlen(prefix);
+ n = strchr(p, ' ');
+ if (!n)
+ return InvalidTransactionId;
+ *n = '\0';
+ xid = DatumGetTransactionId(DirectFunctionCall1(xidin, CStringGetDatum(p)));
+ *s = n + 1;
+ return xid;
+ }
+
+ bool
+ ImportSnapshot(bytea *snapshotData, Snapshot snapshot)
+ {
+ snapshotChksum cksum;
+ Oid databaseId;
+ BackendId backendId;
+ int i;
+ TransactionId xid;
+ int len = VARSIZE_ANY_EXHDR(snapshotData);
+ char *s = palloc(len + 1);
+ MemoryContext oldctx;
+
+ /* XXX is ERRCODE_INVALID_TRANSACTION_STATE more appropriate or less?
+ * It's a valid transaction state but invalid for the requested operation. */
+ if (GetCurrentTransactionNestLevel() != 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("can only import a snapshot to a top level transaction")));
+
+ if (len == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid data in snapshot information")));
+
+ strncpy(s, VARDATA_ANY(snapshotData), len);
+ s[len] = '\0';
+
+ pg_md5_hash(s, len, (void *) &cksum);
+
+ databaseId = parseOidFromText(&s, "did:");
+ backendId = (BackendId) parseIntFromText(&s, "bid:", (int) InvalidBackendId);
+
+ if (databaseId != MyDatabaseId
+ || backendId == InvalidBackendId
+ /*
+ * Make sure backendId is in a reasonable range before using it as an
+ * array subscript.
+ */
+ || backendId >= PROCARRAY_MAXPROCS
+ || backendId < 0
+ || backendId == MyBackendId
+ /*
+ * Lock considerations:
+ *
+ * syncSnapshotChksums[backendId] is only changed by the backend with ID
+ * backendID and read by another backend that is asked to import a
+ * snapshot.
+ * syncSnapshotChksums[backendId] contains either NUL bytes or the checksum
+ * of a valid snapshot.
+ * Every comparision to the checksum while it is being written or
+ * deleted is okay to fail, so we don't need to take a lock at all.
+ */
+ || strcmp(cksum, syncSnapshotChksums[backendId]) != 0)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid data in snapshot information")));
+ }
+
+ Assert(databaseId != InvalidOid);
+
+ oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+ snapshot->xmin = parseXactFromText(&s, "xmi:");
+ Assert(snapshot->xmin != InvalidTransactionId);
+ snapshot->xmax = parseXactFromText(&s, "xma:");
+ Assert(snapshot->xmax != InvalidTransactionId);
+
+ if (snapshot->copied)
+ {
+ int xcnt = parseIntFromText(&s, "xcnt:", 0);
+ /*
+ * Unfortunately CopySnapshot allocates just one large chunk of memory,
+ * and makes snapshot->xip then point past the end of the fixed-size
+ * snapshot data. That way we cannot just repalloc(snapshot->xip, ...).
+ * Neither can we just change the base address of the snapshot because
+ * this address might still be saved somewhere.
+ */
+ if (snapshot->xcnt < xcnt)
+ snapshot->xip = palloc(xcnt * sizeof(TransactionId));
+ }
+
+ i = 0;
+ while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
+ {
+ if (xid == GetTopTransactionIdIfAny())
+ continue;
+ snapshot->xip[i++] = xid;
+ }
+ snapshot->xcnt = i;
+
+ /*
+ * We only write "sof:1" if the snapshot overflowed. If not, then there is
+ * no "sof:x" entry at all and parseBoolFromText() will just return false.
+ */
+ snapshot->suboverflowed = parseBoolFromText(&s, "sof:");
+
+ if (!snapshot->suboverflowed)
+ {
+ if (snapshot->copied)
+ {
+ int sxcnt = parseIntFromText(&s, "sxcnt:", 0);
+ if (snapshot->subxcnt < sxcnt)
+ snapshot->subxip = palloc(sxcnt * sizeof(TransactionId));
+ }
+
+ i = 0;
+ while ((xid = parseXactFromText(&s, "sxp:")) != InvalidTransactionId)
+ snapshot->subxip[i++] = xid;
+ snapshot->subxcnt = i;
+ }
+
+ MemoryContextSwitchTo(oldctx);
+
+ /* Leave snapshot->curcid as is. */
+
+ /*
+ * No ProcArrayLock held here, we assume that a write is atomic. Also note
+ * that MyProc->xmin can go backwards here. However this is safe because
+ * the xmin we set here is the same as in the backend's proc->xmin whose
+ * snapshot we are copying. At this very moment, anybody computing a
+ * minimum will calculate at least this xmin as the overall xmin with or
+ * without us setting MyProc->xmin to this value.
+ * Instead we must check to not go forward (we might have opened a cursor
+ * in this transaction and still have its snapshot registered)
+ */
+ if (TransactionIdPrecedes(snapshot->xmin, MyProc->xmin))
+ MyProc->xmin = snapshot->xmin;
+
+ /*
+ * Check the checksum again to prevent a race condition. If the exporting
+ * backend invalidated its snapshot since we last checked or before we
+ * finish with this second check, then we fail here and error out, thus
+ * invalidating the snapshot we've built up.
+ * We could succeed here even though the exporting transaction is at the
+ * same time invalidating the checksum while we are checking it here for
+ * the second time. But even then we are good, because we have set up our
+ * MyProc record already.
+ */
+ if (strcmp(cksum, syncSnapshotChksums[backendId]) != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid data in snapshot information")));
+
+ return true;
+ }
+
+
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 45b92a0..f74ea76 100644
*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
***************
*** 29,34 ****
--- 29,35 ----
#include "access/xact.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+ #include "utils/builtins.h"
#include "utils/memutils.h"
#include "utils/memutils.h"
#include "utils/resowner.h"
*************** AtEarlyCommit_Snapshot(void)
*** 523,528 ****
--- 524,530 ----
TopTransactionResourceOwner);
registered_xact_snapshot = false;
+ InvalidateExportedSnapshot();
}
/*
*************** AtEOXact_Snapshot(bool isCommit)
*** 559,561 ****
--- 561,595 ----
FirstSnapshotSet = false;
registered_xact_snapshot = false;
}
+
+ Datum
+ pg_export_snapshot(PG_FUNCTION_ARGS)
+ {
+ bytea *snapshotData = ExportSnapshot(GetTransactionSnapshot());
+ PG_RETURN_BYTEA_P(snapshotData);
+ }
+
+ Datum
+ pg_import_snapshot(PG_FUNCTION_ARGS)
+ {
+ bytea *snapshotData = PG_GETARG_BYTEA_P(0);
+ bool ret = true;
+
+ if (ActiveSnapshotSet())
+ ret = ImportSnapshot(snapshotData, GetActiveSnapshot());
+
+ ret &= ImportSnapshot(snapshotData, GetTransactionSnapshot());
+
+ /*
+ * If we are in read committed mode then the next query will execute with a
+ * new snapshot making this function call quite useless.
+ */
+ if (!IsolationUsesXactSnapshot())
+ ereport(WARNING,
+ (errcode(ERRCODE_INAPPROPRIATE_ISOLATION_LEVEL_FOR_BRANCH_TRANSACTION),
+ errmsg("a snapshot importing transaction should be ISOLATION LEVEL SERIALIZABLE"),
+ errhint("A transaction of isolation level READ COMMITTED gives you a new snapshot for each query.")));
+
+ PG_RETURN_BOOL(ret);
+ }
+
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f8b5d4d..3a4bc6d 100644
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DATA(insert OID = 2171 ( pg_cancel_backe
*** 3391,3396 ****
--- 3391,3400 ----
DESCR("cancel a server process' current query");
DATA(insert OID = 2096 ( pg_terminate_backend PGNSP PGUID 12 1 0 0 f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ pg_terminate_backend _null_ _null_ _null_ ));
DESCR("terminate a server process");
+ DATA(insert OID = 3115 ( pg_export_snapshot PGNSP PGUID 12 1 0 0 f f f t f v 0 0 17 "" _null_ _null_ _null_ _null_ pg_export_snapshot _null_ _null_ _null_ ));
+ DESCR("export a snapshot");
+ DATA(insert OID = 3116 ( pg_import_snapshot PGNSP PGUID 12 1 0 0 f f f t f v 1 0 16 "17" _null_ _null_ _null_ _null_ pg_import_snapshot _null_ _null_ _null_ ));
+ DESCR("import a snapshot");
DATA(insert OID = 2172 ( pg_start_backup PGNSP PGUID 12 1 0 0 f f f t f v 2 0 25 "25 16" _null_ _null_ _null_ _null_ pg_start_backup _null_ _null_ _null_ ));
DESCR("prepare for taking an online backup");
DATA(insert OID = 2173 ( pg_stop_backup PGNSP PGUID 12 1 0 0 f f f t f v 0 0 25 "" _null_ _null_ _null_ _null_ pg_stop_backup _null_ _null_ _null_ ));
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 334f9a2..ee94451 100644
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
*************** extern void XidCacheRemoveRunningXids(Tr
*** 71,74 ****
--- 71,81 ----
int nxids, const TransactionId *xids,
TransactionId latestXid);
+ extern Size SyncSnapshotShmemSize(void);
+ extern void SyncSnapshotInit(void);
+
+ extern bytea *ExportSnapshot(Snapshot snapshot);
+ extern bool ImportSnapshot(bytea *snapshotData, Snapshot snapshot);
+ extern void InvalidateExportedSnapshot(void);
+
#endif /* PROCARRAY_H */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 6d784d2..3dcb924 100644
*** a/src/include/utils/snapmgr.h
--- b/src/include/utils/snapmgr.h
*************** extern void AtSubAbort_Snapshot(int leve
*** 43,46 ****
--- 43,49 ----
extern void AtEarlyCommit_Snapshot(void);
extern void AtEOXact_Snapshot(bool isCommit);
+ extern Datum pg_export_snapshot(PG_FUNCTION_ARGS);
+ extern Datum pg_import_snapshot(PG_FUNCTION_ARGS);
+
#endif /* SNAPMGR_H */
On Wed, Jan 19, 2011 at 12:12:39AM -0500, Joachim Wieland wrote:
Noah, thank you for this excellent review. I have adopted most
(almost all) of your suggestions (except for the documentation, which
is still missing). Please check the updated & attached patch for
details.
Great. I do get an assertion failure with this sequence:
BEGIN; SELECT pg_export_snapshot(); ROLLBACK;
SELECT 1;
TRAP: FailedAssertion("!(RegisteredSnapshots > 0)", File: "snapmgr.c", Line: 430)
Perhaps some cleanup is missing from a ROLLBACK path?
On Fri, Jan 14, 2011 at 10:13 PM, Noah Misch <noah@leadboat.com> wrote:
"00000000000000000000" is a valid md5 message digest. ?To avoid always accepting
a snapshot with that digest, we would need to separately store a flag indicating
whether a given syncSnapshotChksums member is currently valid. ?Maybe that hole
is acceptable, though.In the end I decided to store md5 checksums as printable characters in
shmem. We now need a few more bytes for each checksum but as soon as a
byte is NUL we know that it is not a valid checksum.
Just to clarify, I was envisioning something like:
typedef struct { bool valid; char value[16]; } snapshotChksum;
I don't object to the way you've done it, but someone else might not like the
extra marshalling between text and binary. Your call.
+      * Instead we must check to not go forward (we might have opened a cursor
+      * in this transaction and still have its snapshot registered)
+      */

I'm thinking this case should yield an ERROR. Otherwise, our net result would
be to silently adopt a snapshot that is neither our old self nor the target.
Maybe there's a use case for that, but none comes to my mind.

This can happen when you do:
DECLARE foo CURSOR FOR SELECT * FROM bar;
import snapshot...
FETCH 1 FROM foo;
I see now. You set snapshot->xmin unconditionally, but never move MyProc->xmin
forward, only backward. Yes; that seems just right.
I guess the use case for pg_import_snapshot from READ COMMITTED would be
something like "DO $$BEGIN PERFORM pg_import_snapshot('...'); stuff; END$$;".
First off, would that work (would "stuff" use the imported snapshot)? If it
does work, we should take the precedent of SET LOCAL and issue no warning. If
it doesn't work, then we should document that the snapshot takes effect only
until the next command and probably keep this warning or something similar.

No, this will also give you a new snapshot for every query, no matter
whether it is executed within or outside of a DO block.
You're right. Then consider "VALUES (pg_import_snapshot('...'), (SELECT
count(*) from t))" at READ COMMITTED. It works roughly as I'd guess; the
subquery uses the imported snapshot. If I flip the two expressions and do
"VALUES ((SELECT count(*) from t), pg_import_snapshot('...'))", the subquery
uses the normal snapshot. That makes sense, but I can't really see a use case
for it. A warning, like your code has today, seems appropriate.
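Spelled out, those two orderings are (snapshot value hypothetical):

    -- subquery evaluated after the import: sees the imported snapshot
    VALUES (pg_import_snapshot('\x6469643a...'), (SELECT count(*) FROM t));

    -- subquery evaluated first: sees the ordinary statement snapshot
    VALUES ((SELECT count(*) FROM t), pg_import_snapshot('\x6469643a...'));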
Is it valid to scribble directly on snapshots like this?
I figured that previously executed code still has references pointing
to the snapshots and so we cannot just push a new snapshot on top but
really need to change the memory where they are pointing to.
Okay. Had no special reason to believe otherwise, just didn't know.
I am also adding a test script that shows the difference of READ
COMMITTED and SERIALIZABLE in an importing transaction in a DO-Block.
It's based on the script you sent.
Thanks; that was handy. One thing I noticed is that the second "SELECT * FROM
kidseen" yields zero rows instead of yielding "ERROR: relation "kidseen" does
not exist". I changed things around per the attached "test-drop.pl", such that
the table is already gone in the ordinary snapshot, but still visible to the
imported snapshot. Note how the pg_class row is visible, but an actual attempt
to query the table fails. Must some kind of syscache invalidation follow the
snapshot alteration to make this behave normally?
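Roughly, the scenario in test-drop.pl plays out like this (paraphrased as
plain SQL; object names hypothetical):

    -- session A:
    BEGIN;
    SELECT pg_export_snapshot();    -- table t still exists here

    -- session B:
    DROP TABLE t;

    -- session C:
    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    SELECT pg_import_snapshot('...');                  -- A's snapshot
    SELECT relname FROM pg_class WHERE relname = 't';  -- row is visible
    SELECT * FROM t;                                   -- fails anyway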
General tests involving only DML seem to do the right thing. Subtransaction
handling looks sound. Patch runs cleanly according to Valgrind.
So thanks again and I'm looking forward to your next review... :-)
Hope this one helps, too.
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 8b36df4..29c9426 100644
+ bytea *
+ ExportSnapshot(Snapshot snapshot)
+ {
+ int bufsize = 1024;
+ int bufsize_filled = 0; /* doesn't include NUL byte */
+ int bufsize_left = bufsize - bufsize_filled;
+ char *buf = (char *) palloc(bufsize);
+ int i;
+ TransactionId *children;
+ int nchildren;
+
+ /* In a subtransaction we don't see our open subxip values in the snapshot
+ * so they would be missing in the backend applying it. */
+ /* XXX is ERRCODE_INVALID_TRANSACTION_STATE more appropriate or less?
+ * It's a valid transaction state but invalid for the requested operation. */
+ if (GetCurrentTransactionNestLevel() != 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("can only export a snapshot from a top level transaction")));

When I'm not sure about questions like this (as in, 99.8% of the time), I
usually grep one of the .po files for messages that seem similar, then see what
they used. In this case, the most-similar is "SET TRANSACTION ISOLATION LEVEL
must not be called in a subtransaction" in assign_XactIsoLevel(), which uses
ERRCODE_ACTIVE_SQL_TRANSACTION. I'd also consider making the message similar to
that one, like "cannot export a snapshot from a subtransaction".
+ static int
+ parseIntFromText(char **s, const char *prefix, int notfound)
+ {
+ char *n, *p = strstr(*s, prefix);
+ int i;
+
+ if (!p)
+ return notfound;
+ p += strlen(prefix);
+ n = strchr(p, ' ');
+ if (!n)
+ return notfound;
+ *n = '\0';
+ i = DatumGetObjectId(DirectFunctionCall1(int4in, CStringGetDatum(p)));

DatumGetInt32
+ bool
+ ImportSnapshot(bytea *snapshotData, Snapshot snapshot)
+ {
+ snapshotChksum cksum;
+ Oid databaseId;
+ BackendId backendId;
+ int i;
+ TransactionId xid;
+ int len = VARSIZE_ANY_EXHDR(snapshotData);
+ char *s = palloc(len + 1);
+ MemoryContext oldctx;
+
+ /* XXX is ERRCODE_INVALID_TRANSACTION_STATE more appropriate or less?
+ * It's a valid transaction state but invalid for the requested operation. */
+ if (GetCurrentTransactionNestLevel() != 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("can only import a snapshot to a top level transaction")));

See discussion of similar code above.
+
+ if (len == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid data in snapshot information")));
+
+ strncpy(s, VARDATA_ANY(snapshotData), len);
+ s[len] = '\0';
+
+ pg_md5_hash(s, len, (void *) &cksum);
+
+ databaseId = parseOidFromText(&s, "did:");
+ backendId = (BackendId) parseIntFromText(&s, "bid:", (int) InvalidBackendId);
+
+ if (databaseId != MyDatabaseId
+ || backendId == InvalidBackendId
+ /*
+ * Make sure backendId is in a reasonable range before using it as an
+ * array subscript.
+ */
+ || backendId >= PROCARRAY_MAXPROCS
+ || backendId < 0
+ || backendId == MyBackendId
+ /*
+ * Lock considerations:
+ *
+ * syncSnapshotChksums[backendId] is only changed by the backend with ID
+ * backendID and read by another backend that is asked to import a
+ * snapshot.
+ * syncSnapshotChksums[backendId] contains either NUL bytes or the checksum
+ * of a valid snapshot.
+ * Every comparision to the checksum while it is being written or
+ * deleted is okay to fail, so we don't need to take a lock at all.
+ */
+ || strcmp(cksum, syncSnapshotChksums[backendId]) != 0)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid data in snapshot information")));
+ }

That error message is good for the second, third, and fourth conditions: those
only happen when we're given something not generated by pg_export_snapshot().
The first and fifth tests deserve different error messages (perhaps "snapshot is
from a different database"; "cannot export and import a snapshot in the same
connection"). The sixth test is also a bit different; failure arises there when
the input is bogus or when the exporting transaction has ended. Failing to hold
that transaction open until all child transactions have imported might be a
common early mistake when using this feature. So, what about "snapshot not
found exported by any running transaction" for the sixth test?
+
+ Assert(databaseId != InvalidOid);
+
+ oldctx = MemoryContextSwitchTo(TopTransactionContext);

How about "MemoryContextAlloc(TopTransactionContext, ..." in place of the two
palloc calls and dropping this? Not sure.
+
+ snapshot->xmin = parseXactFromText(&s, "xmi:");
+ Assert(snapshot->xmin != InvalidTransactionId);
+ snapshot->xmax = parseXactFromText(&s, "xma:");
+ Assert(snapshot->xmax != InvalidTransactionId);
+
+ if (snapshot->copied)
+ {
+ int xcnt = parseIntFromText(&s, "xcnt:", 0);
+ /*
+ * Unfortunately CopySnapshot allocates just one large chunk of memory,
+ * and makes snapshot->xip then point past the end of the fixed-size
+ * snapshot data. That way we cannot just repalloc(snapshot->xip, ...).
+ * Neither can we just change the base address of the snapshot because
+ * this address might still be saved somewhere.
+ */
+ if (snapshot->xcnt < xcnt)
+ snapshot->xip = palloc(xcnt * sizeof(TransactionId));
+ }
+
+ i = 0;
+ while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
+ {
+ if (xid == GetTopTransactionIdIfAny())
+ continue;
+ snapshot->xip[i++] = xid;
+ }
+ snapshot->xcnt = i;
+
+ /*
+ * We only write "sof:1" if the snapshot overflowed. If not, then there is
+ * no "sof:x" entry at all and parseBoolFromText() will just return false.
+ */
+ snapshot->suboverflowed = parseBoolFromText(&s, "sof:");
+
+ if (!snapshot->suboverflowed)
+ {
+ if (snapshot->copied)
+ {
+ int sxcnt = parseIntFromText(&s, "sxcnt:", 0);
+ if (snapshot->subxcnt < sxcnt)
+ snapshot->subxip = palloc(sxcnt * sizeof(TransactionId));
+ }
+
+ i = 0;
+ while ((xid = parseXactFromText(&s, "sxp:")) != InvalidTransactionId)
+ snapshot->subxip[i++] = xid;
+ snapshot->subxcnt = i;
+ }
+
+ MemoryContextSwitchTo(oldctx);
+
+ /* Leave snapshot->curcid as is. */
+
+ /*
+ * No ProcArrayLock held here, we assume that a write is atomic. Also note
+ * that MyProc->xmin can go backwards here. However this is safe because
+ * the xmin we set here is the same as in the backend's proc->xmin whose
+ * snapshot we are copying. At this very moment, anybody computing a
+ * minimum will calculate at least this xmin as the overall xmin with or
+ * without us setting MyProc->xmin to this value.
+ * Instead we must check to not go forward (we might have opened a cursor
+ * in this transaction and still have its snapshot registered)
+ */
+ if (TransactionIdPrecedes(snapshot->xmin, MyProc->xmin))
+ MyProc->xmin = snapshot->xmin;

The write is atomic, in the sense that this process cannot be interrupted with
the update in progress. You'd still need a memory barrier to make other
processes see the change immediately on all architectures we support. I'd
suggest taking ProcArrayLock and adding a comment to the effect that we could
make it lockless, should we ever adopt such methods. I don't advise testing
those waters as part of this patch.
+
+ /*
+ * Check the checksum again to prevent a race condition. If the exporting
+ * backend invalidated its snapshot since we last checked or before we
+ * finish with this second check, then we fail here and error out, thus
+ * invalidating the snapshot we've built up.
+ * We could succeed here even though the exporting transaction is at the
+ * same time invalidating the checksum while we are checking it here for
+ * the second time. But even then we are good, because we have set up our
+ * MyProc record already.
+ */
+ if (strcmp(cksum, syncSnapshotChksums[backendId]) != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid data in snapshot information")));

I was trying to decide whether this is vulnerable to the ABA problem, and I
believe it is not. It is possible that the original exporting transaction
completed and a new transaction exported the same snapshot, all between our
first checksum comparison and now. However, this would also mean that no new
permanent transaction IDs got allocated in the gap. So I think we're good.
This message should continue to mirror the one from the earlier test, of course.
+
+ return true;
+ }

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 45b92a0..f74ea76 100644
+ Datum
+ pg_import_snapshot(PG_FUNCTION_ARGS)
+ {
+ bytea *snapshotData = PG_GETARG_BYTEA_P(0);
+ bool ret = true;
+
+ if (ActiveSnapshotSet())
+ ret = ImportSnapshot(snapshotData, GetActiveSnapshot());
+
+ ret &= ImportSnapshot(snapshotData, GetTransactionSnapshot());
+
+ /*
+ * If we are in read committed mode then the next query will execute with a
+ * new snapshot making this function call quite useless.
+ */
+ if (!IsolationUsesXactSnapshot())
+ ereport(WARNING,
+ (errcode(ERRCODE_INAPPROPRIATE_ISOLATION_LEVEL_FOR_BRANCH_TRANSACTION),
+ errmsg("a snapshot importing transaction should be ISOLATION LEVEL SERIALIZABLE"),
+ errhint("A transaction of isolation level READ COMMITTED gives you a new snapshot for each query.")));

Again, REPEATABLE READ is a perfectly good isolation level for this use. The
test allows it, but the error message does not.
nm
Attachments:
Here is a new version of the patch incorporating most of Noah's
suggestions. The patch now also adds documentation. Since I couldn't
really find a suitable section to document the two new functions, I
added a new one for now. Any better ideas where it should go?
On Thu, Jan 20, 2011 at 1:37 AM, Noah Misch <noah@leadboat.com> wrote:
Just to clarify, I was envisioning something like:
typedef struct { bool valid; char value[16]; } snapshotChksum;
I don't object to the way you've done it, but someone else might not like the
extra marshalling between text and binary. Your call.
I didn't like it in the beginning but later on I adopted your
proposal. I have also changed the locking to be more natural. Even
though we don't really need it, I am now grabbing shared ProcArrayLock
for any reads of shared memory and exclusive lock for writes. Of
course no additional lock is taken if the feature is not used.
You're right. Then consider "VALUES (pg_import_snapshot('...'), (SELECT
count(*) from t))" at READ COMMITTED. It works roughly as I'd guess; the
subquery uses the imported snapshot. If I flip the two expressions and do
"VALUES ((SELECT count(*) from t), pg_import_snapshot('...'))", the subquery
uses the normal snapshot. That makes sense, but I can't really see a use case
for it. A warning, like your code has today, seems appropriate.
Yeah, that would do what you wanted to illustrate, but it truly cannot
be considered the standard use case :-)
Is it valid to scribble directly on snapshots like this?
I figured that previously executed code still has references pointing
to the snapshots and so we cannot just push a new snapshot on top but
really need to change the memory where they are pointing to.

Okay. Had no special reason to believe otherwise, just didn't know.
This is one part where I'd like someone more experienced than me to look into it.
Thanks; that was handy. One thing I noticed is that the second "SELECT * FROM
kidseen" yields zero rows instead of yielding "ERROR: relation "kidseen" does
not exist". I changed things around per the attached "test-drop.pl", such that
the table is already gone in the ordinary snapshot, but still visible to the
imported snapshot. Note how the pg_class row is visible, but an actual attempt
to query the table fails. Must some kind of syscache invalidation follow the
snapshot alteration to make this behave normally?
As discussed with Noah off-list, this is just not an MVCC-safe
operation. You could hit this in a regular SERIALIZABLE transaction
as well: somebody else can drop a table and you won't be able to
access it anymore. This is also the reason why pg_dump, right at the
beginning, takes a shared lock on every table that it will dump later
on.
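Schematically, that pg_dump-style defense is (table names hypothetical):

    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    -- lock every table to be dumped before reading any data
    LOCK TABLE t1 IN ACCESS SHARE MODE;
    LOCK TABLE t2 IN ACCESS SHARE MODE;
    -- a concurrent DROP TABLE now blocks until this transaction ends,
    -- so the snapshot can never point at a dropped table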
Thanks for the review Noah,
Joachim
Attachments:
syncsnapshots.3.diff (text/x-patch)
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index ed2039c..4169594 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*************** FOR EACH ROW EXECUTE PROCEDURE suppress_
*** 14761,14764 ****
--- 14761,14860 ----
<xref linkend="SQL-CREATETRIGGER">.
</para>
</sect1>
+
+ <sect1 id="functions-snapshotsync">
+ <title>Snapshot Synchronization Functions</title>
+
+ <indexterm>
+ <primary>pg_export_snapshot</primary>
+ </indexterm>
+ <indexterm>
+ <primary>pg_import_snapshot</primary>
+ </indexterm>
+
+ <para>
+ <productname>PostgreSQL</> allows different sessions to synchronize their
+ snapshots. A database snapshot determines which data is visible to
+ the client that is using this snapshot. Synchronized snapshots are necessary if
+ two clients need to see the same content in the database. If these two clients
+ just connected to the database and opened their transactions, then they could
+ never be sure that there was no data modification right between both
+ connections. To solve this, <productname>PostgreSQL</> offers the two functions
+ <function>pg_export_snapshot</> and <function>pg_import_snapshot</>.
+ </para>
+ <para>
+ The idea is that one client retrieves the information about the snapshot that it
+ is currently using and then the second client passes this information back to
+ the server. The database server will then modify the second client's snapshot
+ to match the snapshot of the first and from then on both can see the identical
+ data in the database.
+ </para>
+ <para>
+ Note that a snapshot can only be imported as long as the transaction that
+ exported it originally is held open. Also note that even after the
+ synchronization both clients still run their own independent transactions.
+ As a consequence, even though synchronized with respect to reading pre-existing
+ data, both transactions won't be able to see each other's uncommitted data.
+ </para>
+ <table id="functions-snapshot-synchronization">
+ <title>Snapshot Synchronization Functions</title>
+ <tgroup cols="3">
+ <thead>
+ <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry>
+ <literal><function>pg_export_snapshot()</function></literal>
+ </entry>
+ <entry><type>bytea</type></entry>
+ <entry>Export the snapshot and return its textual representation</entry>
+ </row>
+ <row>
+ <entry>
+ <literal><function>pg_import_snapshot(<parameter>snapshotData</> <type>bytea</>)</function></literal>
+ </entry>
+ <entry><type>boolean</type></entry>
+ <entry>Import a previously exported snapshot</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ The function <function>pg_export_snapshot</> does not take an argument
+ and returns the current snapshot information as <type>bytea</type> data.
+ You should not assume any maximal length for this data. Also make sure that
+ the transaction that exports the snapshot will continue to live at least until
+ the snapshot has been imported into another connection. As soon as the
+ transaction ends, the exported snapshot becomes invalid and cannot be imported
+ anymore. You should not rely on what exactly the function returns. If you need
+ to access any transaction IDs of the current transaction, please refer to <xref
+ linkend="functions-txid-snapshot"> instead.
+ </para>
+ <programlisting>
+ BEGIN TRANSACTION READ ONLY;
+ SELECT pg_export_snapshot();
+ </programlisting>
+ <para>
+ <function>pg_import_snapshot</> takes an argument of type <type>bytea</type>
+ and returns a <type>boolean</type> value. Actually the function will
+ always either return true or throw an error. As the argument you need to pass
+ the exact value that has been returned by <function>pg_export_snapshot</>.
+ The function can only be used reasonably if your <literal>TRANSACTION ISOLATION
+ LEVEL</> (see <xref
+ linkend="transaction-iso">) is set to <literal>SERIALIZABLE</> (or
+ <literal>REPEATABLE READ</> which will implicitly use <literal>SERIALIZABLE</>).
+ The reason is that if your transaction is set to <literal>READ COMMITTED</>,
+ it will acquire a new snapshot for the next query and will not use the one that
+ has just been imported.
+ </para>
+ <programlisting>
+ BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ SELECT pg_import_snapshot('\x6469643a31206269643a3320786d693a31303133313420786d613a3130313331342078636e743a30207378636e743a3020');
+ </programlisting>
+ </sect1>
</chapter>
+
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2dbac56..c96942a 100644
*** a/src/backend/storage/ipc/ipci.c
--- b/src/backend/storage/ipc/ipci.c
*************** CreateSharedMemoryAndSemaphores(bool mak
*** 124,129 ****
--- 124,130 ----
size = add_size(size, BTreeShmemSize());
size = add_size(size, SyncScanShmemSize());
size = add_size(size, AsyncShmemSize());
+ size = add_size(size, SyncSnapshotShmemSize());
#ifdef EXEC_BACKEND
size = add_size(size, ShmemBackendArraySize());
#endif
*************** CreateSharedMemoryAndSemaphores(bool mak
*** 228,233 ****
--- 229,235 ----
BTreeShmemInit();
SyncScanShmemInit();
AsyncShmemInit();
+ SyncSnapshotInit();
#ifdef EXEC_BACKEND
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 8b36df4..5caddec 100644
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 50,60 ****
--- 50,63 ----
#include "access/transam.h"
#include "access/xact.h"
#include "access/twophase.h"
+ #include "libpq/md5.h"
#include "miscadmin.h"
#include "storage/procarray.h"
#include "storage/spin.h"
#include "storage/standby.h"
#include "utils/builtins.h"
+ #include "utils/bytea.h"
+ #include "utils/memutils.h"
#include "utils/snapmgr.h"
*************** static int KnownAssignedXidsGetAndSetXmi
*** 158,163 ****
--- 161,175 ----
TransactionId xmax);
static TransactionId KnownAssignedXidsGetOldestXmin(void);
static void KnownAssignedXidsDisplay(int trace_level);
+ static void InvalidateExportedSnapshot(void);
+
+ typedef char snapshotChksumPlain[16];
+ typedef struct {
+ snapshotChksumPlain chksum;
+ bool valid;
+ } snapshotChksum;
+ static snapshotChksum *syncSnapshotChksums;
+ static Snapshot exportedSnapshot;
/*
* Report shared-memory space needed by CreateSharedProcArray.
*************** ProcArrayRemove(PGPROC *proc, Transactio
*** 350,355 ****
--- 362,373 ----
void
ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
{
+ /*
+ * Release any exported snapshots.
+ */
+ if (exportedSnapshot)
+ InvalidateExportedSnapshot();
+
if (TransactionIdIsValid(latestXid))
{
/*
*************** void
*** 415,420 ****
--- 433,445 ----
ProcArrayClearTransaction(PGPROC *proc)
{
/*
+ * Release any exported snapshots (Note that this will take exclusive
+ * ProcArrayLock if the feature has been used).
+ */
+ if (exportedSnapshot)
+ InvalidateExportedSnapshot();
+
+ /*
* We can skip locking ProcArrayLock here, because this action does not
* actually change anyone's view of the set of running XIDs: our entry is
* duplicate with the gxact that has already been inserted into the
*************** KnownAssignedXidsDisplay(int trace_level
*** 3065,3067 ****
--- 3090,3469 ----
pfree(buf.data);
}
+
+
+ /*
+ * Report space needed for our shared memory area, which is basically an
+ * md5 checksum per connection.
+ */
+ Size
+ SyncSnapshotShmemSize(void)
+ {
+ return PROCARRAY_MAXPROCS * sizeof(snapshotChksum);
+ }
+
+ void
+ SyncSnapshotInit(void)
+ {
+ Size size;
+ bool found;
+ int i;
+
+ size = SyncSnapshotShmemSize();
+
+ syncSnapshotChksums = (snapshotChksum*) ShmemInitStruct("SyncSnapshotChksums",
+ size, &found);
+ if (!found)
+ for (i = 0; i < PROCARRAY_MAXPROCS; i++)
+ syncSnapshotChksums[i].valid = false;
+ }
+
+ static void
+ InvalidateExportedSnapshot(void)
+ {
+ /* See lock considerations at ImportSnapshot() */
+ LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+ syncSnapshotChksums[MyBackendId].valid = false;
+ LWLockRelease(ProcArrayLock);
+
+ UnregisterSnapshotFromOwner(exportedSnapshot, TopTransactionResourceOwner);
+ exportedSnapshot = NULL;
+ }
+
+ static void
+ snapshot_appendf(char ** buffer, int *bufsize_filled, int *bufsize_left, char *fmt, ...)
+ {
+ va_list ap;
+ int ret;
+
+ if (*bufsize_left < 64)
+ {
+ /* enlarge buffer by 1024 bytes */
+ *buffer = repalloc(*buffer, *bufsize_filled + 1024);
+ *bufsize_left = *bufsize_filled + 1024;
+ }
+
+ va_start(ap, fmt);
+ ret = vsnprintf(*buffer + *bufsize_filled, *bufsize_left, fmt, ap);
+ va_end(ap);
+
+ /* There shouldn't be any error, we leave enough room for the data that we
+ * are writing (only numbers basically). */
+ Assert(ret < *bufsize_left && ret > 0);
+
+ *bufsize_left -= ret;
+ *bufsize_filled += ret;
+
+ Assert(strlen(*buffer) == *bufsize_filled);
+ }
+
+ bytea *
+ ExportSnapshot(Snapshot snapshot)
+ {
+ #define SNAPSHOT_APPEND(x, y) \
+ (snapshot_appendf(&buf, &bufsize_filled, &bufsize_left, (x), (y)))
+ int bufsize = 1024;
+ int bufsize_filled = 0; /* doesn't include NUL byte */
+ int bufsize_left = bufsize - bufsize_filled;
+ char *buf = (char *) palloc(bufsize);
+ TransactionId *children;
+ int i;
+ int nchildren;
+
+ /* In a subtransaction we don't see our open subxip values in the snapshot
+ * so they would be missing in the backend applying it. */
+ if (GetCurrentTransactionNestLevel() != 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+ errmsg("cannot export a snapshot from a subtransaction")));
+
+ /* We do however see our already committed subxip values and add them to
+ * the subxip array. */
+ nchildren = xactGetCommittedChildren(&children);
+
+ /* Write up all the data that we return */
+ SNAPSHOT_APPEND("did:%d ", MyDatabaseId);
+ SNAPSHOT_APPEND("bid:%d ", MyBackendId);
+ SNAPSHOT_APPEND("xmi:%d ", snapshot->xmin);
+ SNAPSHOT_APPEND("xma:%d ", snapshot->xmax);
+ /* Include our own transaction ID into the count if any. */
+ SNAPSHOT_APPEND("xcnt:%d ", TransactionIdIsValid(GetTopTransactionIdIfAny()) ?
+ snapshot->xcnt + 1 : snapshot->xcnt);
+ for (i = 0; i < snapshot->xcnt; i++)
+ SNAPSHOT_APPEND("xip:%d ", snapshot->xip[i]);
+ /*
+ * Finally add our own XID if we have one, since by definition we will
+ * still be running when the other transaction takes over the snapshot.
+ */
+ if (TransactionIdIsValid(GetTopTransactionIdIfAny()))
+ SNAPSHOT_APPEND("xip:%d ", GetTopTransactionIdIfAny());
+ if (snapshot->suboverflowed || snapshot->subxcnt + nchildren > TOTAL_MAX_CACHED_SUBXIDS)
+ SNAPSHOT_APPEND("sof:%d ", 1);
+ else
+ {
+ SNAPSHOT_APPEND("sxcnt:%d ", snapshot->subxcnt + nchildren);
+ for (i = 0; i < snapshot->subxcnt; i++)
+ SNAPSHOT_APPEND("sxp:%d ", snapshot->subxip[i]);
+ /* Add already committed subtransactions. */
+ for (i = 0; i < nchildren; i++)
+ SNAPSHOT_APPEND("sxp:%d ", children[i]);
+ }
+
+ /*
+ * buf ends with a trailing space but we leave it in for simplicity. The
+ * parsing routines also depend on it.
+ */
+
+ /* Register the snapshot */
+ snapshot = RegisterSnapshotOnOwner(snapshot, TopTransactionResourceOwner);
+ exportedSnapshot = snapshot;
+
+ /* See lock considerations at ImportSnapshot() */
+ LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+ if (!pg_md5_binary(buf, bufsize_filled, syncSnapshotChksums[MyBackendId].chksum))
+ {
+ LWLockRelease(ProcArrayLock);
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+ }
+ syncSnapshotChksums[MyBackendId].valid = true;
+ LWLockRelease(ProcArrayLock);
+
+ return DatumGetByteaP(DirectFunctionCall1(byteain, CStringGetDatum(buf)));
+ #undef SNAPSHOT_APPEND
+ }
+
+ static Oid
+ parseOidFromText(char **s, const char *prefix)
+ {
+ char *n, *p = strstr(*s, prefix);
+ Oid oid;
+
+ if (!p)
+ return InvalidOid;
+ p += strlen(prefix);
+ n = strchr(p, ' ');
+ if (!n)
+ return InvalidOid;
+ *n = '\0';
+ oid = DatumGetObjectId(DirectFunctionCall1(oidin, CStringGetDatum(p)));
+ *s = n + 1;
+ return oid;
+ }
+
+ static int
+ parseIntFromText(char **s, const char *prefix, int notfound)
+ {
+ char *n, *p = strstr(*s, prefix);
+ int i;
+
+ if (!p)
+ return notfound;
+ p += strlen(prefix);
+ n = strchr(p, ' ');
+ if (!n)
+ return notfound;
+ *n = '\0';
+ i = DatumGetInt32(DirectFunctionCall1(int4in, CStringGetDatum(p)));
+ *s = n + 1;
+ return i;
+ }
+
+ static bool
+ parseBoolFromText(char **s, const char *prefix)
+ {
+ return (bool) parseIntFromText(s, prefix, 0);
+ }
+
+ static TransactionId
+ parseXactFromText(char **s, const char *prefix)
+ {
+ char *n, *p = strstr(*s, prefix);
+ TransactionId xid;
+
+ if (!p)
+ return InvalidTransactionId;
+ p += strlen(prefix);
+ n = strchr(p, ' ');
+ if (!n)
+ return InvalidTransactionId;
+ *n = '\0';
+ xid = DatumGetTransactionId(DirectFunctionCall1(xidin, CStringGetDatum(p)));
+ *s = n + 1;
+ return xid;
+ }
+
+ /*
+ * Import a previously exported snapshot: First check the data, then compare
+ * the checksum with the exported snapshots. Finally update the snapshot that
+ * we get passed in.
+ *
+ * Lock considerations:
+ *
+ * syncSnapshotChksums[backendId] is only changed by the backend with ID
+ * backendID and read by another backend that is asked to import a snapshot.
+ *
+ * We'd not really need to take a lock because of the access pattern. Instead we
+ * need to take one because we might run on an architecture with weak memory
+ * ordering: Usually we first invalidate our exported snapshot and then release
+ * our ProcArray entry. A different backend however could see the updated
+ * ProcArray entry but still see our checksum as valid. This can be solved with
+ * a few shared locks but in order to make the access least awkward, we take
+ * exclusive lock on writes and shared lock on reads.
+ */
+ bool
+ ImportSnapshot(bytea *snapshotData, Snapshot snapshot)
+ {
+ BackendId backendId;
+ snapshotChksumPlain cksum;
+ Oid databaseId;
+ int i;
+ int len = VARSIZE_ANY_EXHDR(snapshotData);
+ char *s = palloc(len + 1);
+ int sxcnt, xcnt;
+ TransactionId xid;
+
+ if (GetCurrentTransactionNestLevel() != 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+ errmsg("cannot import a snapshot into a subtransaction")));
+
+ if (len == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid data in snapshot information")));
+
+ strncpy(s, VARDATA_ANY(snapshotData), len);
+ s[len] = '\0';
+
+ if (!pg_md5_binary(s, len, (void *) &cksum))
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+
+ databaseId = parseOidFromText(&s, "did:");
+ backendId = (BackendId) parseIntFromText(&s, "bid:", (int) InvalidBackendId);
+
+ if (databaseId != MyDatabaseId)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("cannot import snapshot from a different database")));
+
+ if (backendId == InvalidBackendId
+ /*
+ * Make sure backendId is in a reasonable range before using it as an
+ * array subscript.
+ */
+ || backendId >= PROCARRAY_MAXPROCS
+ || backendId < 0)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid data in snapshot information")));
+ }
+
+ if (backendId == MyBackendId)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("cannot export and import a snapshot in the same connection")));
+
+ /*
+ * Verify that the checksum matches before doing any more work. We will
+ * later verify again to make sure that the exporting transaction has not
+ * yet terminated by then. We don't want to optimize this into just one
+ * verification call at the very end because the instructions that follow
+ * this comment rely on a sane format of the textual snapshot data. In
+ * particular they assume that there are not more XactIds than
+ * advertised...
+ */
+ if (syncSnapshotChksums[backendId].valid == false
+ || memcmp(cksum, syncSnapshotChksums[backendId].chksum,
+ sizeof(syncSnapshotChksums[backendId].chksum)) != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("snapshot not found exported by any running transaction")));
+
+ Assert(databaseId != InvalidOid);
+
+ snapshot->xmin = parseXactFromText(&s, "xmi:");
+ Assert(snapshot->xmin != InvalidTransactionId);
+ snapshot->xmax = parseXactFromText(&s, "xma:");
+ Assert(snapshot->xmax != InvalidTransactionId);
+
+ xcnt = parseIntFromText(&s, "xcnt:", 0);
+ /*
+ * Unfortunately CopySnapshot allocates just one large chunk of memory, and
+ * makes snapshot->xip then point past the end of the fixed-size snapshot
+ * data. That way we cannot just repalloc(snapshot->xip, ...). Neither can
+ * we just change the base address of the snapshot because this address
+ * might still be saved somewhere.
+ */
+ if (snapshot->copied && snapshot->xcnt < xcnt)
+ snapshot->xip = MemoryContextAlloc(TopTransactionContext,
+ xcnt * sizeof(TransactionId));
+
+ i = 0;
+ while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
+ {
+ if (xid == GetTopTransactionIdIfAny())
+ continue;
+ snapshot->xip[i++] = xid;
+ }
+ snapshot->xcnt = i;
+ Assert(snapshot->xcnt == xcnt);
+
+ /*
+ * We only write "sof:1" if the snapshot overflowed. If not, then there is
+ * no "sof:x" entry at all and parseBoolFromText() will just return false.
+ */
+ snapshot->suboverflowed = parseBoolFromText(&s, "sof:");
+
+ if (!snapshot->suboverflowed)
+ {
+ sxcnt = parseIntFromText(&s, "sxcnt:", 0);
+ if (snapshot->copied && snapshot->subxcnt < sxcnt)
+ snapshot->subxip = MemoryContextAlloc(TopTransactionContext,
+ sxcnt * sizeof(TransactionId));
+
+ i = 0;
+ while ((xid = parseXactFromText(&s, "sxp:")) != InvalidTransactionId)
+ snapshot->subxip[i++] = xid;
+ snapshot->subxcnt = i;
+ Assert(snapshot->subxcnt == sxcnt);
+ }
+
+ /* Leave snapshot->curcid as is. */
+
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+ /*
+ * Note that MyProc->xmin can go backwards here. However this is safe
+ * because the xmin we set here is the same as in the backend's proc->xmin
+ * whose snapshot we are copying. At this very moment, anybody computing a
+ * minimum will calculate at least this xmin as the overall xmin with or
+ * without us setting MyProc->xmin to this value.
+ * Instead we must check to not go forward (we might have opened a cursor
+ * in this transaction and still have its snapshot registered)
+ */
+ if (TransactionIdPrecedes(snapshot->xmin, MyProc->xmin))
+ MyProc->xmin = snapshot->xmin;
+
+ /*
+ * Check the checksum again to prevent a race condition. If the exporting
+ * backend invalidated its snapshot since we last checked then we fail here
+ * and error out, thus invalidating the snapshot we've built up.
+ */
+ if (syncSnapshotChksums[backendId].valid == false
+ || memcmp(cksum, syncSnapshotChksums[backendId].chksum,
+ sizeof(syncSnapshotChksums[backendId].chksum)) != 0)
+ {
+ LWLockRelease(ProcArrayLock);
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid data in snapshot information")));
+ }
+ LWLockRelease(ProcArrayLock);
+
+ return true;
+ }
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 45b92a0..c5ef1ab 100644
*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
*************** AtEOXact_Snapshot(bool isCommit)
*** 559,561 ****
--- 559,593 ----
FirstSnapshotSet = false;
registered_xact_snapshot = false;
}
+
+ Datum
+ pg_export_snapshot(PG_FUNCTION_ARGS)
+ {
+ bytea *snapshotData = ExportSnapshot(GetTransactionSnapshot());
+ PG_RETURN_BYTEA_P(snapshotData);
+ }
+
+ Datum
+ pg_import_snapshot(PG_FUNCTION_ARGS)
+ {
+ bytea *snapshotData = PG_GETARG_BYTEA_P(0);
+ bool ret = true;
+
+ if (ActiveSnapshotSet())
+ ret = ImportSnapshot(snapshotData, GetActiveSnapshot());
+
+ ret &= ImportSnapshot(snapshotData, GetTransactionSnapshot());
+
+ /*
+ * If we are in read committed mode then the next query will execute with a
+ * new snapshot making this function call quite useless.
+ */
+ if (!IsolationUsesXactSnapshot())
+ ereport(WARNING,
+ (errcode(ERRCODE_INAPPROPRIATE_ISOLATION_LEVEL_FOR_BRANCH_TRANSACTION),
+ errmsg("a snapshot importing transaction should be ISOLATION LEVEL SERIALIZABLE/REPEATABLE READ"),
+ errhint("A transaction of isolation level READ COMMITTED/READ UNCOMMITED gives you a new snapshot for each query.")));
+
+ PG_RETURN_BOOL(ret);
+ }
+
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f8b5d4d..3a4bc6d 100644
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DATA(insert OID = 2171 ( pg_cancel_backe
*** 3391,3396 ****
--- 3391,3400 ----
DESCR("cancel a server process' current query");
DATA(insert OID = 2096 ( pg_terminate_backend PGNSP PGUID 12 1 0 0 f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ pg_terminate_backend _null_ _null_ _null_ ));
DESCR("terminate a server process");
+ DATA(insert OID = 3115 ( pg_export_snapshot PGNSP PGUID 12 1 0 0 f f f t f v 0 0 17 "" _null_ _null_ _null_ _null_ pg_export_snapshot _null_ _null_ _null_ ));
+ DESCR("export a snapshot");
+ DATA(insert OID = 3116 ( pg_import_snapshot PGNSP PGUID 12 1 0 0 f f f t f v 1 0 16 "17" _null_ _null_ _null_ _null_ pg_import_snapshot _null_ _null_ _null_ ));
+ DESCR("import a snapshot");
DATA(insert OID = 2172 ( pg_start_backup PGNSP PGUID 12 1 0 0 f f f t f v 2 0 25 "25 16" _null_ _null_ _null_ _null_ pg_start_backup _null_ _null_ _null_ ));
DESCR("prepare for taking an online backup");
DATA(insert OID = 2173 ( pg_stop_backup PGNSP PGUID 12 1 0 0 f f f t f v 0 0 25 "" _null_ _null_ _null_ _null_ pg_stop_backup _null_ _null_ _null_ ));
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 334f9a2..9f47a2e 100644
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
*************** extern void XidCacheRemoveRunningXids(Tr
*** 71,74 ****
--- 71,80 ----
int nxids, const TransactionId *xids,
TransactionId latestXid);
+ extern Size SyncSnapshotShmemSize(void);
+ extern void SyncSnapshotInit(void);
+
+ extern bytea *ExportSnapshot(Snapshot snapshot);
+ extern bool ImportSnapshot(bytea *snapshotData, Snapshot snapshot);
+
#endif /* PROCARRAY_H */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 6d784d2..3dcb924 100644
*** a/src/include/utils/snapmgr.h
--- b/src/include/utils/snapmgr.h
*************** extern void AtSubAbort_Snapshot(int leve
*** 43,46 ****
--- 43,49 ----
extern void AtEarlyCommit_Snapshot(void);
extern void AtEOXact_Snapshot(bool isCommit);
+ extern Datum pg_export_snapshot(PG_FUNCTION_ARGS);
+ extern Datum pg_import_snapshot(PG_FUNCTION_ARGS);
+
#endif /* SNAPMGR_H */
On Sun, Jan 30, 2011 at 12:36:12PM -0500, Joachim Wieland wrote:
Here is a new version of the patch incorporating most of Noah's
suggestions. The patch now also adds documentation. Since I couldn't
really find a suitable section to document the two new functions, I
added a new one for now. Any better ideas where it should go?
No major opinion here. The second best fit after a new section would be System
Administration Functions, inasmuch as that's the bucket for miscellaneous
functions. I have an idea for improving structure here, but that would go
beyond the purview of this patch.
This version looks sound; I've marked it Ready for Committer.
Thanks,
nm
Looking into this patch.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
I have two questions:
1. why are you using the expansible char array stuff instead of using
the StringInfo facility?
2. is md5 the most appropriate digest for this? If you need a
cryptographically secure hash, do we need something stronger? If not,
why not just use hash_any?
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, Feb 18, 2011 at 8:57 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
1. why are you using the expansible char array stuff instead of using
the StringInfo facility?
2. is md5 the most appropriate digest for this? If you need a
cryptographically secure hash, do we need something stronger? If not,
why not just use hash_any?
We don't need a cryptographically secure hash.
There is no special reason why it is the way it is; I just didn't
think of the better alternatives that you are proposing.
Should I send an updated patch? Anything else?
Thanks for the review,
Joachim
Excerpts from Joachim Wieland's message of Fri Feb 18 21:41:51 -0300 2011:
On Fri, Feb 18, 2011 at 8:57 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
1. why are you using the expansible char array stuff instead of using
the StringInfo facility?
2. is md5 the most appropriate digest for this? If you need a
cryptographically secure hash, do we need something stronger? If not,
why not just use hash_any?
We don't need a cryptographically secure hash.
Ok.
Should I send an updated patch? Anything else?
No need, I'm halfway through already.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, 2011-02-18 at 16:57 -0300, Alvaro Herrera wrote:
2. is md5 the most appropriate digest for this? If you need a
cryptographically secure hash, do we need something stronger? If not,
why not just use hash_any?
MD5 is probably more appropriate than hash_any, because the latter is
optimized for speed and collision avoidance and doesn't have a
guaranteed external format. The only consideration against MD5 might be
that it would make us look quite lame. We should probably provide
builtin SHA1 and SHA2 functions for this and other reasons.
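To make the trade-off concrete, here is a hedged sketch of the two candidates applied to the snapshot's textual representation; the wrapper names are invented for illustration, but hash_any() and pg_md5_hash() are the in-tree interfaces under discussion.

#include "postgres.h"
#include "access/hash.h"		/* hash_any() */
#include "libpq/md5.h"			/* pg_md5_hash() */

/* hash_any: fast, but the result depends on platform endianness */
static uint32
snapshot_digest_fast(const char *buf, int len)
{
	return DatumGetUInt32(hash_any((const unsigned char *) buf, len));
}

/* MD5: slower, but with a guaranteed, platform-independent external form */
static bool
snapshot_digest_md5(const char *buf, int len, char hexsum[33])
{
	/* hexsum receives 32 hex characters plus the terminating NUL */
	return pg_md5_hash(buf, len, hexsum);
}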
On Sat, Feb 19, 2011 at 9:17 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
The only consideration against MD5 might be
that it would make us look quite lame. We should probably provide
builtin SHA1 and SHA2 functions for this and other reasons.
In this particular case, however, the checksum is never exposed to the
user but stays within shared memory. So nobody would notice that we
are still using lame old md5sum :-)
Joachim
Peter Eisentraut <peter_e@gmx.net> writes:
On Fri, 2011-02-18 at 16:57 -0300, Alvaro Herrera wrote:
2. is md5 the most appropriate digest for this? If you need a
cryptographically secure hash, do we need something stronger? If not,
why not just use hash_any?
MD5 is probably more appropriate than hash_any, because the latter is
optimized for speed and collision avoidance and doesn't have a
guaranteed external format. The only consideration against MD5 might be
that it would make us look quite lame.
Only to people who don't understand whether crypto strength is actually
important in a given use-case.
However ... IIRC, hash_any gives different results on bigendian and
littleendian machines. I'm not sure if a predictable cross-platform
result is important for this use? If you're hashing data containing
native integers, this is a problem anyway.
regards, tom lane
Excerpts from Tom Lane's message of Sat Feb 19 21:26:42 -0300 2011:
However ... IIRC, hash_any gives different results on bigendian and
littleendian machines. I'm not sure if a predictable cross-platform
result is important for this use? If you're hashing data containing
native integers, this is a problem anyway.
The current feature only lets you synchronize snapshots within a cluster
-- you can't ship them to another machine. I don't think dependency on
endianness is a problem here. (But no, the data doesn't contain native
integers -- it's the snapshot's text representation.)
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
A couple more questions:
What's the reason for this restriction?
if (databaseId != MyDatabaseId)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("cannot import snapshot from a different database")));
Why are we using bytea as the output format instead of text? The output
is just plain text anyway, as can be seen by setting bytea_output to
'escape'. Perhaps there would be value in using bytea if we were to
pglz_compress the result, but would there be a point in doing so?
Normally exported info should be short enough that it would cause more
overhead than it would save anyway.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Excerpts from Joachim Wieland's message of Sun Jan 30 14:36:12 -0300 2011:
On Thu, Jan 20, 2011 at 1:37 AM, Noah Misch <noah@leadboat.com> wrote:
Is it valid to scribble directly on snapshots like this?
I figured that previously executed code still has references pointing
to the snapshots and so we cannot just push a new snapshot on top but
really need to change the memory where they are pointing to.
Okay. Had no special reason to believe otherwise, just didn't know.
This is one part where I'd like someone more experienced than me to look into it.
Yeah, I don't like this bit very much. I'm inclined to have the import
routine fill CurrentSnapshot directly (and perhaps ActiveSnapshot too,
not sure about this part yet) instead; I think we need a safety net so
that the new serializable isolation code doesn't get upset if we change
the base snapshot from under it, but I haven't looked at that yet.
Trying to import a snap into a transaction that has committed children
should also fail; otherwise, objects touched during those subxacts would
be in a spooky zone. Also, I want to have importing into a
read-committed transaction fail with an error instead of the current
warning -- just like LOCK TABLE fails if you're not inside a
transaction.
I'm also inclined to have PREPARE TRANSACTION fail if it has an exported
snapshot, instead of just invalidating it. This lets us get rid of the
extra locking added during that operation.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Hi,
On Mon, Feb 21, 2011 at 4:56 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
What's the reason for this restriction?
if (databaseId != MyDatabaseId)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("cannot import snapshot from a different database")));
I just couldn't think of a use case for it and so didn't want to
introduce a feature that we might then have to support in the future.
Why are we using bytea as the output format instead of text? The output
is just plain text anyway, as can be seen by setting bytea_output to
'escape'. Perhaps there would be value in using bytea if we were to
pglz_compress the result, but would there be a point in doing so?
Normally exported info should be short enough that it would cause more
overhead than it would save anyway.
It is bytea because it should be thought of as "just some data". It
should be regarded more as a token than as text and should not be
inspected/interpreted at all. If anybody decides to do so, fine, but
then they should not rely on the stability of the message format for
the future.
Joachim
Joachim Wieland <joe@mcknight.de> writes:
On Mon, Feb 21, 2011 at 4:56 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Why are we using bytea as the output format instead of text?
It is bytea because it should be thought of as "just some data". It
should be regarded more as a token than as text and should not be
inspected/interpreted at all. If anybody decides to do so, fine, but
then they should not rely on the stability of the message format for
the future.
I don't think that passing it as bytea conveys those semantics at all.
It just makes it harder to look at the value for debugging purposes.
Plus you gotta worry about variable formatting/escaping conventions
(bytea_output etc) that could easily be avoided if the value were
promised to be text with no funny characters in it.
regards, tom lane
Alvaro Herrera wrote:
I think we need a safety net so that the new serializable isolation
code doesn't get upset if we change the base snapshot from under
it, but I haven't looked at that yet.
Replacing the snapshot for a serializable transaction after it has
acquired its initial snapshot would quietly allow non-serializable
behavior, I would think. I don't think that setting a snapshot for a
SERIALIZABLE READ ONLY DEFERRABLE transaction makes any sense, since
the point of that is that it waits for a snapshot which meets certain
criteria to be available; setting a snapshot in that mode should
probably just be disallowed. Otherwise, if you set the snapshot
before the transaction acquires one through normal means, I can't
think of any problems -- just make sure you set FirstSnapshotSet.
-Kevin
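As a hedged sketch of the guard Kevin describes above (the function name is invented; the real check would live wherever the import routine installs the snapshot):

#include "postgres.h"
#include "access/xact.h"		/* IsolationIsSerializable() */
#include "utils/snapmgr.h"		/* FirstSnapshotSet */

static void
ImportSnapshotGuard(void)
{
	/* refuse to replace a snapshot the transaction has already acquired */
	if (FirstSnapshotSet && IsolationIsSerializable())
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
				 errmsg("cannot import a snapshot into a transaction that already has one")));
}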
Excerpts from Kevin Grittner's message of Mon Feb 21 18:39:26 -0300 2011:
Alvaro Herrera wrote:
I think we need a safety net so that the new serializable isolation
code doesn't get upset if we change the base snapshot from under
it, but I haven't looked at that yet.
Replacing the snapshot for a serializable transaction after it has
acquired its initial snapshot would quietly allow non-serializable
behavior, I would think. I don't think that setting a snapshot for a
SERIALIZABLE READ ONLY DEFERRABLE transaction makes any sense, since
the point of that is that it waits for a snapshot which meets certain
criteria to be available; setting a snapshot in that mode should
probably just be disallowed. Otherwise, if you set the snapshot
before the transaction acquires one through normal means, I can't
think of any problems -- just make sure you set FirstSnapshotSet.
Actually this seems rather difficult to do, because in order to invoke
the function that imports the snapshot, you have to call SELECT, which
acquires a snapshot beforehand. So when we actually import the
passed-in snapshot, there's already a snapshot set. This is odious but
I don't see a way around that -- other than adding special grammar
support which seems overkill.
I've been thinking of requiring that snapshot to have refcount==1, but
I'm not sure if that works.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
Actually this seems rather difficult to do, because in order to invoke
the function that imports the snapshot, you have to call SELECT, which
acquires a snapshot beforehand. So when we actually import the
passed-in snapshot, there's already a snapshot set. This is odious but
I don't see a way around that -- other than adding special grammar
support which seems overkill.
No, I don't think it is. The alternative is semantics that are
at least exceedingly ugly, and very possibly flat-out broken.
To take just one point, you have no way at all of preventing the
transaction from having done something else using its original
snapshot.
regards, tom lane
Excerpts from Tom Lane's message of Mon Feb 21 21:00:19 -0300 2011:
Alvaro Herrera <alvherre@commandprompt.com> writes:
Actually this seems rather difficult to do, because in order to invoke
the function that imports the snapshot, you have to call SELECT, which
acquires a snapshot beforehand. So when we actually import the
passed-in snapshot, there's already a snapshot set. This is odious but
I don't see a way around that -- other than adding special grammar
support which seems overkill.
No, I don't think it is. The alternative is semantics that are
at least exceedingly ugly, and very possibly flat-out broken.
To take just one point, you have no way at all of preventing the
transaction from having done something else using its original
snapshot.
That's true too. So let's discuss the syntax. Maybe
START TRANSACTION SNAPSHOT '\xdeadbeef';
This kind of extension seems ugly though; maybe we should consider
START TRANSACTION (snapshot='\xdeadbeef');
(like VACUUM, EXPLAIN and COPY) or some such?
I don't think we need another "top-level" command for this.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
That's true too. So let's discuss the syntax. Maybe
START TRANSACTION SNAPSHOT '\xdeadbeef';
This kind of extension seems ugly though; maybe we should consider
START TRANSACTION (snapshot='\xdeadbeef');
(like VACUUM, EXPLAIN and COPY) or some such?
+1 for the latter, mainly because (I think) you could avoid making
SNAPSHOT into an actual parser keyword that way.
I don't think we need another "top-level" command for this.
Mmm ... one possible objection is that if it's hard-wired into the START
TRANSACTION syntax, then there is no way to take any locks before
establishing the snapshot. Right offhand, though, I can't see a
use-case for doing that, given that the snapshot itself would be older
than any locks that the secondary transaction could grab.
I suppose that a separate top-level command might be nicer just to avoid
touching the syntax of START TRANSACTION. If we put the option into
parens as above, there seems little chance of colliding with future spec
extensions, but it's a tad ugly looking.
regards, tom lane
On 19.02.2011 02:41, Joachim Wieland wrote:
On Fri, Feb 18, 2011 at 8:57 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
1. why are you using the expansible char array stuff instead of using
the StringInfo facility?
2. is md5 the most appropriate digest for this? If you need a
cryptographically secure hash, do we need something stronger? If not,
why not just use hash_any?
We don't need a cryptographically secure hash.
Really? The idea of the hash is to prevent you from importing arbitrary
snapshots into the system, allowing only copying snapshots that are in
use by another backend. The purpose of the hash is to make sure the
snapshot the client passes in is identical to one used by another backend.
If you don't use a cryptographically secure hash, it's easy to construct
a snapshot with the same hash as an existing snapshot, with more or less
arbitrary contents.
This also makes me worried:
+
+	/*
+	 * Verify that the checksum matches before doing any more work. We will
+	 * later verify again to make sure that the exporting transaction has not
+	 * yet terminated by then. We don't want to optimize this into just one
+	 * verification call at the very end because the instructions that follow
+	 * this comment rely on a sane format of the textual snapshot data. In
+	 * particular they assume that there are not more XactIds than
+	 * advertised...
+	 */
It sounds like you risk a buffer overflow if the snapshot is bogus,
which makes a collision-resistant hash even more important.
I know this was discussed already, but I don't much like using a hash
like this. We should find a way to just store the whole snapshot in
shared memory, or even a temporary file if we want to skimp on shared
memory. It's generally better to not rely on cryptography when you don't
have to.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 21.02.2011 21:33, Joachim Wieland wrote:
Hi,
On Mon, Feb 21, 2011 at 4:56 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
What's the reason for this restriction?
if (databaseId != MyDatabaseId)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("cannot import snapshot from a different database")));I just couldn't think of a use case for it and so didn't want to
introduce a feature that we might have to support in the future then.
Hmm, here's one: getting a consistent snapshot of all databases in
pg_dumpall. Not sure how useful that is...
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Feb 22, 2011 at 1:59 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Really? The idea of the hash is to prevent you from importing arbitrary
snapshots into the system, allowing only copying snapshots that are in use
by another backend. The purpose of the hash is to make sure the snapshot the
client passes in is identical to one used by another backend.
If that's the purpose of the hash, it is utterly useless. The
computation performed by the backend can easily be replicated by an
adversary.
If you don't use a cryptographically secure hash, it's easy to construct a
snapshot with the same hash as an existing snapshot, with more or less
arbitrary contents.
And if you did use a cryptographically secure hash, it would still be
easy, unless there is some plausible way of keeping the key a secret,
which I doubt.
The original reason why we went with this design (send the snapshot to
the client and then have the client send it back to the server) was
because some people thought it could be useful to take a snapshot from
the master and recreate it on a standby, or perhaps the other way
around. If that's the goal, checking whether the snapshot being
imported is one that's already in use is missing the point. We have
to rely on our ability to detect, when importing the snapshot, any
anomalies that could lead to system instability; and allow people to
import an arbitrary snapshot that meets those constraints, even one
that the user just created out of whole cloth.
Now, as I said before, I think that's a bad idea. I think we should
NOT be designing the system to allow publication of arbitrary
snapshots; instead, the backend that wishes to export its snapshot
should so request, getting back a token (like "1") that other backends
can then pass to START TRANSACTION (SNAPSHOT '1'), or whatever syntax
we end up using. In that design, there's no need to validate
snapshots because they never leave the server's memory space, which
IMHO is as it should be. If someone can develop a convincing argument
that transmitting snapshots between the master and standby is a good
idea, we can still accommodate that: the master-standby connection is
now full-duplex. In fact, we may want to do that anyway for other
reasons (see Kevin Grittner's musings on this point vs. true
serializability).
Another reason I don't like this approach is because it's possible
that the representation of snapshots might change in the future. Tom
mused a while back about using an LSN as a snapshot (snapshots sees
all transactions committed prior to that LSN) and there are other
plausible changes, too. If we expose the internal details of the
snapshot mechanism to users, then (1) it becomes a user-visible
feature with backward compatibility implications if we choose to break
it down the road and (2) all future snapshot representations must be
able to be fully sanity checked on import. Right now it's fairly easy
to verify that the snapshot's xmin isn't too old and that the
remaining XIDs are in a reasonable range and that's probably all we
really need to be certain of, but it's not clear that some future
representation would be equally easy to validate.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 22.02.2011 14:24, Robert Haas wrote:
On Tue, Feb 22, 2011 at 1:59 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
If you don't use a cryptographically secure hash, it's easy to construct a
snapshot with the same hash as an existing snapshot, with more or less
arbitrary contents.
And if you did use a cryptographically secure hash, it would still be
easy, unless there is some plausible way of keeping the key a secret,
which I doubt.
This is hashing, not encryption; there is no key. The point is that even
if the attacker has the hash value and knows the algorithm, he cannot
construct *another* snapshot that has the same hash. A cryptographically
strong hash function has that property, second preimage resistance,
whereas for a regular CRC or similar it's easy to create an arbitrary
input that has any given checksum.
The original reason why we went with this design (send the snapshot to
the client and then have the client send it back to the server) was
because some people thought it could be useful to take a snapshot from
the master and recreate it on a standby, or perhaps the other way
around. If that's the goal, checking whether the snapshot being
imported is one that's already in use is missing the point. We have
to rely on our ability to detect, when importing the snapshot, any
anomalies that could lead to system instability; and allow people to
import an arbitrary snapshot that meets those constraints, even one
that the user just created out of whole cloth.
Yes. It would be good to perform those sanity checks anyway.
Now, as I said before, I think that's a bad idea. I think we should
NOT be designing the system to allow publication of arbitrary
snapshots; instead, the backend that wishes to export its snapshot
should so request, getting back a token (like "1") that other backends
can then pass to START TRANSACTION (SNAPSHOT '1'), or whatever syntax
we end up using.
Well, I'm not sure if we should allow importing arbitrary (as long as
they're sane) snapshots or not; the arguments for and against that are
both compelling.
But even if we don't allow it, there's no harm in sending the whole
snapshot to the client, anyway. Ie. instead of "1" as the identifier,
use the snapshot itself. That leaves the door open for allowing it in
the future, should we choose to do so.
In that design, there's no need to validate
snapshots because they never leave the server's memory space, which
IMHO is as it should be.
I agree that if we don't allow importing arbitrary snapshots, we should
use some out-of-band communication to get the snapshot to the backend.
IOW not rely on a hash.
Another reason I don't like this approach is because it's possible
that the representation of snapshots might change in the future. Tom
mused a while back about using an LSN as a snapshot (snapshots sees
all transactions committed prior to that LSN) and there are other
plausible changes, too. If we expose the internal details of the
snapshot mechanism to users, then (1) it becomes a user-visible
feature with backward compatibility implications if we choose to break
it down the road
Clients should treat the snapshot as an opaque block.
and (2) all future snapshot representations must be
able to be fully sanity checked on import. Right now it's fairly easy
to verify that the snapshot's xmin isn't too old and that the
remaining XIDs are in a reasonable range and that's probably all we
really need to be certain of, but it's not clear that some future
representation would be equally easy to validate.
Perhaps, although I can't imagine a representation that wouldn't be
equally easy to validate.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Feb 22, 2011 at 8:01 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
This is hashing, not encryption; there is no key. The point is that even if
the attacker has the hash value and knows the algorithm, he cannot
construct *another* snapshot that has the same hash.
What good does that do us?
Yes. It would be good to perform those sanity checks anyway.
I don't think it's good; I think it's absolutely necessary. Otherwise
someone can generate arbitrary garbage, hash it, and feed it to us.
No?
But even if we don't allow it, there's no harm in sending the whole snapshot
to the client, anyway. Ie. instead of "1" as the identifier, use the
snapshot itself. That leaves the door open for allowing it in the future,
should we choose to do so.
The door is open either way, AFAICS: we could eventually allow:
BEGIN TRANSACTION (SNAPSHOT '1');
and also
BEGIN TRANSACTION (SNAPSHOT '{xmin 123 xmax 456 xids 128 149 201}');
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 22.02.2011 15:52, Robert Haas wrote:
On Tue, Feb 22, 2011 at 8:01 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Yes. It would be good to perform those sanity checks anyway.
I don't think it's good; I think it's absolutely necessary. Otherwise
someone can generate arbitrary garbage, hash it, and feed it to us.
No?
No, the hash is stored in shared memory. The hash of the garbage has to
match.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Feb 22, 2011 at 8:58 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 22.02.2011 15:52, Robert Haas wrote:
On Tue, Feb 22, 2011 at 8:01 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Yes. It would be good to perform those sanity checks anyway.
I don't think it's good; I think it's absolutely necessary. Otherwise
someone can generate arbitrary garbage, hash it, and feed it to us.
No?
No, the hash is stored in shared memory. The hash of the garbage has to
match.
Oh. Well that's really silly. At that point you might as well just
store the snapshot and an integer identifier in shared memory, right?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 22.02.2011 16:29, Robert Haas wrote:
On Tue, Feb 22, 2011 at 8:58 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 22.02.2011 15:52, Robert Haas wrote:
On Tue, Feb 22, 2011 at 8:01 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Yes. It would be good to perform those sanity checks anyway.
I don't think it's good; I think it's absolutely necessary. Otherwise
someone can generate arbitrary garbage, hash it, and feed it to us.
No?
No, the hash is stored in shared memory. The hash of the garbage has to
match.
Oh. Well that's really silly. At that point you might as well just
store the snapshot and an integer identifier in shared memory, right?
Yes, that's the point I was trying to make. I believe the idea of a hash
was that it takes less memory than storing the whole snapshot (and more
importantly, a fixed amount of memory per snapshot). But I'm not
convinced either that dealing with a hash is any less troublesome.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Feb 22, 2011 at 9:34 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Oh. Well that's really silly. At that point you might as well just
store the snapshot and an integer identifier in shared memory, right?
Yes, that's the point I was trying to make. I believe the idea of a hash was
that it takes less memory than storing the whole snapshot (and more
importantly, a fixed amount of memory per snapshot). But I'm not convinced
either that dealing with a hash is any less troublesome.
OK, sorry for taking a while to get the point.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 22, 2011 at 3:34 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 22.02.2011 16:29, Robert Haas wrote:
On Tue, Feb 22, 2011 at 8:58 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
No, the hash is stored in shared memory. The hash of the garbage has to
match.
Oh. Well that's really silly. At that point you might as well just
store the snapshot and an integer identifier in shared memory, right?
Yes, that's the point I was trying to make. I believe the idea of a hash was
that it takes less memory than storing the whole snapshot (and more
importantly, a fixed amount of memory per snapshot). But I'm not convinced
either that dealing with a hash is any less troublesome.
Both Tom and Robert voted quite explicitly against the
store-in-shared-memory idea. Others don't want to allow people to request
arbitrary snapshots, and still others wanted to pass the snapshot
through the client so that in the future we could also request
snapshots from standby servers. The hash idea seemed to at least make
nobody unhappy.
For easier review, here are a few links to the previous discussion:
http://archives.postgresql.org/pgsql-hackers/2010-12/msg00361.php
http://archives.postgresql.org/pgsql-hackers/2010-12/msg00383.php
http://archives.postgresql.org/pgsql-hackers/2010-12/msg00481.php
http://archives.postgresql.org/pgsql-hackers/2010-12/msg02454.php
Why exactly, Heikki, do you think the hash is more troublesome? And how
could we validate/invalidate snapshots without the checksum (assuming
the through-the-client approach instead of storing the whole snapshot
in shared memory)?
Joachim
On Tue, Feb 22, 2011 at 8:00 PM, Joachim Wieland <joe@mcknight.de> wrote:
Both Tom and Robert voted quite explicitly against the
store-in-shared-memory idea.
No, I voted *for* that approach.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
The current discussion seems to have reached the same conclusion as the
last one: the snapshot info shouldn't leave the server, but be
transmitted internally -- the idea of the temp file seems prevalent.
Here's an attempt at a detailed spec:
By invoking pg_export_snapshot(), a transaction can export its current
transaction snapshot.
To export a snapshot, the transaction creates a temp file containing all
information about the snapshot, including the BackendId and
TransactionId of the creating transaction. The file name is determined
internally by the exporting transaction, and returned by the function
that creates the snapshot. The act of exporting a snapshot causes a
TransactionId to be acquired, if there is none.
Care is taken that no existing snapshot file is ever overwritten, to
prevent security problems. A transaction may export any number of
snapshots.
The "cookie" to be passed around, then, is a temp file identifier. This
cookie gets passed as START TRANSACTION (snapshot='foo'), which sets the
transaction's snapshot to the one determined by the identifier. The
importing transaction must have isolation level serializable or
repeatable read declared on the same START TRANSACTION command; an
attempt to import a snapshot into a read committed transaction or one
with unspecified isolation level causes the transaction to get into
aborted state (so you still need to say ROLLBACK to get out of it; this
is necessary to prevent the session from proceeding involuntarily with the
wrong snapshot).
Similarly, if the snapshot is not available, an error is raised and the
transaction block remains in aborted state. This can happen if the
originating transaction ended after you opened the file, for example.
The BackendId and TransactionId of the exporting transaction let the
importing backend determine the validity of the snapshot beyond the mere
existence of the file.
The snapshot temp file is to be marked for deletion on transaction end.
All snapshot temp files are also deleted on crash recovery, to prevent
dangling files (I need some help here to determine how this plays with
hot standby.)
Privileges: Any role can export or import a snapshot. The rationale
here is that exporting a snapshot doesn't cause any more harm than
holding a transaction open, so if you let your users do that, then
there's no extra harm. Similarly, there's no extra privilege given to a
role that imports a snapshot that cannot be obtained by sending a START
TRANSACTION command at the right time.
Note: this proposal doesn't mention restrictions on having
subtransactions in the exporting transaction. This is because I haven't
figured them out, not because I think there shouldn't be any.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
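To illustrate the spec above, here is a hedged sketch of the file-writing step; the line format, argument list, and function name are assumptions for illustration, not part of the proposal.

#include "postgres.h"
#include "storage/backendid.h"
#include "storage/fd.h"			/* AllocateFile, FreeFile */
#include "utils/snapshot.h"

/* hypothetical writer for one exported snapshot temp file */
static void
write_snapshot_file(const char *path, Snapshot snap,
					BackendId backendId, TransactionId xid)
{
	FILE	   *f = AllocateFile(path, "w");
	int			i;

	if (f == NULL)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not create file \"%s\": %m", path)));

	/* identity of the exporting transaction, for validity checks on import */
	fprintf(f, "bid:%d\nxid:%u\n", backendId, xid);

	/* the snapshot proper */
	fprintf(f, "xmin:%u\nxmax:%u\n", snap->xmin, snap->xmax);
	for (i = 0; i < snap->xcnt; i++)
		fprintf(f, "xip:%u\n", snap->xip[i]);

	FreeFile(f);
}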
On Tue, Feb 22, 2011 at 8:55 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
The current discussion seems to have reached the same conclusion as the
last one: the snapshot info shouldn't leave the server, but be
transmitted internally -- the idea of the temp file seems prevalent.
Here's an attempt at a detailed spec:
By invoking pg_export_snapshot(), a transaction can export its current
transaction snapshot.
To export a snapshot, the transaction creates a temp file containing all
information about the snapshot, including the BackendId and
TransactionId of the creating transaction. The file name is determined
internally by the exporting transaction, and returned by the function
that creates the snapshot. The act of exporting a snapshot causes a
TransactionId to be acquired, if there is none.
Care is taken that no existing snapshot file is ever overwritten, to
prevent security problems. A transaction may export any number of
snapshots.
The "cookie" to be passed around, then, is a temp file identifier. This
cookie gets passed as START TRANSACTION (snapshot='foo'), which sets the
transaction's snapshot to the one determined by the identifier. The
importing transaction must have isolation level serializable or
repeatable read declared on the same START TRANSACTION command; an
attempt to import a snapshot into a read committed transaction or one
with unspecified isolation level causes the transaction to get into
aborted state (so you still need to say ROLLBACK to get out of it; this
is necessary to prevent the session from proceeding involuntarily with the
wrong snapshot).
Similarly, if the snapshot is not available, an error is raised and the
transaction block remains in aborted state. This can happen if the
originating transaction ended after you opened the file, for example.
The BackendId and TransactionId of the exporting transaction let the
importing backend determine the validity of the snapshot beyond the mere
existence of the file.
The snapshot temp file is to be marked for deletion on transaction end.
All snapshot temp files are also deleted on crash recovery, to prevent
dangling files (I need some help here to determine how this plays with
hot standby.)
I think this can be done within StartupXLOG(), right around where we
ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP).
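A hedged sketch of such a cleanup pass (the directory and function names are invented for illustration):

#include "postgres.h"
#include <unistd.h>
#include "storage/fd.h"			/* AllocateDir, ReadDir, FreeDir */

/* hypothetical: remove leftover snapshot temp files during crash recovery */
static void
DeleteAllExportedSnapshotFiles(void)
{
	const char *dirpath = "pg_snapshots";	/* assumed location */
	DIR		   *dir = AllocateDir(dirpath);
	struct dirent *de;
	char		path[MAXPGPATH];

	while ((de = ReadDir(dir, dirpath)) != NULL)
	{
		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;
		snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
		unlink(path);
	}
	FreeDir(dir);
}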
Privileges: Any role can export or import a snapshot. The rationale
here is that exporting a snapshot doesn't cause any more harm than
holding a transaction open, so if you let your users do that, then
there's no extra harm. Similarly, there's no extra privilege given to a
role that imports a snapshot that cannot be obtained by sending a START
TRANSACTION command at the right time.
Agree.
Note: this proposal doesn't mention restrictions on having
subtransactions in the exporting transaction. This is because I haven't
figured them out, not because I think there shouldn't be any.
I think it's probably sufficient to say that a snapshot can only be
exported from the toplevel transaction.
Overall, this proposal is fine with me. But then I wasn't the one who
complained about it last time.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 23.02.2011 03:00, Joachim Wieland wrote:
On Tue, Feb 22, 2011 at 3:34 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Yes, that's the point I was trying to make. I believe the idea of a hash was
that it takes less memory than storing the whole snapshot (and more
importantly, a fixed amount of memory per snapshot). But I'm not convinced
either that dealing with a hash is any less troublesome.
Both Tom and Robert voted quite explicitly against the
store-in-shared-memory idea. Others don't want to allow people to request
arbitrary snapshots, and still others wanted to pass the snapshot
through the client so that in the future we could also request
snapshots from standby servers. The hash idea seemed to at least make
nobody unhappy.
For easier review, here are a few links to the previous discussion:
http://archives.postgresql.org/pgsql-hackers/2010-12/msg00361.php
http://archives.postgresql.org/pgsql-hackers/2010-12/msg00383.php
http://archives.postgresql.org/pgsql-hackers/2010-12/msg00481.php
http://archives.postgresql.org/pgsql-hackers/2010-12/msg02454.php
Why exactly, Heikki, do you think the hash is more troublesome?
It just feels wrong to rely on cryptography just to save some shared memory.
I realize that there are conflicting opinions on this, but from the user's
point of view the hash is just a variant of the idea of passing the
snapshot through shared memory, just implemented in an indirect way.
And how
could we validate/invalidate snapshots without the checksum (assuming
the through-the-client approach instead of storing the whole snapshot
in shared memory)?
Either you accept anything that passes sanity checks, or you store the
whole snapshot in shared memory (or a temp file). I'm not sure which is
better, but they both seem better than the hash.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sun, Feb 27, 2011 at 3:04 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Why exactly, Heikki, do you think the hash is more troublesome?
It just feels wrong to rely on cryptography just to save some shared memory.
Remember that it's not only about saving shared memory, it's also
about making sure that the snapshot reflects a state of the database
that has actually existed at some point in the past. Furthermore, we
can easily invalidate a snapshot that we have published earlier by
deleting its checksum in shared memory as soon as the original
transaction commits/aborts. And for these two a checksum seems to be a
good fit. Saving memory then comes as a benefit and makes all those
happy who don't want to argue about how many slots to reserve in
shared memory or don't want to have another GUC for what will probably
be a low-usage feature.
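For reference, a hedged sketch of the shared-memory side of this scheme; all names are invented and locking is omitted, so treat it as a shape, not the patch's actual layout.

#include "postgres.h"
#include <string.h>

#define SNAPSHOT_CHECKSUM_LEN 16	/* raw MD5 digest */

typedef struct SyncSnapshotSlot
{
	char		checksum[SNAPSHOT_CHECKSUM_LEN];	/* all zeroes = unused */
} SyncSnapshotSlot;

/* one slot per backend; sized by max_connections at shmem init */
static SyncSnapshotSlot *syncSnapshotSlots;

/* publish: remember the digest of the exported snapshot text */
static void
PublishSnapshotChecksum(int slot, const char *digest)
{
	memcpy(syncSnapshotSlots[slot].checksum, digest, SNAPSHOT_CHECKSUM_LEN);
}

/* revoke at commit/abort: importers will no longer find a matching digest */
static void
RevokeSnapshotChecksum(int slot)
{
	memset(syncSnapshotSlots[slot].checksum, 0, SNAPSHOT_CHECKSUM_LEN);
}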
I realize that there are conflicting opinions on this, but from the user's
point of view the hash is just a variant of the idea of passing the snapshot
through shared memory, just implemented in an indirect way.
The user will never see the hash, so why should he bother? The user's
point of view is that he receives data and can obtain the same snapshot if
he passes that data back. This user experience is no different from
any other way of passing the snapshot through the client. And from the
previous discussions this seemed to be what most people wanted.
And how
could we validate/invalidate snapshots without the checksum (assuming
the through-the-client approach instead of storing the whole snapshot
in shared memory)?
Either you accept anything that passes sanity checks, or you store the whole
snapshot in shared memory (or a temp file). I'm not sure which is better,
but they both seem better than the hash.
True, both might work, but I don't see a real technical advantage over
the checksum approach for either of them; rather the opposite.
Nobody has come up with a use case for the accept-anything option so
far, so I don't see why we should commit ourselves to this feature at
this point, given that we have a cheap and easy way of
validating/invalidating snapshots. And I might just be paranoid, but I
also fear that someone could raise security issues over the fact that
you would be able to request an arbitrary database state from the past
and inspect changes from other people's transactions. We might want to
allow that later, though, and I realize that we have to allow it for a
standby server that would take over a snapshot from the master anyway,
but I don't want to add this complexity to this first patch. I do,
however, want to be able to allow this in the future without touching
the external API of the feature.
And as for the tempfile approach, I don't see that the creation and
removal of the temp file involves any less code complexity than flipping a
number in shared memory. Also, it seemed that people would rather
go with the through-the-client approach because it seems to be more
flexible.
Maybe you should just look at it as a conservative accept-anything
approach that uses a checksum to allow only snapshots that we know
have existed and have been published. Does the checksum still look so
weird from this perspective?
Joachim
On Sun, Feb 27, 2011 at 8:33 PM, Joachim Wieland <joe@mcknight.de> wrote:
On Sun, Feb 27, 2011 at 3:04 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Why exactly, Heikki, do you think the hash is more troublesome?
It just feels wrong to rely on cryptography just to save some shared memory.
Remember that it's not only about saving shared memory, it's also
about making sure that the snapshot reflects a state of the database
that has actually existed at some point in the past. Furthermore, we
can easily invalidate a snapshot that we have published earlier by
deleting its checksum in shared memory as soon as the original
transaction commits/aborts. And for these two a checksum seems to be a
good fit. Saving memory then comes as a benefit and makes all those
happy who don't want to argue about how many slots to reserve in
shared memory or don't want to have another GUC for what will probably
be a low-usage feature.
But you can do all of this with files too, can't you? Just remove or
truncate the file when the snapshot is no longer valid.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Sun, Feb 27, 2011 at 8:33 PM, Joachim Wieland <joe@mcknight.de> wrote:
Remember that it's not only about saving shared memory, it's also
about making sure that the snapshot reflects a state of the database
that has actually existed at some point in the past.
But you can do all of this with files too, can't you? Just remove or
truncate the file when the snapshot is no longer valid.
Yeah. I think adopting a solution similar to 2PC state files is a very
reasonable way to go here. This isn't likely to be a high-usage or
performance-critical feature, so it's not essential to keep the
information in shared memory for performance reasons.
regards, tom lane
On Mon, Feb 28, 2011 at 6:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Remember that it's not only about saving shared memory, it's also
about making sure that the snapshot reflects a state of the database
that has actually existed at some point in the past. Furthermore, we
can easily invalidate a snapshot that we have published earlier by
deleting its checksum in shared memory as soon as the original
transaction commits/aborts. And for these two a checksum seems to be a
good fit. Saving memory then comes as a benefit and makes all those
happy who don't want to argue about how many slots to reserve in
shared memory or don't want to have another GUC for what will probably
be a low-usage feature.
But you can do all of this with files too, can't you? Just remove or
truncate the file when the snapshot is no longer valid.
Sure we can, but it looked like the consensus of the first discussion
was that the through-the-client approach was more flexible. But then
again nobody is actively arguing for that anymore.
On Feb 28, 2011, at 11:59 AM, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Sun, Feb 27, 2011 at 8:33 PM, Joachim Wieland <joe@mcknight.de> wrote:
Remember that it's not only about saving shared memory, it's also
about making sure that the snapshot reflects a state of the database
that has actually existed at some point in the past.
But you can do all of this with files too, can't you? Just remove or
truncate the file when the snapshot is no longer valid.
Yeah. I think adopting a solution similar to 2PC state files is a very
reasonable way to go here. This isn't likely to be a high-usage or
performance-critical feature, so it's not essential to keep the
information in shared memory for performance reasons.
Dumb question: Is this something that could be solved by having the postmaster track this information in its local memory and make it available via a variable-sized IPC mechanism, such as a port or socket? That would eliminate the need to clean things up after a crash; I'm not sure if there would be other benefits.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
Jim Nasby <jim@nasby.net> writes:
Dumb question: Is this something that could be solved by having the postmaster track this information in its local memory and make it available via a variable-sized IPC mechanism, such as a port or socket? That would eliminate the need to clean things up after a crash; I'm not sure if there would be other benefits.
Involving the postmaster in this is entirely *not* reasonable. The
postmaster cannot do anything IPC-wise that the stats collector couldn't
do, and every additional function we load onto the postmaster is another
potential source of unrecoverable database-wide failures. The PM is
reliable only because it doesn't do much.
regards, tom lane
On Mar 1, 2011, at 10:54 PM, Tom Lane wrote:
Jim Nasby <jim@nasby.net> writes:
Dumb question: Is this something that could be solved by having the postmaster track this information in its local memory and make it available via a variable-sized IPC mechanism, such as a port or socket? That would eliminate the need to clean things up after a crash; I'm not sure if there would be other benefits.
Involving the postmaster in this is entirely *not* reasonable. The
postmaster cannot do anything IPC-wise that the stats collector couldn't
do, and every additional function we load onto the postmaster is another
potential source of unrecoverable database-wide failures. The PM is
reliable only because it doesn't do much.
Makes sense. Doesn't have to be the postmaster; it could be some other process.
Anyway, I just wanted to throw the idea out as food for thought. I don't know if it'd be better or worse than temp files...
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net