COPY with hints, rebirth
A long time ago, in a galaxy far away, we discussed ways to speed up
data loads/COPY.
http://archives.postgresql.org/pgsql-hackers/2007-01/msg00470.php
In particular, we discussed the idea that we could mark tuples as
committed while we are still loading them, to avoid the performance
hit for the first reader.
A simple patch to implement this is attached, together with a test case.
Current behaviour is shown here
Run COPY and then... SELECT count(*) FROM table with no indexes
1st SELECT Time: 1518.571 ms <--- slowed dramatically by setting hint bits
2nd SELECT Time: 914.141 ms
3rd SELECT Time: 914.921 ms
With this patch I observed the following results
1st SELECT Time: 890.820 ms
2nd SELECT Time: 884.799 ms
3rd SELECT Time: 882.405 ms
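For anyone wanting to reproduce this, the shape of the test is
roughly the following (a sketch only - the table and file names are
invented; the real test case is in the attachment):

  CREATE TABLE t (id int, payload text);
  COPY t FROM '/tmp/t.dat';
  \timing on
  SELECT count(*) FROM t;  -- 1st read: sets hint bits, dirties every page
  SELECT count(*) FROM t;  -- later reads: hint bits already set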
What exactly does it do? Previously, we optimised COPY when it was
loading data into a newly created table or a freshly truncated table.
This patch extends that and actually sets the tuple header flag as
HEAP_XMIN_COMMITTED during the load. Doing so takes just two lines of
code. The patch also adds some checks for corner cases where that
action would break MVCC - though those cases are minor and typical
data loads will benefit fully from this.
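To make the prerequisite concrete, the optimisation applies to loads
of this shape (illustrative only), where the COPY targets a
relfilenode created in the same transaction:

  BEGIN;
  CREATE TABLE t (id int);   -- or: TRUNCATE t;
  COPY t FROM '/tmp/t.dat';
  COMMIT;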
In the link above, Tom suggested reworking HeapTupleSatisfiesMVCC()
and adding the current xid to snapshots. That is an invasive change
that I would wish to avoid at any time, which explains the long delay
in tackling this. The way I've implemented it is as a short test
during XidInMVCCSnapshot(), so that we trap the case where xid ==
xmax and would otherwise appear to be running. This is much less
invasive and just as performant as Tom's original suggestion.
Why do we need this now? Enabling page checksums requires us to
write WAL for hint-bit changes, so the 1st SELECT after a load would
get somewhat worse when page_checksums is enabled - we already know
there is a price. However, this is a situation we can solve, adding
value for all cases, not just when checksums are enabled.
So I'm posting this as a separate patch rather than including that as
a tuning feature of the checksums patch.
Your input will be gratefully received,
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
copy_nohints.v357.patch (text/x-diff)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c910863..3899282 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2056,6 +2056,17 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
tup->t_tableOid = RelationGetRelid(relation);
/*
+ * If we are inserting into a new relation invisible as yet to other
+ * backends and our session has no prior snapshots and no ready portals
+ * then we can also set the hint bit for the rows we are inserting. The
+ * last two restrictions ensure that HeapTupleSatisfiesMVCC() gives
+ * the right answer if the current transaction inspects the data after
+ * we load it.
+ */
+ if (options & HEAP_INSERT_HINT_XMIN)
+ tup->t_data->t_infomask |= HEAP_XMIN_COMMITTED;
+
+ /*
* If the new tuple is too big for storage or contains already toasted
* out-of-line attributes from some other relation, invoke the toaster.
*/
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 110480f..6a60eb8 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -43,6 +43,7 @@
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
+#include "utils/portal.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
@@ -1922,6 +1923,10 @@ CopyFrom(CopyState cstate)
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
hi_options |= HEAP_INSERT_SKIP_WAL;
+
+ if (ThereAreNoPriorRegisteredSnapshots() &&
+ ThereAreNoReadyPortals())
+ hi_options |= HEAP_INSERT_HINT_XMIN;
}
/*
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index cfb73c1..24075db 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -1055,3 +1055,22 @@ pg_cursor(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+bool
+ThereAreNoReadyPortals(void)
+{
+ HASH_SEQ_STATUS status;
+ PortalHashEnt *hentry;
+
+ hash_seq_init(&status, PortalHashTable);
+
+ while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL)
+ {
+ Portal portal = hentry->portal;
+
+ if (portal->status == PORTAL_READY)
+ return false;
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5aebbd1..5d9e3bf 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1183,3 +1183,12 @@ DeleteAllExportedSnapshotFiles(void)
FreeDir(s_dir);
}
+
+bool
+ThereAreNoPriorRegisteredSnapshots(void)
+{
+ if (RegisteredSnapshots <= 1)
+ return true;
+
+ return false;
+}
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 31791a7..a2e5524 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -1256,7 +1256,18 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
if (TransactionIdPrecedes(xid, snapshot->xmin))
return false;
/* Any xid >= xmax is in-progress */
- if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
+ if (TransactionIdFollows(xid, snapshot->xmax))
+ return true;
+ /*
+ * Make sure current xid is never thought to be in progress on tuples
+ * that are already marked as committed - for use in bulk COPY etc.
+ * We never included the current xid in snapshots, but if it happens
+ * to be xmax it could still be returned as in-progress.
+ */
+ if (TransactionIdIsCurrentTransactionId(xid))
+ return false;
+ /* All non-current xids >= xmax are seen as still running */
+ if (xid == snapshot->xmax)
return true;
/*
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index fa38803..0381785 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -26,6 +26,7 @@
/* "options" flag bits for heap_insert */
#define HEAP_INSERT_SKIP_WAL 0x0001
#define HEAP_INSERT_SKIP_FSM 0x0002
+#define HEAP_INSERT_HINT_XMIN 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
diff --git a/src/include/utils/portal.h b/src/include/utils/portal.h
index 4833942..bd2b133 100644
--- a/src/include/utils/portal.h
+++ b/src/include/utils/portal.h
@@ -219,5 +219,6 @@ extern void PortalDefineQuery(Portal portal,
extern Node *PortalListGetPrimaryStmt(List *stmts);
extern void PortalCreateHoldStore(Portal portal);
extern void PortalHashTableDeleteAll(void);
+extern bool ThereAreNoReadyPortals(void);
#endif /* PORTAL_H */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index f195981..789b72e 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -46,5 +46,6 @@ extern Datum pg_export_snapshot(PG_FUNCTION_ARGS);
extern void ImportSnapshot(const char *idstr);
extern bool XactHasExportedSnapshots(void);
extern void DeleteAllExportedSnapshotFiles(void);
+extern bool ThereAreNoPriorRegisteredSnapshots(void);
#endif /* SNAPMGR_H */
Simon Riggs <simon@2ndQuadrant.com> wrote:
This patch extends that and actually sets the tuple header flag as
HEAP_XMIN_COMMITTED during the load.
Fantastic!
So, without bulk-load conditions, a long-lived tuple in PostgreSQL
is written to disk at least five times[1]:
(1) The WAL record for the inserted tuple is written.
(2) The inserted tuple is written.
(3) The HEAP_XMIN_COMMITTED bit is set and the tuple is re-written
in place some time after the inserting transaction's COMMIT.
(4) The WAL record for the "freeze" in write 5 is written.
(5) The xmin is set to frozen and the tuple is rewritten in place
some time after every other connection can see it.
Prior to your patch, bulk load omitted write 1. With your patch we
will also omit write 3.
Since you've just been looking at this area, do you have any
thoughts about writes 4 and 5 being rendered unnecessary by writing
bulk-loaded tuples with a frozen xmin, and having transactions with
a snapshot which doesn't include the bulk load's transaction just
not seeing the table? (Or am I just dreaming about those?)
-Kevin
[1]: If you are archiving, it could be more.
On Sat, Feb 25, 2012 at 6:24 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
Simon Riggs <simon@2ndQuadrant.com> wrote:
This patch extends that and actually sets the tuple header flag as
HEAP_XMIN_COMMITTED during the load.
Fantastic!
So, without bulk-load conditions, a long-lived tuple in PostgreSQL
is written to disk at least five times[1]:
(1) The WAL record for the inserted tuple is written.
(2) The inserted tuple is written.
(3) The HEAP_XMIN_COMMITTED bit is set and the tuple is re-written
in place some time after the inserting transaction's COMMIT.
(4) The WAL record for the "freeze" in write 5 is written.
(5) The xmin is set to frozen and the tuple is rewritten in place
some time after every other connection can see it.
Prior to your patch, bulk load omitted write 1. With your patch we
will also omit write 3.
Yes, well explained.
Since you've just been looking at this area, do you have any
thoughts about writes 4 and 5 being rendered unnecessary by writing
bulk-loaded tuples with a frozen xmin, and having transactions with
a snapshot which doesn't include the bulk load's transaction just
not seeing the table? (Or am I just dreaming about those?)
Setting xmin straight to frozen breaks MVCC, unless/until we use
MVCC for catalog access, because other sessions can see the table
immediately and then read the contents as if they had always been
there.
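(Concretely, the anomaly would look something like this - two
sessions, all names invented:

  S1: BEGIN ISOLATION LEVEL REPEATABLE READ;
  S1: SELECT count(*) FROM other_table;  -- snapshot taken here
  S2: BEGIN; CREATE TABLE t (id int);
  S2: COPY t FROM '/tmp/t.dat';          -- rows written pre-frozen
  S2: COMMIT;
  S1: SELECT * FROM t;                   -- sees all the rows, even though
                                         -- they committed after S1's snapshot
)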
I think we could add that as an option on COPY, since "breaking MVCC"
in that way is only a bad thing if it happens accidentally without the
user's permission and knowledge - which they may wish to give in many
cases, such as reloading a database from a dump.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 24.02.2012 22:55, Simon Riggs wrote:
A long time ago, in a galaxy far away, we discussed ways to speed up
data loads/COPY.
http://archives.postgresql.org/pgsql-hackers/2007-01/msg00470.php
In particular, we discussed the idea that we could mark tuples as
committed while we are still loading them, to avoid the performance
hit for the first reader.
A simple patch to implement this is attached, together with a test case.
...
What exactly does it do? Previously, we optimised COPY when it was
loading data into a newly created table or a freshly truncated table.
This patch extends that and actually sets the tuple header flag as
HEAP_XMIN_COMMITTED during the load. Doing so takes just two lines of
code. The patch also adds some checks for corner cases where that
action would break MVCC - though those cases are minor and typical
data loads will benefit fully from this.
This doesn't work with subtransactions:
postgres=# create table a as select 1 as id;
SELECT 1
postgres=# copy a to '/tmp/a';
COPY 1
postgres=# begin;
BEGIN
postgres=# truncate a;
TRUNCATE TABLE
postgres=# savepoint sp1;
SAVEPOINT
postgres=# copy a from '/tmp/a';
COPY 1
postgres=# select * from a;
id
----
(0 rows)
The query should return the row copied in the same subtransaction.
In the link above, Tom suggested reworking HeapTupleSatisfiesMVCC()
and adding the current xid to snapshots. That is an invasive change
that I would wish to avoid at any time, which explains the long delay
in tackling this. The way I've implemented it is as a short test
during XidInMVCCSnapshot(), so that we trap the case where xid ==
xmax and would otherwise appear to be running. This is much less
invasive and just as performant as Tom's original suggestion.
TransactionIdIsCurrentTransactionId() can be fairly expensive if you
have a lot of subtransactions open...
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sat, Feb 25, 2012 at 6:58 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Sat, Feb 25, 2012 at 6:24 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
Simon Riggs <simon@2ndQuadrant.com> wrote:
This patch extends that and actually sets the tuple header flag as
HEAP_XMIN_COMMITTED during the load.
Fantastic!
I think we could add that as an option on COPY, since "breaking MVCC"
in that way is only a bad thing if it happens accidentally without the
user's permission and knowledge - which they may wish to give in many
cases, such as reloading a database from a dump.
I've coded up COPY FREEZE to do just that.
This version doesn't require any changes to transaction machinery.
But it is very effective at avoiding 4 out of the 5 writes you mention.
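Usage follows the regression tests in the patch - a sketch, with an
invented file name:

  BEGIN;
  TRUNCATE vistest;
  COPY vistest FROM '/tmp/data.csv' CSV FREEZE;
  COMMIT;
  -- rows are written with xmin = FrozenTransactionId and
  -- HEAP_XMIN_COMMITTED already set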
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
copy_freeze.v1.patch (text/x-diff)
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index a73b022..4912449 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -34,6 +34,7 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
FORMAT <replaceable class="parameter">format_name</replaceable>
OIDS [ <replaceable class="parameter">boolean</replaceable> ]
+ FREEZE [ <replaceable class="parameter">boolean</replaceable> ]
DELIMITER '<replaceable class="parameter">delimiter_character</replaceable>'
NULL '<replaceable class="parameter">null_string</replaceable>'
HEADER [ <replaceable class="parameter">boolean</replaceable> ]
@@ -181,6 +182,23 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
</varlistentry>
<varlistentry>
+ <term><literal>FREEZE</literal></term>
+ <listitem>
+ <para>
+ Specifies copying the data with rows already frozen, just as they
+ would be after running the <command>VACUUM FREEZE</> command.
+ This only occurs if the table being loaded has been created in this
+ subtransaction, there are no cursors open and there are no older
+ snapshots held by this transaction.
+ Note that all sessions will immediately be able to see the data
+ if it has been successfully loaded, which specifically operates
+ outside of the normal definitions of MVCC. This is a performance
+ option intended for use only during initial data loads.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
<term><literal>DELIMITER</literal></term>
<listitem>
<para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c910863..68534bf 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2050,7 +2050,13 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
tup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
tup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
tup->t_data->t_infomask |= HEAP_XMAX_INVALID;
- HeapTupleHeaderSetXmin(tup->t_data, xid);
+ if (options & HEAP_INSERT_FREEZE)
+ {
+ HeapTupleHeaderSetXmin(tup->t_data, FrozenTransactionId);
+ tup->t_data->t_infomask |= HEAP_XMIN_COMMITTED;
+ }
+ else
+ HeapTupleHeaderSetXmin(tup->t_data, xid);
HeapTupleHeaderSetCmin(tup->t_data, cid);
HeapTupleHeaderSetXmax(tup->t_data, 0); /* for cleanliness */
tup->t_tableOid = RelationGetRelid(relation);
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 110480f..ec0bed8 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -43,6 +43,7 @@
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
+#include "utils/portal.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
@@ -108,6 +109,7 @@ typedef struct CopyStateData
char *filename; /* filename, or NULL for STDIN/STDOUT */
bool binary; /* binary format? */
bool oids; /* include OIDs? */
+ bool freeze; /* freeze rows on loading? */
bool csv_mode; /* Comma Separated Value format? */
bool header_line; /* CSV header line? */
char *null_print; /* NULL marker string (server encoding!) */
@@ -889,6 +891,14 @@ ProcessCopyOptions(CopyState cstate,
errmsg("conflicting or redundant options")));
cstate->oids = defGetBoolean(defel);
}
+ else if (strcmp(defel->defname, "freeze") == 0)
+ {
+ if (cstate->freeze)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("conflicting or redundant options")));
+ cstate->freeze = defGetBoolean(defel);
+ }
else if (strcmp(defel->defname, "delimiter") == 0)
{
if (cstate->delim)
@@ -1922,6 +1932,11 @@ CopyFrom(CopyState cstate)
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
hi_options |= HEAP_INSERT_SKIP_WAL;
+
+ if (cstate->freeze &&
+ ThereAreNoPriorRegisteredSnapshots() &&
+ ThereAreNoReadyPortals())
+ hi_options |= HEAP_INSERT_FREEZE;
}
/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d1ce2ab..cd2b243 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2315,6 +2315,10 @@ copy_opt_item:
{
$$ = makeDefElem("oids", (Node *)makeInteger(TRUE));
}
+ | FREEZE
+ {
+ $$ = makeDefElem("freeze", (Node *)makeInteger(TRUE));
+ }
| DELIMITER opt_as Sconst
{
$$ = makeDefElem("delimiter", (Node *)makeString($3));
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index cfb73c1..24075db 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -1055,3 +1055,22 @@ pg_cursor(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+bool
+ThereAreNoReadyPortals(void)
+{
+ HASH_SEQ_STATUS status;
+ PortalHashEnt *hentry;
+
+ hash_seq_init(&status, PortalHashTable);
+
+ while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL)
+ {
+ Portal portal = hentry->portal;
+
+ if (portal->status == PORTAL_READY)
+ return false;
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5aebbd1..5d9e3bf 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1183,3 +1183,12 @@ DeleteAllExportedSnapshotFiles(void)
FreeDir(s_dir);
}
+
+bool
+ThereAreNoPriorRegisteredSnapshots(void)
+{
+ if (RegisteredSnapshots <= 1)
+ return true;
+
+ return false;
+}
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index fa38803..61472f2 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -26,6 +26,7 @@
/* "options" flag bits for heap_insert */
#define HEAP_INSERT_SKIP_WAL 0x0001
#define HEAP_INSERT_SKIP_FSM 0x0002
+#define HEAP_INSERT_FREEZE 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
diff --git a/src/include/utils/portal.h b/src/include/utils/portal.h
index 4833942..bd2b133 100644
--- a/src/include/utils/portal.h
+++ b/src/include/utils/portal.h
@@ -219,5 +219,6 @@ extern void PortalDefineQuery(Portal portal,
extern Node *PortalListGetPrimaryStmt(List *stmts);
extern void PortalCreateHoldStore(Portal portal);
extern void PortalHashTableDeleteAll(void);
+extern bool ThereAreNoReadyPortals(void);
#endif /* PORTAL_H */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index f195981..789b72e 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -46,5 +46,6 @@ extern Datum pg_export_snapshot(PG_FUNCTION_ARGS);
extern void ImportSnapshot(const char *idstr);
extern bool XactHasExportedSnapshots(void);
extern void DeleteAllExportedSnapshotFiles(void);
+extern bool ThereAreNoPriorRegisteredSnapshots(void);
#endif /* SNAPMGR_H */
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 8e2bc0c..ad2c24c 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -239,6 +239,53 @@ a\.
\.b
c\.d
"\."
+CREATE TABLE vistest (LIKE testeoc);
+BEGIN;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+SELECT * FROM vistest;
+ a
+---
+ a
+ b
+ c
+(3 rows)
+
+SAVEPOINT s1;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+SELECT * FROM vistest;
+ a
+---
+ d
+ e
+ f
+(3 rows)
+
+COMMIT;
+BEGIN;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV FREEZE;
+SELECT * FROM vistest;
+ a
+---
+ a
+ b
+ c
+(3 rows)
+
+SAVEPOINT s1;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV FREEZE;
+SELECT * FROM vistest;
+ a
+---
+ d
+ e
+ f
+(3 rows)
+
+COMMIT;
DROP TABLE x, y;
DROP FUNCTION fn_x_before();
DROP FUNCTION fn_x_after();
diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql
index 6322c8f..4da6452 100644
--- a/src/test/regress/sql/copy2.sql
+++ b/src/test/regress/sql/copy2.sql
@@ -164,6 +164,43 @@ c\.d
COPY testeoc TO stdout CSV;
+CREATE TABLE vistest (LIKE testeoc);
+BEGIN;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+a
+b
+c
+\.
+SELECT * FROM vistest;
+SAVEPOINT s1;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+d
+e
+f
+\.
+SELECT * FROM vistest;
+COMMIT;
+
+BEGIN;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV FREEZE;
+a
+b
+c
+\.
+SELECT * FROM vistest;
+SAVEPOINT s1;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV FREEZE;
+d
+e
+f
+\.
+SELECT * FROM vistest;
+COMMIT;
+
DROP TABLE x, y;
DROP FUNCTION fn_x_before();
DROP FUNCTION fn_x_after();
On Wed, Feb 29, 2012 at 11:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
But it is very effective at avoiding 4 out of the 5 writes you mention.
For the "common case," would we not want to have (1) [WAL] and (2)
[Writing the pre-frozen tuple]?
If we only write the tuple (2), and don't capture WAL, then the COPY
wouldn't be replicable, right?
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
On Wed, Feb 29, 2012 at 6:14 PM, Christopher Browne <cbbrowne@gmail.com> wrote:
On Wed, Feb 29, 2012 at 11:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
But it is very effective at avoiding 4 out of the 5 writes you mention.
For the "common case," would we not want to have (1) [WAL] and (2)
[Writing the pre-frozen tuple]?
If we only write the tuple (2), and don't capture WAL, then the COPY
wouldn't be replicable, right?
Well, my answer is a question: how would you like it to work?
The way I coded it is that it will still write WAL if wal_level
requires it, so it would be replicable. The optimisation only applies
when writing to a newly created table, but that is otherwise separate
from whether WAL is skipped.
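(Sketching the behaviour implied by the existing XLogIsNeeded()
test, with FREEZE applying independently in both cases:

  wal_level = minimal               -> WAL for the new-table COPY is skipped
  wal_level = archive/hot_standby   -> WAL is written, so the load replicates
)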
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Feb 26, 2012 at 7:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 24.02.2012 22:55, Simon Riggs wrote:
A long time ago, in a galaxy far away, we discussed ways to speed up
data loads/COPY.
http://archives.postgresql.org/pgsql-hackers/2007-01/msg00470.php
In particular, we discussed the idea that we could mark tuples as
committed while we are still loading them, to avoid the performance
hit for the first reader.
A simple patch to implement this is attached, together with a test case.
...
What exactly does it do? Previously, we optimised COPY when it was
loading data into a newly created table or a freshly truncated table.
This patch extends that and actually sets the tuple header flag as
HEAP_XMIN_COMMITTED during the load. Doing so takes just two lines of
code. The patch also adds some checks for corner cases where that
action would break MVCC - though those cases are minor and typical
data loads will benefit fully from this.
This doesn't work with subtransactions:
...
The query should return the row copied in the same subtransaction.
Thanks for pointing that out.
New patch with corrected logic and test case attached.
In the link above, Tom suggested reworking HeapTupleSatisfiesMVCC()
and adding the current xid to snapshots. That is an invasive change
that I would wish to avoid at any time, which explains the long delay
in tackling this. The way I've implemented it is as a short test
during XidInMVCCSnapshot(), so that we trap the case where xid ==
xmax and would otherwise appear to be running. This is much less
invasive and just as performant as Tom's original suggestion.
TransactionIdIsCurrentTransactionId() can be fairly expensive if you
have a lot of subtransactions open...
I've put in something to avoid that cost for the common case - just a boolean.
This seems like a better plan than the explicit FREEZE option.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
copy_nohints.v358.patch (text/x-diff)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c910863..3899282 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2056,6 +2056,17 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
tup->t_tableOid = RelationGetRelid(relation);
/*
+ * If we are inserting into a new relation invisible as yet to other
+ * backends and our session has no prior snapshots and no ready portals
+ * then we can also set the hint bit for the rows we are inserting. The
+ * last two restrictions ensure that HeapTupleSatisfiesMVCC() gives
+ * the right answer if the current transaction inspects the data after
+ * we load it.
+ */
+ if (options & HEAP_INSERT_HINT_XMIN)
+ tup->t_data->t_infomask |= HEAP_XMIN_COMMITTED;
+
+ /*
* If the new tuple is too big for storage or contains already toasted
* out-of-line attributes from some other relation, invoke the toaster.
*/
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e22bdac..de7504c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -146,6 +146,7 @@ typedef struct TransactionStateData
int prevSecContext; /* previous SecurityRestrictionContext */
bool prevXactReadOnly; /* entry-time xact r/o state */
bool startedInRecovery; /* did we start in recovery? */
+ bool maySeePreHintedTuples; /* did subtrans write pre-hinted rows? */
struct TransactionStateData *parent; /* back link to parent */
} TransactionStateData;
@@ -175,6 +176,7 @@ static TransactionStateData TopTransactionStateData = {
0, /* previous SecurityRestrictionContext */
false, /* entry-time xact r/o state */
false, /* startedInRecovery */
+ false, /* maySeePreHintedTuples */
NULL /* link to parent state block */
};
@@ -704,6 +706,18 @@ TransactionStartedDuringRecovery(void)
return CurrentTransactionState->startedInRecovery;
}
+bool
+TransactionMaySeePreHintedTuples(void)
+{
+ return CurrentTransactionState->maySeePreHintedTuples;
+}
+
+void
+TransactionMayWritePreHintedTuples(void)
+{
+ CurrentTransactionState->maySeePreHintedTuples = true;
+}
+
/*
* CommandCounterIncrement
*/
@@ -1689,6 +1703,7 @@ StartTransaction(void)
s->startedInRecovery = false;
XactReadOnly = DefaultXactReadOnly;
}
+ s->maySeePreHintedTuples = false;
XactDeferrable = DefaultXactDeferrable;
XactIsoLevel = DefaultXactIsoLevel;
forceSyncCommit = false;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 110480f..1419e33 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -43,6 +43,7 @@
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
+#include "utils/portal.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
@@ -1922,6 +1923,13 @@ CopyFrom(CopyState cstate)
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
hi_options |= HEAP_INSERT_SKIP_WAL;
+
+ if (ThereAreNoPriorRegisteredSnapshots() &&
+ ThereAreNoReadyPortals())
+ {
+ hi_options |= HEAP_INSERT_HINT_XMIN;
+ TransactionMayWritePreHintedTuples();
+ }
}
/*
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index cfb73c1..24075db 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -1055,3 +1055,22 @@ pg_cursor(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+bool
+ThereAreNoReadyPortals(void)
+{
+ HASH_SEQ_STATUS status;
+ PortalHashEnt *hentry;
+
+ hash_seq_init(&status, PortalHashTable);
+
+ while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL)
+ {
+ Portal portal = hentry->portal;
+
+ if (portal->status == PORTAL_READY)
+ return false;
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5aebbd1..5d9e3bf 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1183,3 +1183,12 @@ DeleteAllExportedSnapshotFiles(void)
FreeDir(s_dir);
}
+
+bool
+ThereAreNoPriorRegisteredSnapshots(void)
+{
+ if (RegisteredSnapshots <= 1)
+ return true;
+
+ return false;
+}
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 31791a7..6cb4152 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -1255,6 +1255,28 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
/* Any xid < xmin is not in-progress */
if (TransactionIdPrecedes(xid, snapshot->xmin))
return false;
+
+ /*
+ * TransactionIdIsCurrentTransactionId() could be expensive so avoid
+ * running it unless we know that the current subtransaction has
+ * decided to write tuples that already have HEAP_XMIN_COMMITTED set.
+ * In that case we end up here when we run HeapTupleSatisfiesMVCC.
+ */
+ if (TransactionMaySeePreHintedTuples())
+ {
+ /*
+ * Make sure current xid is never thought to be in progress on tuples
+ * that are already marked as committed - for use in bulk COPY etc.
+ * We never included the current xid in snapshots, but it's possible
+ * that the current xid is equal to or greater than xmax, so we
+ * check for the current xid before we check for xmax.
+ * We could optimise this further by stopping the search as soon
+ * as the xid is less than xmax, but that seems like a step too far.
+ */
+ if (TransactionIdIsCurrentTransactionId(xid))
+ return false;
+ }
+
/* Any xid >= xmax is in-progress */
if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
return true;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index fa38803..0381785 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -26,6 +26,7 @@
/* "options" flag bits for heap_insert */
#define HEAP_INSERT_SKIP_WAL 0x0001
#define HEAP_INSERT_SKIP_FSM 0x0002
+#define HEAP_INSERT_HINT_XMIN 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 20e344e..7fb0123 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -217,6 +217,8 @@ extern TimestampTz GetCurrentTransactionStopTimestamp(void);
extern void SetCurrentStatementStartTimestamp(void);
extern int GetCurrentTransactionNestLevel(void);
extern bool TransactionIdIsCurrentTransactionId(TransactionId xid);
+extern bool TransactionMaySeePreHintedTuples(void);
+extern void TransactionMayWritePreHintedTuples(void);
extern void CommandCounterIncrement(void);
extern void ForceSyncCommit(void);
extern void StartTransactionCommand(void);
diff --git a/src/include/utils/portal.h b/src/include/utils/portal.h
index 4833942..bd2b133 100644
--- a/src/include/utils/portal.h
+++ b/src/include/utils/portal.h
@@ -219,5 +219,6 @@ extern void PortalDefineQuery(Portal portal,
extern Node *PortalListGetPrimaryStmt(List *stmts);
extern void PortalCreateHoldStore(Portal portal);
extern void PortalHashTableDeleteAll(void);
+extern bool ThereAreNoReadyPortals(void);
#endif /* PORTAL_H */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index f195981..789b72e 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -46,5 +46,6 @@ extern Datum pg_export_snapshot(PG_FUNCTION_ARGS);
extern void ImportSnapshot(const char *idstr);
extern bool XactHasExportedSnapshots(void);
extern void DeleteAllExportedSnapshotFiles(void);
+extern bool ThereAreNoPriorRegisteredSnapshots(void);
#endif /* SNAPMGR_H */
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 8e2bc0c..7461529 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -239,6 +239,49 @@ a\.
\.b
c\.d
"\."
+CREATE TABLE vistest (LIKE testeoc);
+BEGIN;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+SELECT * FROM vistest;
+ a
+---
+ a
+ b
+ c
+(3 rows)
+
+SAVEPOINT s1;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+SELECT * FROM vistest;
+ a
+---
+ d
+ e
+ f
+(3 rows)
+
+SAVEPOINT s2;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+SELECT * FROM vistest;
+ a
+---
+ x
+ y
+ z
+(3 rows)
+
+COMMIT;
+SELECT * FROM vistest;
+ a
+---
+ x
+ y
+ z
+(3 rows)
+
DROP TABLE x, y;
DROP FUNCTION fn_x_before();
DROP FUNCTION fn_x_after();
diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql
index 6322c8f..a1a8d97 100644
--- a/src/test/regress/sql/copy2.sql
+++ b/src/test/regress/sql/copy2.sql
@@ -164,6 +164,34 @@ c\.d
COPY testeoc TO stdout CSV;
+CREATE TABLE vistest (LIKE testeoc);
+BEGIN;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+a
+b
+c
+\.
+SELECT * FROM vistest;
+SAVEPOINT s1;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+d
+e
+f
+\.
+SELECT * FROM vistest;
+SAVEPOINT s2;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+x
+y
+z
+\.
+SELECT * FROM vistest;
+COMMIT;
+SELECT * FROM vistest;
+
DROP TABLE x, y;
DROP FUNCTION fn_x_before();
DROP FUNCTION fn_x_after();
On 01.03.2012 18:40, Simon Riggs wrote:
On Sun, Feb 26, 2012 at 7:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 24.02.2012 22:55, Simon Riggs wrote:
What exactly does it do? Previously, we optimised COPY when it was
loading data into a newly created table or a freshly truncated table.
This patch extends that and actually sets the tuple header flag as
HEAP_XMIN_COMMITTED during the load. Doing so takes just two lines of
code. The patch also adds some checks for corner cases where that
action would break MVCC - though those cases are minor and typical
data loads will benefit fully from this.
This doesn't work with subtransactions:
...
The query should return the row copied in the same subtransaction.
Thanks for pointing that out.
New patch with corrected logic and test case attached.
It's still broken:
-- create test table and file
create table a as select 1 as id;
copy a to '/tmp/a';
-- start test
postgres=# begin;
BEGIN
postgres=# truncate a;
TRUNCATE TABLE
postgres=# savepoint sp1;
SAVEPOINT
postgres=# copy a from '/tmp/a';
COPY 1
postgres=# select * from a;
id
----
1
(1 row)
postgres=# rollback to savepoint sp1;
ROLLBACK
postgres=# select * from a;
id
----
1
(1 row)
That last select should not have seen the tuple.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, Mar 1, 2012 at 8:49 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 01.03.2012 18:40, Simon Riggs wrote:
On Sun, Feb 26, 2012 at 7:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 24.02.2012 22:55, Simon Riggs wrote:
What exactly does it do? Previously, we optimised COPY when it was
loading data into a newly created table or a freshly truncated table.
This patch extends that and actually sets the tuple header flag as
HEAP_XMIN_COMMITTED during the load. Doing so takes just two lines of
code. The patch also adds some checks for corner cases where that
action would break MVCC - though those cases are minor and typical
data loads will benefit fully from this.
This doesn't work with subtransactions:
...
The query should return the row copied in the same subtransaction.
Thanks for pointing that out.
New patch with corrected logic and test case attached.
It's still broken:
Agreed. Thanks for your detailed input.
Performance test results show that playing with XidInMVCCSnapshot()
causes a small but growing performance regression that I find
unacceptable, so while I might fix the above, I doubt if I can improve
this as well.
So this approach isn't the one...
The COPY FREEZE patch provides a way for the user to say explicitly
that they don't really care about these MVCC corner cases and as a
result allows us to avoid touching XidInMVCCSnapshot() at all. So
there is still a patch on the table.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Mar 02, 2012 at 08:46:45AM +0000, Simon Riggs wrote:
On Thu, Mar 1, 2012 at 8:49 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
It's still broken:
[BEGIN;TRUNCATE;SAVEPOINT;COPY;ROLLBACK TO]
So this approach isn't the one...
The COPY FREEZE patch provides a way for the user to say explicitly
that they don't really care about these MVCC corner cases and as a
result allows us to avoid touching XidInMVCCSnapshot() at all. So
there is still a patch on the table.
You can salvage the optimization by tightening its prerequisite: use it when
the current subtransaction or a child thereof created or truncated the table.
A parent subtransaction having done so is acceptable for the WAL avoidance
optimization but not for this.
A COPY FREEZE ignoring that stronger restriction would be an interesting
feature in its own right, but it brings up other problems. For example,
suppose you write a tuple, then fail while writing its index entries. The
tuple is already frozen and visible, but it lacks a full set of live index
entries. The subtransaction aborts, but the top transaction commits and makes
the situation permanent.
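(A sketch of that hazard, with invented names and assuming the
explicit-FREEZE syntax from the earlier patch:

  BEGIN;
  CREATE TABLE t (id int PRIMARY KEY);
  SAVEPOINT sp;
  COPY t FROM '/tmp/t.dat' FREEZE;  -- suppose this errors out after writing
                                    -- a frozen heap tuple but before its
                                    -- index entry
  ROLLBACK TO sp;
  COMMIT;   -- the frozen, already-visible tuple survives without a
            -- matching index entry
)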
Incidentally, I contend that we should write frozen tuples to new/truncated
tables unconditionally. The current behavior of making old snapshots see the
table as empty violates atomicity at least as badly as letting those snapshots
see the future-snapshot contents. But Marti has a sound proposal that would
interact with your efforts here to avoid violating atomicity at all:
http://archives.postgresql.org/message-id/CABRT9RBRMdsoz8KxgeHfb4LG-ev9u67-6DLqvoiibpkKhTLQfw@mail.gmail.com
Thanks,
nm
Noah Misch <noah@leadboat.com> wrote:
Incidentally, I contend that we should write frozen tuples to
new/truncated tables unconditionally.
+1
The current behavior of making old snapshots see the table as
empty violates atomicity at least as badly as letting those
snapshots see the future-snapshot contents.
Right, there was no point where the table existed as empty at the
end of a transaction, so it is quite broken as things stand now.
But Marti has a sound proposal that would interact with your
efforts here to avoid violating atomicity at all
Well, getting it right is certainly better than moving from a slow
non-conforming behavior to a fast non-conforming behavior. ;-)
-Kevin
On Fri, Mar 2, 2012 at 8:58 PM, Noah Misch <noah@leadboat.com> wrote:
On Fri, Mar 02, 2012 at 08:46:45AM +0000, Simon Riggs wrote:
On Thu, Mar 1, 2012 at 8:49 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
It's still broken:
[BEGIN;TRUNCATE;SAVEPOINT;COPY;ROLLBACK TO]
So this approach isn't the one...
The COPY FREEZE patch provides a way for the user to say explicitly
that they don't really care about these MVCC corner cases and as a
result allows us to avoid touching XidInMVCCSnapshot() at all. So
there is still a patch on the table.
You can salvage the optimization by tightening its prerequisite: use it when
the current subtransaction or a child thereof created or truncated the table.
A parent subtransaction having done so is acceptable for the WAL avoidance
optimization but not for this.
That's already what it does. The problem is what happens after the COPY.
If we said "you can't see the rows you just loaded" it would solve
the problem, but in my opinion that needs an explicit request from
the user to show they accept that.
A COPY FREEZE ignoring that stronger restriction would be an interesting
feature in its own right, but it brings up other problems. For example,
suppose you write a tuple, then fail while writing its index entries. The
tuple is already frozen and visible, but it lacks a full set of live index
entries. The subtransaction aborts, but the top transaction commits and makes
the situation permanent.
The optimisation only works in the subtransaction that created the
relfilenode, so that isn't an issue.
Thanks for your input.
Incidentally, I contend that we should write frozen tuples to new/truncated
tables unconditionally. The current behavior of making old snapshots see the
table as empty violates atomicity at least as badly as letting those snapshots
see the future-snapshot contents. But Marti has a sound proposal that would
interact with your efforts here to avoid violating atomicity at all:
http://archives.postgresql.org/message-id/CABRT9RBRMdsoz8KxgeHfb4LG-ev9u67-6DLqvoiibpkKhTLQfw@mail.gmail.com
Personally, that seems a pretty reasonable thing.
I like Marti's idea. At present, making it work could easily mean
the checksums patch sinks, so I'm not sure whether to attempt to
work that out in detail.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> wrote:
I like Marti's idea. At present, making it work could easily mean
the checksums patch sinks, so I'm not sure whether to attempt to
work that out in detail.
For my part, improving bulk load performance and TRUNCATE
transactional semantics would trump checksums. If we have to defer
one or the other to a later release, I would prefer that we bump
checksums.
While I understand the desire for checksums, and understand others
have had situations where they would have helped, so far I have yet
to run into a situation where I feel they would have helped me. The
hint bit and freeze issues require ongoing attention and have real
performance consequences on a regular basis. And closing the window
for odd transactional semantics on TRUNCATE versus DELETE of all
rows in a table would be one less thing to worry about, in my world.
-Kevin
On Fri, Mar 2, 2012 at 8:58 PM, Noah Misch <noah@leadboat.com> wrote:
On Fri, Mar 02, 2012 at 08:46:45AM +0000, Simon Riggs wrote:
On Thu, Mar 1, 2012 at 8:49 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
It's still broken:
[BEGIN;TRUNCATE;SAVEPOINT;COPY;ROLLBACK TO]
So this approach isn't the one...
The COPY FREEZE patch provides a way for the user to say explicitly
that they don't really care about these MVCC corner cases and as a
result allows us to avoid touching XidInMVCCSnapshot() at all. So
there is still a patch on the table.
You can salvage the optimization by tightening its prerequisite: use it when
the current subtransaction or a child thereof created or truncated the table.
A parent subtransaction having done so is acceptable for the WAL avoidance
optimization but not for this.
I misread your earlier comment. Yes, that will make this work correctly.
Incidentally, I contend that we should write frozen tuples to new/truncated
tables unconditionally. The current behavior of making old snapshots see the
table as empty violates atomicity at least as badly as letting those snapshots
see the future-snapshot contents. But Marti has a sound proposal that would
interact with your efforts here to avoid violating atomicity at all:
http://archives.postgresql.org/message-id/CABRT9RBRMdsoz8KxgeHfb4LG-ev9u67-6DLqvoiibpkKhTLQfw@mail.gmail.com
Thank you for bringing this to my attention.
This will make this optimisation work correctly without adding
anything to the main code path of XidInMVCCSnapshot() and without the
annoying FREEZE syntax. So this puts the patch squarely back on the
table.
I'll do another version of this later today designed to work with the
StrictMVCC patch.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Mar 3, 2012 at 1:14 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Fri, Mar 2, 2012 at 8:58 PM, Noah Misch <noah@leadboat.com> wrote:
On Fri, Mar 02, 2012 at 08:46:45AM +0000, Simon Riggs wrote:
On Thu, Mar 1, 2012 at 8:49 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
It's still broken:
[BEGIN;TRUNCATE;SAVEPOINT;COPY;ROLLBACK TO]
So this approach isn't the one...
The COPY FREEZE patch provides a way for the user to say explicitly
that they don't really care about these MVCC corner cases and as a
result allows us to avoid touching XidInMVCCSnapshot() at all. So
there is still a patch on the table.
You can salvage the optimization by tightening its prerequisite: use it when
the current subtransaction or a child thereof created or truncated the table.
A parent subtransaction having done so is acceptable for the WAL avoidance
optimization but not for this.
I misread your earlier comment. Yes, that will make this work correctly.
Incidentally, I contend that we should write frozen tuples to new/truncated
tables unconditionally. The current behavior of making old snapshots see the
table as empty violates atomicity at least as badly as letting those snapshots
see the future-snapshot contents. But Marti has a sound proposal that would
interact with your efforts here to avoid violating atomicity at all:
http://archives.postgresql.org/message-id/CABRT9RBRMdsoz8KxgeHfb4LG-ev9u67-6DLqvoiibpkKhTLQfw@mail.gmail.com
Thank you for bringing this to my attention.
This will make this optimisation work correctly without adding
anything to the main code path of XidInMVCCSnapshot() and without the
annoying FREEZE syntax. So this puts the patch squarely back on the
table.
I'll do another version of this later today designed to work with the
StrictMVCC patch.
OK, so here is the latest version of the patch, handling the
subtransaction issues correctly. Thanks to Noah and Heikki for
identifying them and hitting me with the clue stick.
This version of the patch has negligible effect on mainline
performance, though it leaves newly loaded data visible in ways that
would break MVCC. So the patch relies on the safe_truncate.v2.patch
posted on a separate thread.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
copy_autofrozen.v359.patch (text/x-diff)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c910863..b891327 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2050,12 +2050,27 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
tup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
tup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
tup->t_data->t_infomask |= HEAP_XMAX_INVALID;
- HeapTupleHeaderSetXmin(tup->t_data, xid);
HeapTupleHeaderSetCmin(tup->t_data, cid);
HeapTupleHeaderSetXmax(tup->t_data, 0); /* for cleanliness */
tup->t_tableOid = RelationGetRelid(relation);
/*
+ * If we are inserting into a new relation invisible as yet to other
+ * backends and our session has no prior snapshots and no ready portals
+ * then we can also pre-freeze the rows we are inserting. The
+ * last two restrictions ensure that HeapTupleSatisfiesMVCC() gives
+ * the right answer if the current transaction inspects the data after
+ * we load it.
+ */
+ if (options & HEAP_INSERT_HINT_XMIN)
+ {
+ tup->t_data->t_infomask |= HEAP_XMIN_COMMITTED;
+ HeapTupleHeaderSetXmin(tup->t_data, FrozenTransactionId);
+ }
+ else
+ HeapTupleHeaderSetXmin(tup->t_data, xid);
+
+ /*
* If the new tuple is too big for storage or contains already toasted
* out-of-line attributes from some other relation, invoke the toaster.
*/
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 110480f..a34f072 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -43,6 +43,7 @@
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
+#include "utils/portal.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
@@ -1922,6 +1923,21 @@ CopyFrom(CopyState cstate)
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
hi_options |= HEAP_INSERT_SKIP_WAL;
+
+ if (ThereAreNoPriorRegisteredSnapshots() &&
+ ThereAreNoReadyPortals())
+ {
+ SubTransactionId currxid = GetCurrentSubTransactionId();
+
+ /*
+ * If the relfilenode has been created in this subtransaction
+ * then we can further optimise the data load by setting hint
+ * bits and pre-freezing tuples.
+ */
+ if (cstate->rel->rd_createSubid == currxid ||
+ cstate->rel->rd_newRelfilenodeSubid == currxid)
+ hi_options |= HEAP_INSERT_HINT_XMIN;
+ }
}
/*
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index cfb73c1..24075db 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -1055,3 +1055,22 @@ pg_cursor(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+bool
+ThereAreNoReadyPortals(void)
+{
+ HASH_SEQ_STATUS status;
+ PortalHashEnt *hentry;
+
+ hash_seq_init(&status, PortalHashTable);
+
+ while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL)
+ {
+ Portal portal = hentry->portal;
+
+ if (portal->status == PORTAL_READY)
+ return false;
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5aebbd1..5d9e3bf 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1183,3 +1183,12 @@ DeleteAllExportedSnapshotFiles(void)
FreeDir(s_dir);
}
+
+bool
+ThereAreNoPriorRegisteredSnapshots(void)
+{
+ if (RegisteredSnapshots <= 1)
+ return true;
+
+ return false;
+}
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index fa38803..0381785 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -26,6 +26,7 @@
/* "options" flag bits for heap_insert */
#define HEAP_INSERT_SKIP_WAL 0x0001
#define HEAP_INSERT_SKIP_FSM 0x0002
+#define HEAP_INSERT_HINT_XMIN 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
diff --git a/src/include/utils/portal.h b/src/include/utils/portal.h
index 4833942..bd2b133 100644
--- a/src/include/utils/portal.h
+++ b/src/include/utils/portal.h
@@ -219,5 +219,6 @@ extern void PortalDefineQuery(Portal portal,
extern Node *PortalListGetPrimaryStmt(List *stmts);
extern void PortalCreateHoldStore(Portal portal);
extern void PortalHashTableDeleteAll(void);
+extern bool ThereAreNoReadyPortals(void);
#endif /* PORTAL_H */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index f195981..789b72e 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -46,5 +46,6 @@ extern Datum pg_export_snapshot(PG_FUNCTION_ARGS);
extern void ImportSnapshot(const char *idstr);
extern bool XactHasExportedSnapshots(void);
extern void DeleteAllExportedSnapshotFiles(void);
+extern bool ThereAreNoPriorRegisteredSnapshots(void);
#endif /* SNAPMGR_H */
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 8e2bc0c..b9626e5 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -239,6 +239,70 @@ a\.
\.b
c\.d
"\."
+CREATE TABLE vistest (LIKE testeoc);
+BEGIN;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+SELECT * FROM vistest;
+ a
+---
+ a
+ b
+ c
+(3 rows)
+
+SAVEPOINT s1;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+SELECT * FROM vistest;
+ a
+---
+ d
+ e
+ f
+(3 rows)
+
+SAVEPOINT s2;
+TRUNCATE vistest;
+SAVEPOINT s3;
+COPY vistest FROM stdin CSV;
+SELECT * FROM vistest;
+ a
+---
+ j
+ k
+ l
+(3 rows)
+
+ROLLBACK TO SAVEPOINT s2;
+SELECT * FROM vistest;
+ a
+---
+ d
+ e
+ f
+(3 rows)
+
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+SELECT * FROM vistest;
+ a
+---
+ x
+ y
+ z
+(3 rows)
+
+COMMIT;
+SELECT * FROM vistest;
+ a
+---
+ x
+ y
+ z
+(3 rows)
+
+DROP TABLE vistest;
DROP TABLE x, y;
DROP FUNCTION fn_x_before();
DROP FUNCTION fn_x_after();
diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql
index 6322c8f..3d2c376 100644
--- a/src/test/regress/sql/copy2.sql
+++ b/src/test/regress/sql/copy2.sql
@@ -164,6 +164,45 @@ c\.d
COPY testeoc TO stdout CSV;
+CREATE TABLE vistest (LIKE testeoc);
+BEGIN;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+a
+b
+c
+\.
+SELECT * FROM vistest;
+SAVEPOINT s1;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+d
+e
+f
+\.
+SELECT * FROM vistest;
+SAVEPOINT s2;
+TRUNCATE vistest;
+SAVEPOINT s3;
+COPY vistest FROM stdin CSV;
+j
+k
+l
+\.
+SELECT * FROM vistest;
+ROLLBACK TO SAVEPOINT s2;
+SELECT * FROM vistest;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+x
+y
+z
+\.
+SELECT * FROM vistest;
+COMMIT;
+SELECT * FROM vistest;
+
+DROP TABLE vistest;
DROP TABLE x, y;
DROP FUNCTION fn_x_before();
DROP FUNCTION fn_x_after();
On Fri, Mar 2, 2012 at 11:15 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
Simon Riggs <simon@2ndQuadrant.com> wrote:
I like Marti's idea. At present, making it work could easily mean
the checksums patch sinks, so I'm not sure whether to attempt to
work that out in detail.
For my part, improving bulk load performance and TRUNCATE
transactional semantics would trump checksums. If we have to defer
one or the other to a later release, I would prefer that we bump
checksums.
While I understand the desire for checksums, and understand others
have had situations where they would have helped, so far I have yet
to run into a situation where I feel they would have helped me. The
hint bit and freeze issues require ongoing attention and have real
performance consequences on a regular basis. And closing the window
for odd transactional semantics on TRUNCATE versus DELETE of all
rows in a table would be one less thing to worry about, in my world.
Since you supported this, I've invested the time to make it work. It
doesn't look like it needs to be either-or.
Please review the safe_truncate.v2.patch and
copy_autofrozen.v359.patch, copied here to assist testing and
inspection.
At present those patches handle only the TRUNCATE/COPY optimisation
but we could easily add CTAS, CREATE/COPY, CLUSTER, VACUUM FULL etc.
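Sketching those other candidates - each rewrites into a brand-new
relfilenode, so the same pre-freeze treatment could apply:

  CREATE TABLE t2 AS SELECT * FROM t1;                       -- CTAS
  BEGIN; CREATE TABLE t3 (id int); COPY t3 FROM ...; COMMIT; -- CREATE/COPY
  CLUSTER t1 USING t1_pkey;
  VACUUM FULL t1;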
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
copy_autofrozen.v359.patch (text/x-diff)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c910863..b891327 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2050,12 +2050,27 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
tup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
tup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
tup->t_data->t_infomask |= HEAP_XMAX_INVALID;
- HeapTupleHeaderSetXmin(tup->t_data, xid);
HeapTupleHeaderSetCmin(tup->t_data, cid);
HeapTupleHeaderSetXmax(tup->t_data, 0); /* for cleanliness */
tup->t_tableOid = RelationGetRelid(relation);
/*
+ * If we are inserting into a new relation invisible as yet to other
+ * backends and our session has no prior snapshots and no ready portals
+ * then we can also pre-freeze the rows we are inserting. The
+ * last two restrictions ensure that HeapTupleSatisfiesMVCC() gives
+ * the right answer if the current transaction inspects the data after
+ * we load it.
+ */
+ if (options & HEAP_INSERT_HINT_XMIN)
+ {
+ tup->t_data->t_infomask |= HEAP_XMIN_COMMITTED;
+ HeapTupleHeaderSetXmin(tup->t_data, FrozenTransactionId);
+ }
+ else
+ HeapTupleHeaderSetXmin(tup->t_data, xid);
+
+ /*
* If the new tuple is too big for storage or contains already toasted
* out-of-line attributes from some other relation, invoke the toaster.
*/
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 110480f..a34f072 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -43,6 +43,7 @@
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
+#include "utils/portal.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
@@ -1922,6 +1923,21 @@ CopyFrom(CopyState cstate)
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
hi_options |= HEAP_INSERT_SKIP_WAL;
+
+ if (ThereAreNoPriorRegisteredSnapshots() &&
+ ThereAreNoReadyPortals())
+ {
+ SubTransactionId currxid = GetCurrentSubTransactionId();
+
+ /*
+ * If the relfilenode has been created in this subtransaction
+ * then we can further optimise the data load by setting hint
+ * bits and pre-freezing tuples.
+ */
+ if (cstate->rel->rd_createSubid == currxid ||
+ cstate->rel->rd_newRelfilenodeSubid == currxid)
+ hi_options |= HEAP_INSERT_HINT_XMIN;
+ }
}
/*
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index cfb73c1..24075db 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -1055,3 +1055,22 @@ pg_cursor(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+bool
+ThereAreNoReadyPortals(void)
+{
+ HASH_SEQ_STATUS status;
+ PortalHashEnt *hentry;
+
+ hash_seq_init(&status, PortalHashTable);
+
+ while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL)
+ {
+ Portal portal = hentry->portal;
+
+ if (portal->status == PORTAL_READY)
+ return false;
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5aebbd1..5d9e3bf 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1183,3 +1183,12 @@ DeleteAllExportedSnapshotFiles(void)
FreeDir(s_dir);
}
+
+bool
+ThereAreNoPriorRegisteredSnapshots(void)
+{
+ if (RegisteredSnapshots <= 1)
+ return true;
+
+ return false;
+}
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index fa38803..0381785 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -26,6 +26,7 @@
/* "options" flag bits for heap_insert */
#define HEAP_INSERT_SKIP_WAL 0x0001
#define HEAP_INSERT_SKIP_FSM 0x0002
+#define HEAP_INSERT_HINT_XMIN 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
diff --git a/src/include/utils/portal.h b/src/include/utils/portal.h
index 4833942..bd2b133 100644
--- a/src/include/utils/portal.h
+++ b/src/include/utils/portal.h
@@ -219,5 +219,6 @@ extern void PortalDefineQuery(Portal portal,
extern Node *PortalListGetPrimaryStmt(List *stmts);
extern void PortalCreateHoldStore(Portal portal);
extern void PortalHashTableDeleteAll(void);
+extern bool ThereAreNoReadyPortals(void);
#endif /* PORTAL_H */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index f195981..789b72e 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -46,5 +46,6 @@ extern Datum pg_export_snapshot(PG_FUNCTION_ARGS);
extern void ImportSnapshot(const char *idstr);
extern bool XactHasExportedSnapshots(void);
extern void DeleteAllExportedSnapshotFiles(void);
+extern bool ThereAreNoPriorRegisteredSnapshots(void);
#endif /* SNAPMGR_H */
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 8e2bc0c..b9626e5 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -239,6 +239,70 @@ a\.
\.b
c\.d
"\."
+CREATE TABLE vistest (LIKE testeoc);
+BEGIN;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+SELECT * FROM vistest;
+ a
+---
+ a
+ b
+ c
+(3 rows)
+
+SAVEPOINT s1;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+SELECT * FROM vistest;
+ a
+---
+ d
+ e
+ f
+(3 rows)
+
+SAVEPOINT s2;
+TRUNCATE vistest;
+SAVEPOINT s3;
+COPY vistest FROM stdin CSV;
+SELECT * FROM vistest;
+ a
+---
+ j
+ k
+ l
+(3 rows)
+
+ROLLBACK TO SAVEPOINT s2;
+SELECT * FROM vistest;
+ a
+---
+ d
+ e
+ f
+(3 rows)
+
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+SELECT * FROM vistest;
+ a
+---
+ x
+ y
+ z
+(3 rows)
+
+COMMIT;
+SELECT * FROM vistest;
+ a
+---
+ x
+ y
+ z
+(3 rows)
+
+DROP TABLE vistest;
DROP TABLE x, y;
DROP FUNCTION fn_x_before();
DROP FUNCTION fn_x_after();
diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql
index 6322c8f..3d2c376 100644
--- a/src/test/regress/sql/copy2.sql
+++ b/src/test/regress/sql/copy2.sql
@@ -164,6 +164,45 @@ c\.d
COPY testeoc TO stdout CSV;
+CREATE TABLE vistest (LIKE testeoc);
+BEGIN;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+a
+b
+c
+\.
+SELECT * FROM vistest;
+SAVEPOINT s1;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+d
+e
+f
+\.
+SELECT * FROM vistest;
+SAVEPOINT s2;
+TRUNCATE vistest;
+SAVEPOINT s3;
+COPY vistest FROM stdin CSV;
+j
+k
+l
+\.
+SELECT * FROM vistest;
+ROLLBACK TO SAVEPOINT s2;
+SELECT * FROM vistest;
+TRUNCATE vistest;
+COPY vistest FROM stdin CSV;
+x
+y
+z
+\.
+SELECT * FROM vistest;
+COMMIT;
+SELECT * FROM vistest;
+
+DROP TABLE vistest;
DROP TABLE x, y;
DROP FUNCTION fn_x_before();
DROP FUNCTION fn_x_after();
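To see why the guard conditions in CopyFrom() matter, here is a
minimal sketch of one case they exclude (table and cursor names are
illustrative): any portal still in the READY state when COPY starts
means the session could later run a query with an older snapshot, so
ThereAreNoReadyPortals() returns false and the load falls back to the
normal, unhinted path:

BEGIN;
DECLARE c CURSOR FOR SELECT * FROM other_table;  -- leaves a READY portal
TRUNCATE vistest;
COPY vistest FROM '/tmp/data.csv' CSV;  -- loads fine, but without
                                        -- HEAP_INSERT_HINT_XMIN
FETCH ALL FROM c;   -- cursor's snapshot predates the COPY
COMMIT;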
safe_truncate.v2.patch (text/x-diff; charset=US-ASCII)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c910863..4387f49 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -73,7 +73,7 @@
/* GUC variable */
bool synchronize_seqscans = true;
-
+bool StrictMVCC = true;
static HeapScanDesc heap_beginscan_internal(Relation relation,
Snapshot snapshot,
@@ -1175,6 +1175,24 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
HeapScanDesc scan;
/*
+ * If the snapshot is older than relvalidxmin then either the table
+ * has only recently been created or a TRUNCATE has removed data
+ * that we should still be able to see. Either way, we aren't allowed
+ * to view the rows in StrictMVCC mode.
+ */
+ if (StrictMVCC &&
+ TransactionIdIsValid(relation->rd_rel->relvalidxmin) &&
+ TransactionIdIsValid(snapshot->xmax) &&
+ NormalTransactionIdPrecedes(snapshot->xmax, relation->rd_rel->relvalidxmin))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("canceling statement due to conflict with TRUNCATE TABLE on %s",
+ NameStr(relation->rd_rel->relname)),
+ errdetail("User query attempting to see row versions that have been removed.")));
+ }
+
+ /*
* increment relation ref count while scanning relation
*
* This is just to make really sure the relcache entry won't go away while
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index aef410a..3f9bd5d 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -787,6 +787,7 @@ InsertPgClassTuple(Relation pg_class_desc,
values[Anum_pg_class_relhastriggers - 1] = BoolGetDatum(rd_rel->relhastriggers);
values[Anum_pg_class_relhassubclass - 1] = BoolGetDatum(rd_rel->relhassubclass);
values[Anum_pg_class_relfrozenxid - 1] = TransactionIdGetDatum(rd_rel->relfrozenxid);
+ values[Anum_pg_class_relvalidxmin - 1] = TransactionIdGetDatum(rd_rel->relvalidxmin);
if (relacl != (Datum) 0)
values[Anum_pg_class_relacl - 1] = relacl;
else
@@ -884,6 +885,7 @@ AddNewRelationTuple(Relation pg_class_desc,
new_rel_reltup->relfrozenxid = InvalidTransactionId;
}
+ new_rel_reltup->relvalidxmin = InvalidTransactionId;
new_rel_reltup->relowner = relowner;
new_rel_reltup->reltype = new_type_oid;
new_rel_reltup->reloftype = reloftype;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bfbe642..0d96a6a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2854,7 +2854,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks)
}
/* We'll build a new physical relation for the index */
- RelationSetNewRelfilenode(iRel, InvalidTransactionId);
+ RelationSetNewRelfilenode(iRel, InvalidTransactionId, InvalidTransactionId);
/* Initialize the index and rebuild */
/* Note: we do not need to re-establish pkey setting */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index b40e57b..0578241 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -82,6 +82,8 @@ static int elevel = -1;
static MemoryContext anl_context = NULL;
+static TransactionId OldestXmin;
+
static BufferAccessStrategy vac_strategy;
@@ -538,7 +540,8 @@ do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, bool inh)
totalrows,
visibilitymap_count(onerel),
hasindex,
- InvalidTransactionId);
+ InvalidTransactionId,
+ OldestXmin);
/*
* Same for indexes. Vacuum always scans all indexes, so if we're part of
@@ -558,6 +561,7 @@ do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, bool inh)
totalindexrows,
0,
false,
+ InvalidTransactionId,
InvalidTransactionId);
}
}
@@ -1025,7 +1029,6 @@ acquire_sample_rows(Relation onerel, HeapTuple *rows, int targrows,
double deadrows = 0; /* # dead rows seen */
double rowstoskip = -1; /* -1 means not set yet */
BlockNumber totalblocks;
- TransactionId OldestXmin;
BlockSamplerData bs;
double rstate;
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 1f95abc..ab4aec2 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -287,7 +287,7 @@ ResetSequence(Oid seq_relid)
* Create a new storage file for the sequence. We want to keep the
* sequence's relfrozenxid at 0, since it won't contain any unfrozen XIDs.
*/
- RelationSetNewRelfilenode(seq_rel, InvalidTransactionId);
+ RelationSetNewRelfilenode(seq_rel, InvalidTransactionId, InvalidTransactionId);
/*
* Insert the modified tuple into the new storage file.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index cd4490a..aa9d2e9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -19,6 +19,7 @@
#include "access/reloptions.h"
#include "access/relscan.h"
#include "access/sysattr.h"
+#include "access/transam.h"
#include "access/xact.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
@@ -1121,7 +1122,7 @@ ExecuteTruncate(TruncateStmt *stmt)
* as the relfilenode value. The old storage file is scheduled for
* deletion at commit.
*/
- RelationSetNewRelfilenode(rel, RecentXmin);
+ RelationSetNewRelfilenode(rel, RecentXmin, GetCurrentTransactionId());
heap_relid = RelationGetRelid(rel);
toast_relid = rel->rd_rel->reltoastrelid;
@@ -1132,7 +1133,12 @@ ExecuteTruncate(TruncateStmt *stmt)
if (OidIsValid(toast_relid))
{
rel = relation_open(toast_relid, AccessExclusiveLock);
- RelationSetNewRelfilenode(rel, RecentXmin);
+ RelationSetNewRelfilenode(rel, RecentXmin, InvalidTransactionId);
+
+ /*
+ * We don't need a cmin here because there can't be any open
+ * cursors in the same session as the TRUNCATE command.
+ */
heap_close(rel, NoLock);
}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 353af50..7609e59 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -573,7 +573,8 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
- bool hasindex, TransactionId frozenxid)
+ bool hasindex, TransactionId frozenxid,
+ TransactionId oldestxmin)
{
Oid relid = RelationGetRelid(relation);
Relation rd;
@@ -647,6 +648,17 @@ vac_update_relstats(Relation relation,
dirty = true;
}
+ /*
+ * Reset relvalidxmin if it is old enough
+ */
+ if (TransactionIdIsValid(oldestxmin) &&
+ TransactionIdIsNormal(pgcform->relvalidxmin) &&
+ TransactionIdPrecedes(pgcform->relvalidxmin, oldestxmin))
+ {
+ pgcform->relvalidxmin = InvalidTransactionId;
+ dirty = true;
+ }
+
/* If anything changed, write out the tuple. */
if (dirty)
heap_inplace_update(rd, ctup);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2fc749e..46d6ab8 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -255,7 +255,8 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
new_rel_tuples,
new_rel_allvisible,
vacrelstats->hasindex,
- new_frozen_xid);
+ new_frozen_xid,
+ OldestXmin);
/* report results to the stats collector, too */
pgstat_report_vacuum(RelationGetRelid(onerel),
@@ -1185,6 +1186,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_index_tuples,
0,
false,
+ InvalidTransactionId,
InvalidTransactionId);
ereport(elevel,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index a59950e..17303e8 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2605,7 +2605,8 @@ RelationBuildLocalRelation(const char *relname,
* the XIDs that will be put into the new relation contents.
*/
void
-RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid)
+RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid,
+ TransactionId validxmin)
{
Oid newrelfilenode;
RelFileNodeBackend newrnode;
@@ -2674,6 +2675,10 @@ RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid)
}
classform->relfrozenxid = freezeXid;
+ if (validxmin != InvalidTransactionId &&
+ TransactionIdPrecedes(classform->relvalidxmin, validxmin))
+ classform->relvalidxmin = validxmin;
+
simple_heap_update(pg_class, &tuple->t_self, tuple);
CatalogUpdateIndexes(pg_class, tuple);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 486bdcd..1561889 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1430,6 +1430,19 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"strict_mvcc", PGC_POSTMASTER, COMPAT_OPTIONS_PREVIOUS,
+ gettext_noop("Enables backward compatibility mode of strict MVCC "
+ "for TRUNCATE and CREATE TABLE."),
+ gettext_noop("When enabled viewing a truncated or newly created table "
+ "will throw a serialization error to prevent MVCC "
+ "discrepancies that were possible priot to 9.2.")
+ },
+ &StrictMVCC,
+ true,
+ NULL, NULL, NULL
+ },
+
+ {
{"quote_all_identifiers", PGC_USERSET, COMPAT_OPTIONS_PREVIOUS,
gettext_noop("When generating SQL fragments, quote all identifiers."),
NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 96da086..f56e374 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -548,6 +548,7 @@
#quote_all_identifiers = off
#sql_inheritance = on
#standard_conforming_strings = on
+#strict_mvcc = on
#synchronize_seqscans = on
# - Other Platforms and Clients -
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 0335347..223f157 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201203021
+#define CATALOG_VERSION_NO 201203031
#endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 1567206..d6ee70c 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -67,6 +67,7 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
bool relhastriggers; /* has (or has had) any TRIGGERs */
bool relhassubclass; /* has (or has had) derived classes */
TransactionId relfrozenxid; /* all Xids < this are frozen in this rel */
+ TransactionId relvalidxmin; /* data is only valid for snapshots > this Xid */
#ifdef CATALOG_VARLEN /* variable-length fields start here */
/* NOTE: These fields are not present in a relcache entry's rd_rel field. */
@@ -77,7 +78,7 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
/* Size of fixed part of pg_class tuples, not counting var-length fields */
#define CLASS_TUPLE_SIZE \
- (offsetof(FormData_pg_class,relfrozenxid) + sizeof(TransactionId))
+ (offsetof(FormData_pg_class,relvalidxmin) + sizeof(TransactionId))
/* ----------------
* Form_pg_class corresponds to a pointer to a tuple with
@@ -91,7 +92,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 27
+#define Natts_pg_class 28
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -117,8 +118,9 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relhastriggers 23
#define Anum_pg_class_relhassubclass 24
#define Anum_pg_class_relfrozenxid 25
-#define Anum_pg_class_relacl 26
-#define Anum_pg_class_reloptions 27
+#define Anum_pg_class_relvalidxmin 26
+#define Anum_pg_class_relacl 27
+#define Anum_pg_class_reloptions 28
/* ----------------
* initial contents of pg_class
@@ -130,13 +132,13 @@ typedef FormData_pg_class *Form_pg_class;
*/
/* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId */
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f 3 0 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f 3 0 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 27 0 t f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 27 0 t f f f f 3 0 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 27 0 t f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f 3 0 _null_ _null_ ));
DESCR("");
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4526648..fa79231 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -151,7 +151,8 @@ extern void vac_update_relstats(Relation relation,
double num_tuples,
BlockNumber num_all_visible_pages,
bool hasindex,
- TransactionId frozenxid);
+ TransactionId frozenxid,
+ TransactionId oldestxmin);
extern void vacuum_set_xid_limits(int freeze_min_age, int freeze_table_age,
bool sharedRel,
TransactionId *oldestXmin,
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 610cb59..9d2be31 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -342,6 +342,8 @@ extern ProcessingMode Mode;
#define GetProcessingMode() Mode
+/* in access/heap/heapam.c */
+extern bool StrictMVCC;
/*****************************************************************************
* pinit.h -- *
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 2f39486..60c2eb6 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -76,7 +76,8 @@ extern Relation RelationBuildLocalRelation(const char *relname,
* Routine to manage assignment of new relfilenode to a relation
*/
extern void RelationSetNewRelfilenode(Relation relation,
- TransactionId freezeXid);
+ TransactionId freezeXid,
+ TransactionId validxmin);
/*
* Routines for flushing/rebuilding relcache entries in various scenarios
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 669c0f2..bba722a 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -14,3 +14,4 @@ test: fk-contention
test: fk-deadlock
test: fk-deadlock2
test: eval-plan-qual
+test: truncate
diff --git a/src/test/isolation/expected/truncate.out b/src/test/isolation/expected/truncate.out
new file mode 100644
index ...2b9f161
*** a/src/test/isolation/expected/truncate.out
--- b/src/test/isolation/expected/truncate.out
***************
*** 0 ****
--- 1,63 ----
+ Parsed test spec with 2 sessions
+
+ starting permutation: begin select roll trunc
+ step begin:
+ BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+ -- Force snapshot open
+ SELECT 1 AS foo FROM txid_current();
+
+ foo
+
+ 1
+ step select: SELECT * FROM foo;
+ i
+
+ 1
+ step roll: ROLLBACK;
+ step trunc: TRUNCATE TABLE foo;
+
+ starting permutation: begin select trunc roll
+ step begin:
+ BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+ -- Force snapshot open
+ SELECT 1 AS foo FROM txid_current();
+
+ foo
+
+ 1
+ step select: SELECT * FROM foo;
+ i
+
+ 1
+ step trunc: TRUNCATE TABLE foo; <waiting ...>
+ step roll: ROLLBACK;
+ step trunc: <... completed>
+
+ starting permutation: begin trunc select roll
+ step begin:
+ BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+ -- Force snapshot open
+ SELECT 1 AS foo FROM txid_current();
+
+ foo
+
+ 1
+ step trunc: TRUNCATE TABLE foo;
+ step select: SELECT * FROM foo;
+ ERROR: canceling statement due to conflict with TRUNCATE TABLE on foo
+ step roll: ROLLBACK;
+
+ starting permutation: trunc begin select roll
+ step trunc: TRUNCATE TABLE foo;
+ step begin:
+ BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+ -- Force snapshot open
+ SELECT 1 AS foo FROM txid_current();
+
+ foo
+
+ 1
+ step select: SELECT * FROM foo;
+ i
+
+ step roll: ROLLBACK;
diff --git a/src/test/isolation/specs/truncate.spec b/src/test/isolation/specs/truncate.spec
new file mode 100644
index ...2c7b707
*** a/src/test/isolation/specs/truncate.spec
--- b/src/test/isolation/specs/truncate.spec
***************
*** 0 ****
--- 1,23 ----
+ setup
+ {
+ CREATE TABLE foo (i int);
+ INSERT INTO foo VALUES (1);
+ }
+
+ teardown
+ {
+ DROP TABLE foo;
+ }
+
+ session "s1"
+ step "begin"
+ {
+ BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+ -- Force snapshot open
+ SELECT 1 AS foo FROM txid_current();
+ }
+ step "select" { SELECT * FROM foo; }
+ step "roll" { ROLLBACK; }
+
+ session "s2"
+ step "trunc" { TRUNCATE TABLE foo; }
Simon Riggs wrote:
> Kevin Grittner wrote:
>> Simon Riggs wrote:
>>> I like Marti's idea. At present, making his idea work could
>>> easily mean checksums sink
>> For my part, improving bulk load performance and TRUNCATE
>> transactional semantics would trump checksums
> It doesn't look like it needs to be either-or.

Great news!

> Please review the safe_truncate.v2.patch and
> copy_autofrozen.v359.patch, copied here to assist testing and
> inspection.

I'll look at it today.

> At present those patches handle only the TRUNCATE/COPY optimisation
> but we could easily add CTAS, CREATE/COPY, CLUSTER, VACUUM FULL etc.
CREATE/COPY would be important so that pg_dump | psql -1 would
benefit.
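That case is just CREATE TABLE followed immediately by COPY inside
the single transaction that psql -1 opens; a minimal sketch of what
such a restore replays (schema is illustrative):

BEGIN;                  -- psql -1 wraps the whole dump in one transaction
CREATE TABLE t (i int);
COPY t (i) FROM stdin;
1
2
\.
COMMIT;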
-Kevin